Estimation of extraction quality, based on the presence or absence of words in the TagParser lexicon.

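| year | nb of papers from the metadata | nb of papers in PDF | nb of papers in XML (= output of PDFBox) | nb of non empty papers as extraction result | nb of papers with an abstract (from extraction) | nb of papers with references (from extraction) | nb of unknown words | nb of known words | nb of words of the content | evaluation of noise = percentage of nb of known words / nb of words of the content | evaluation of silence = percentage of non empty papers as extraction result / PDF docs | combined evaluation of noise and silence | nb of English papers | nb of French papers | nb of papers in another language (es+de+ru) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1965 | 24 | 24 | 24 | 24 | 6 | 15 | 5942 | 152112 | 158054 | 96.241 | 100.000 | 98.084 | 22 | 2 | 0 |
| 1967 | 37 | 37 | 28 | 28 | 0 | 9 | 6817 | 145679 | 152496 | 95.530 | 75.676 | 84.451 | 24 | 4 | 0 |
| 1973 | 62 | 62 | 51 | 51 | 0 | 15 | 7317 | 184702 | 192019 | 96.189 | 82.258 | 88.680 | 46 | 5 | 0 |
| 1980 | 92 | 92 | 89 | 89 | 16 | 73 | 13808 | 400838 | 414646 | 96.670 | 96.739 | 96.705 | 87 | 2 | 0 |
| 1982 | 141 | 141 | 107 | 107 | 2 | 56 | 9140 | 253218 | 262358 | 96.516 | 75.887 | 84.967 | 98 | 3 | 6 |
| 1984 | 116 | 116 | 112 | 112 | 41 | 79 | 15638 | 458816 | 474454 | 96.704 | 96.552 | 96.628 | 111 | 0 | 1 |
| 1986 | 156 | 156 | 134 | 134 | 31 | 89 | 26016 | 596226 | 622242 | 95.819 | 85.897 | 90.587 | 131 | 2 | 1 |
| 1988 | 166 | 166 | 146 | 146 | 34 | 86 | 33103 | 694734 | 727837 | 95.452 | 87.952 | 91.548 | 146 | 0 | 0 |
| 1990 | 195 | 195 | 189 | 189 | 45 | 128 | 26446 | 691155 | 717601 | 96.315 | 96.923 | 96.618 | 188 | 1 | 0 |
| 1992 | 206 | 206 | 176 | 176 | 39 | 109 | 43600 | 837709 | 881309 | 95.053 | 85.437 | 89.989 | 172 | 4 | 0 |
| 1994 | 212 | 212 | 142 | 142 | 46 | 77 | 39486 | 624726 | 664212 | 94.055 | 66.981 | 78.242 | 142 | 0 | 0 |
| 1996 | 213 | 213 | 196 | 196 | 62 | 111 | 51706 | 905211 | 956917 | 94.597 | 92.019 | 93.290 | 194 | 1 | 1 |
| 1998 | 245 | 245 | 243 | 243 | 74 | 191 | 24592 | 1009137 | 1033729 | 97.621 | 99.184 | 98.396 | 242 | 1 | 0 |
| 2000 | 175 | 175 | 166 | 166 | 150 | 162 | 17254 | 691548 | 708802 | 97.566 | 94.857 | 96.192 | 165 | 1 | 0 |
| 2002 | 198 | 198 | 155 | 155 | 140 | 144 | 12833 | 632063 | 644896 | 98.010 | 78.283 | 87.043 | 155 | 0 | 0 |
| 2004 | 204 | 204 | 183 | 183 | 174 | 178 | 14862 | 821541 | 836403 | 98.223 | 89.706 | 93.771 | 182 | 0 | 1 |
| 2006 | 272 | 272 | 270 | 270 | 267 | 269 | 24118 | 1398064 | 1422182 | 98.304 | 99.265 | 98.782 | 270 | 0 | 0 |
| 2008 | 195 | 195 | 193 | 193 | 187 | 191 | 16215 | 902539 | 918754 | 98.235 | 98.974 | 98.603 | 193 | 0 | 0 |
| 2010 | 350 | 350 | 350 | 350 | 340 | 348 | 32776 | 1793043 | 1825819 | 98.205 | 100.000 | 99.094 | 350 | 0 | 0 |
| 2012 | 332 | 332 | 329 | 329 | 327 | 327 | 38048 | 1933615 | 1971663 | 98.070 | 99.096 | 98.581 | 329 | 0 | 0 |
| 2014 | 221 | 221 | 221 | 221 | 216 | 218 | 26441 | 1295185 | 1321626 | 97.999 | 100.000 | 98.990 | 221 | 0 | 0 |
| total | 3812 | 3812 | 3504 | 3504 | 2197 | 2875 | 486158 | 16421861 | 16908019 | 97.125 | 91.920 | 94.451 | 3468 | 26 | 10 |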

Note#1: unknown words with an initial lower-case letter are sometimes genuine unknown words, but often they are fragments of words that were cut because PDFBox misinterpreted multi-column PDF layouts.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may result from a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is the harmonic mean of the two evaluations, computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
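
As an illustration of how the three evaluation columns relate to the counts, here is a minimal Python sketch. The function name `evaluate` and its parameter names are hypothetical and are not taken from the original pipeline; the input values are those of the 1967 row. It assumes that both denominators are non-zero.

```python
def evaluate(known_words, total_words, non_empty_papers, pdf_papers):
    """Return (noise, silence, combined) as percentages.

    Hypothetical helper, not the pipeline's own code.
    Assumes total_words > 0 and pdf_papers > 0.
    """
    # Noise evaluation: share of content words found in the lexicon.
    noise = 100.0 * known_words / total_words
    # Silence evaluation: share of PDF documents with a non-empty extraction.
    silence = 100.0 * non_empty_papers / pdf_papers
    # Combined evaluation: harmonic mean of the two percentages.
    combined = 2 * noise * silence / (noise + silence)
    return noise, silence, combined


# Counts from the 1967 row: 145679 known words out of 152496,
# and 28 non-empty papers out of 37 PDF documents.
noise, silence, combined = evaluate(145679, 152496, 28, 37)
print(f"{noise:.3f} {silence:.3f} {combined:.3f}")
```

The printed values should be close to the 1967 row of the table; small differences in the last decimal can occur depending on whether the intermediate percentages are rounded before being combined.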

total elapsed time (read and display included)= 9.230916666666667 minutes with 8 cores

computed on: Sun Sep 25 18:07:25 CEST 2016 from: METADONNEES_COLING_140407.txt