Estimate of extraction quality, based on the presence or absence of words in the TagParser lexicon.

Column key:
  meta      nb of papers from the metadata
  pdf       nb of papers in PDF
  xml       nb of papers in XML (= output of PDFBox)
  nonempty  nb of non-empty papers as extraction result
  abstr     nb of papers with an abstract (from extraction)
  refs      nb of papers with references (from extraction)
  unknown   nb of unknown words
  known     nb of known words
  words     nb of words of the content
  noise     evaluation of noise = percentage of nb of known words / nb of words of the content
  silence   evaluation of silence = percentage of non-empty papers as extraction result / PDF docs
  combined  combined evaluation of noise and silence
  en        nb of English papers
  fr        nb of French papers
  other     nb of papers in another language (es+de+ru)

year  meta  pdf  xml nonempty abstr refs unknown   known   words   noise silence combined  en fr other
1983    32   32   31       31    16   27    6015  147445  153460  96.080  96.875   96.476  31  0     0
1985    40   40   37       37    16   28    6345  185511  191856  96.693  92.500   94.550  37  0     0
1987    50   50   48       48    16   41    6901  216611  223512  96.912  96.000   96.454  48  0     0
1989    42   42   42       42    22   35    6283  212014  218297  97.122 100.000   98.540  42  0     0
1991    55   55   54       54    27   48    7636  234451  242087  96.846  98.182   97.509  54  0     0
1993    66   66   66       66    16   49    9816  366265  376081  97.390 100.000   98.678  65  0     1
1995    45   45   45       45    24   32    6666  230811  237477  97.193 100.000   98.577  45  0     0
1997    73   73   73       73    11   55    8903  378401  387304  97.701 100.000   98.837  73  0     0
1999    52   52   51       51    21   40    4951  189499  194450  97.454  98.077   97.764  51  0     0
2003    84   84   84       84    75   83    5612  315226  320838  98.251 100.000   99.118  84  0     0
2006    52   52   52       52    49   52    4339  283527  287866  98.493 100.000   99.241  52  0     0
2009   100  100  100      100    98   99   11371  606764  618135  98.160 100.000   99.072 100  0     0
2012    85   85   85       85    82   82    8946  521434  530380  98.313 100.000   99.149  85  0     0
2014   124  124  123      123   122  123   10937  591875  602812  98.186  99.194   98.687 123  0     0
total  900  900  891      891   595  794  104721 4479834 4584555  97.716  99.000   98.354 890  0     1

Note#1: unknown words beginning with a lower-case letter are sometimes genuine complete words missing from the lexicon, but they are often word fragments produced when PDFBox misinterprets a multi-column PDF layout.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.
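
The selection rule of Note#2 can be sketched as follows. This is only an illustration: the Paper record, its field names and the sample identifiers are hypothetical and are not taken from the actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical record; the field names are assumptions for illustration.
    paper_id: str
    abstract: str
    body: str


def keep_for_pipeline(paper: Paper) -> bool:
    """Keep a paper only if its body is non-empty; the abstract is not required."""
    return bool(paper.body.strip())


# Made-up sample papers, used only to exercise the rule.
papers = [
    Paper("p1", abstract="An abstract.", body="Some body text."),
    Paper("p2", abstract="", body="Body text without abstract."),
    Paper("p3", abstract="Invited talk.", body="   "),
]

kept = [p.paper_id for p in papers if keep_for_pipeline(p)]
print(kept)
```

Here p2 is kept although it has no abstract, while p3 is dropped because its body contains no text.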

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is the harmonic mean of the two scores, computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
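
As a check of the three scores, here is a minimal sketch in Python. It is not the tool's own code: the function names are hypothetical, and it assumes the denominator of the combined score is the sum EvalNoise+EvalSilence, which reproduces the reported figures. The inputs are the 1983 counts (147445 known words out of 153460 words; 31 non-empty papers out of 32 PDF documents).

```python
def noise_score(known_words: int, total_words: int) -> float:
    """Percentage of content words that are known to the lexicon."""
    return 100.0 * known_words / total_words


def silence_score(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF documents that yielded a non-empty extraction."""
    return 100.0 * non_empty_papers / pdf_papers


def combined_score(noise: float, silence: float) -> float:
    """Harmonic mean of the noise and silence scores."""
    return 2.0 * noise * silence / (noise + silence)


# Counts for 1983, taken from the results above.
noise = noise_score(147445, 153460)
silence = silence_score(31, 32)
combined = combined_score(noise, silence)

print(f"{noise:.3f} {silence:.3f} {combined:.3f}")  # 96.080 96.875 96.476
```

Without the parentheses around the sum, operator precedence would give 2*EvalSilence + EvalSilence instead, which does not match the reported values.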

total elapsed time (read and display included)= 2.4503166666666667 minutes with 8 cores

computed on: Sun Sep 25 18:16:06 CEST 2016 from: METADONNEES_EACL_140425.txt