Estimation of the quality based on absence/presence of words in the TagParser lexicon.

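| year | nb of papers from the metadata | nb of papers in PDF | nb of papers in XML (= output of PDFBox) | nb of non-empty papers as extraction result | nb of papers with an abstract (from extraction) | nb of papers with references (from extraction) | nb of unknown words | nb of known words | nb of words of the content | evaluation of noise = percentage of nb of known words / nb of words of the content | evaluation of silence = percentage of non-empty papers as extraction result / PDF docs | combined evaluation of noise and silence | nb of English papers | nb of French papers | nb of papers in another language (es+de+ru) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1988 | 6 | 6 | 4 | 4 | 2 | 2 | 1165 | 19626 | 20791 | 94.397 | 66.667 | 78.145 | 0 | 4 | 0 |
| 1989 | 10 | 10 | 10 | 10 | 7 | 6 | 2609 | 48440 | 51049 | 94.889 | 100.000 | 97.378 | 0 | 9 | 1 |
| 1990 | 3 | 3 | 3 | 3 | 2 | 2 | 974 | 18071 | 19045 | 94.886 | 100.000 | 97.376 | 0 | 3 | 0 |
| 1991 | 11 | 11 | 8 | 8 | 3 | 4 | 3540 | 47612 | 51152 | 93.079 | 72.727 | 81.654 | 0 | 8 | 0 |
| 1992 | 6 | 6 | 6 | 6 | 4 | 3 | 2906 | 38551 | 41457 | 92.990 | 100.000 | 96.368 | 0 | 6 | 0 |
| 1993 | 11 | 11 | 11 | 11 | 3 | 7 | 2864 | 46091 | 48955 | 94.150 | 100.000 | 96.987 | 0 | 10 | 1 |
| 1994 | 7 | 7 | 6 | 6 | 1 | 2 | 4024 | 55469 | 59493 | 93.236 | 85.714 | 89.317 | 0 | 6 | 0 |
| 1995 | 6 | 6 | 6 | 6 | 2 | 5 | 1534 | 26280 | 27814 | 94.485 | 100.000 | 97.164 | 0 | 5 | 1 |
| 1996 | 4 | 4 | 4 | 4 | 1 | 2 | 517 | 11842 | 12359 | 95.817 | 100.000 | 97.864 | 0 | 4 | 0 |
| 1997 | 9 | 9 | 9 | 9 | 3 | 8 | 2334 | 56623 | 58957 | 96.041 | 100.000 | 97.981 | 0 | 9 | 0 |
| 1998 | 3 | 3 | 3 | 3 | 1 | 3 | 880 | 17870 | 18750 | 95.307 | 100.000 | 97.597 | 0 | 3 | 0 |
| 1999 | 13 | 13 | 12 | 12 | 4 | 6 | 5417 | 78263 | 83680 | 93.527 | 92.308 | 92.913 | 0 | 12 | 0 |
| 2000 | 5 | 5 | 5 | 5 | 3 | 2 | 570 | 9174 | 9744 | 94.150 | 100.000 | 96.987 | 0 | 5 | 0 |
| 2001 | 14 | 14 | 13 | 13 | 3 | 9 | 5291 | 82967 | 88258 | 94.005 | 92.857 | 93.428 | 0 | 13 | 0 |
| 2002 | 3 | 3 | 3 | 3 | 1 | 2 | 549 | 19821 | 20370 | 97.305 | 100.000 | 98.634 | 0 | 3 | 0 |
| 2003 | 4 | 4 | 4 | 4 | 3 | 1 | 621 | 18240 | 18861 | 96.707 | 100.000 | 98.326 | 0 | 4 | 0 |
| 2004 | 6 | 6 | 6 | 6 | 4 | 5 | 1445 | 42652 | 44097 | 96.723 | 100.000 | 98.334 | 1 | 5 | 0 |
| 2005 | 18 | 18 | 15 | 15 | 14 | 12 | 2062 | 71965 | 74027 | 97.215 | 83.333 | 89.740 | 1 | 14 | 0 |
| 2006 | 18 | 18 | 18 | 18 | 7 | 10 | 5796 | 117896 | 123692 | 95.314 | 100.000 | 97.601 | 1 | 17 | 0 |
| 2007 | 29 | 29 | 19 | 19 | 8 | 8 | 1895 | 41710 | 43605 | 95.654 | 65.517 | 77.768 | 5 | 6 | 8 |
| 2008 | 21 | 21 | 19 | 19 | 15 | 16 | 3455 | 145926 | 149381 | 97.687 | 90.476 | 93.943 | 0 | 19 | 0 |
| 2009 | 9 | 9 | 9 | 9 | 4 | 1 | 4982 | 72061 | 77043 | 93.533 | 100.000 | 96.659 | 0 | 9 | 0 |
| 2010 | 16 | 16 | 14 | 14 | 7 | 8 | 2053 | 56285 | 58338 | 96.481 | 87.500 | 91.771 | 1 | 13 | 0 |
| total | 232 | 232 | 207 | 207 | 102 | 124 | 57483 | 1143435 | 1200918 | 95.213 | 89.224 | 92.122 | 9 | 187 | 11 |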

Note#1: unknown words with an initial lower-case letter are sometimes genuine full words missing from the lexicon, but often they are fragments of words that were cut because PDFBox misinterpreted a multi-column PDF layout.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.
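
The selection rule of Note#2 can be sketched as follows. This is only an illustration: the Paper record, its field names and the keep_in_pipeline function are hypothetical and are not taken from the actual pipeline, and treating a whitespace-only body as empty is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical record; the field names are illustrative only.
    paper_id: str
    abstract: str
    body: str


def keep_in_pipeline(paper: Paper) -> bool:
    """Keep a paper only if extraction produced a non-empty body.

    A missing abstract does not exclude the paper.
    Assumption: a body made only of whitespace counts as empty.
    """
    return bool(paper.body.strip())


papers = [
    Paper("p1", abstract="", body="Some extracted text."),  # no abstract: kept
    Paper("p2", abstract="Summary.", body=""),              # no body: dropped
    Paper("p3", abstract="", body="   "),                   # whitespace only: dropped
]
kept = [p.paper_id for p in papers if keep_in_pipeline(p)]
print(kept)
```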

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is the harmonic mean of the two scores, computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
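
As a minimal sketch of how the three scores relate, the following Python functions recompute them from the raw counts, following the column definitions of the table and the formula of Note#4. The function and parameter names are assumptions made for this illustration; they are not the names used in the original program.

```python
def noise_score(known_words: int, content_words: int) -> float:
    """Percentage of the words of the content that are known to the lexicon."""
    return 100.0 * known_words / content_words


def silence_score(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF papers that yield a non-empty extraction result."""
    return 100.0 * non_empty_papers / pdf_papers


def combined_score(noise: float, silence: float) -> float:
    """Harmonic mean of the two scores; the sum is the whole denominator."""
    return 2.0 * noise * silence / (noise + silence)


# Counts taken from the 1988 row of the table.
noise = noise_score(known_words=19626, content_words=20791)
silence = silence_score(non_empty_papers=4, pdf_papers=6)
combined = combined_score(noise, silence)
print(f"{noise:.3f} {silence:.3f} {combined:.3f}")
```

With the 1988 counts this agrees, up to rounding, with the values reported for that year (94.397, 66.667 and 78.145). Because the combined score is a harmonic mean, a low silence score pulls it down sharply even when the noise score is high.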

total elapsed time (read and display included)= 0.8095666666666667 minutes with 8 cores

computed on: Sun Sep 25 19:25:56 CEST 2016 from: METADONNEES_MODULAD_150413.txt