Estimation of the quality based on absence/presence of words in the TagParser lexicon.

| year | nb of papers from the metadata | nb of papers in PDF | nb of papers in XML (= output of PDFBox) | nb of non empty papers as extraction result | nb of papers with an abstract (from extraction) | nb of papers with references (from extraction) | nb of unknown words | nb of known words | nb of words of the content | evaluation of noise = percentage of nb of known words / nb of words of the content | evaluation of silence = percentage of non empty papers as extraction result / PDF docs | combined evaluation of noise and silence | nb of English papers | nb of French papers | nb of papers in another language (es+de+ru) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1998 | 212 | 212 | 207 | 207 | 165 | 193 | 26197 | 854662 | 880859 | 97.026 | 97.642 | 97.333 | 207 | 0 | 0 |
| 2000 | 280 | 280 | 270 | 270 | 223 | 245 | 22078 | 1093881 | 1115959 | 98.022 | 96.429 | 97.219 | 267 | 2 | 1 |
| 2002 | 354 | 354 | 346 | 346 | 316 | 332 | 28214 | 1486964 | 1515178 | 98.138 | 97.740 | 97.939 | 346 | 0 | 0 |
| 2004 | 517 | 517 | 508 | 508 | 460 | 493 | 27487 | 1556574 | 1584061 | 98.265 | 98.259 | 98.262 | 508 | 0 | 0 |
| 2006 | 514 | 514 | 512 | 512 | 480 | 506 | 34353 | 1917685 | 1952038 | 98.240 | 99.611 | 98.921 | 512 | 0 | 0 |
| 2008 | 620 | 620 | 616 | 616 | 585 | 612 | 47506 | 2446429 | 2493935 | 98.095 | 99.355 | 98.721 | 616 | 0 | 0 |
| 2010 | 641 | 641 | 638 | 638 | 601 | 634 | 52393 | 2625293 | 2677686 | 98.043 | 99.532 | 98.782 | 638 | 0 | 0 |
| 2012 | 670 | 670 | 670 | 670 | 664 | 670 | 55268 | 2825887 | 2881155 | 98.082 | 100.000 | 99.032 | 670 | 0 | 0 |
| 2014 | 744 | 744 | 739 | 739 | 738 | 736 | 64797 | 3081978 | 3146775 | 97.941 | 99.328 | 98.630 | 739 | 0 | 0 |
| total | 4552 | 4552 | 4506 | 4506 | 4232 | 4421 | 358293 | 17889353 | 18247646 | 98.036 | 98.989 | 98.511 | 4503 | 2 | 1 |

Note#1: unknown words with an initial lower-case letter are sometimes genuine full words missing from the lexicon, but more often they are word fragments cut off by PDFBox's incorrect interpretation of multi-column PDF layouts.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is the harmonic mean of the two scores, computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
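
As an illustration, the three scores for 1998 can be recomputed from the counts in the table. This is a minimal sketch, not the original implementation: the function and variable names are hypothetical, and the formula of Note#4 is read as a harmonic mean, with the sum EvalNoise+EvalSilence as the whole denominator, which is the reading that reproduces the values in the table.

```python
def percentage(part, whole):
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


def combined_evaluation(eval_noise, eval_silence):
    """Harmonic mean of the two scores (assumed reading of Note#4)."""
    return 2 * eval_noise * eval_silence / (eval_noise + eval_silence)


# Counts taken from the 1998 row of the table.
known_words = 854662
content_words = 880859
non_empty_papers = 207
pdf_papers = 212

eval_noise = percentage(known_words, content_words)
eval_silence = percentage(non_empty_papers, pdf_papers)
combined = combined_evaluation(eval_noise, eval_silence)

print(f"{eval_noise:.3f} {eval_silence:.3f} {combined:.3f}")
# prints: 97.026 97.642 97.333
```

The harmonic mean is at most the arithmetic mean and is pulled toward the lower of the two scores, so a year with good word coverage but many empty extraction results (or the reverse) gets a lower combined evaluation than a plain average would give.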

total elapsed time (read and display included)= 8.83805 minutes with 8 cores

computed on: Sun Sep 25 19:24:04 CEST 2016 from: METADONNEES_LREC_140319.txt