Estimation of the quality based on absence/presence of words in the TagParser lexicon.

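| year | nb of papers from the metadata | nb of papers in PDF | nb of papers in XML (= output of PDFBox) | nb of non-empty papers as extraction result | nb of papers with an abstract (from extraction) | nb of papers with references (from extraction) | nb of unknown words | nb of known words | nb of words of the content | evaluation of noise (%) = nb of known words / nb of words of the content | evaluation of silence (%) = non-empty papers as extraction result / PDF docs | combined evaluation of noise and silence (%) | nb of English papers | nb of French papers | nb of papers in another language (es+de+ru) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1987 | 23 | 23 | 23 | 23 | 0 | 8 | 401 | 44501 | 44902 | 99.107 | 100.000 | 99.551 | 22 | 0 | 1 |
| 1989 | 25 | 25 | 25 | 25 | 0 | 7 | 435 | 47060 | 47495 | 99.084 | 100.000 | 99.540 | 25 | 0 | 0 |
| 1991 | 21 | 21 | 21 | 21 | 17 | 17 | 1039 | 80710 | 81749 | 98.729 | 100.000 | 99.360 | 21 | 0 | 0 |
| 1993 | 23 | 23 | 23 | 23 | 1 | 12 | 509 | 61743 | 62252 | 99.182 | 100.000 | 99.589 | 23 | 0 | 0 |
| 1995 | 26 | 26 | 26 | 26 | 5 | 9 | 872 | 85088 | 85960 | 98.986 | 100.000 | 99.490 | 26 | 0 | 0 |
| 1997 | 36 | 36 | 35 | 35 | 16 | 22 | 1175 | 111939 | 113114 | 98.961 | 97.222 | 98.084 | 35 | 0 | 0 |
| 1999 | 86 | 86 | 86 | 86 | 75 | 75 | 4655 | 298655 | 303310 | 98.465 | 100.000 | 99.227 | 84 | 0 | 2 |
| 2001 | 68 | 68 | 63 | 63 | 63 | 57 | 4245 | 214092 | 218337 | 98.056 | 92.647 | 95.275 | 63 | 0 | 0 |
| 2003 | 72 | 72 | 71 | 71 | 66 | 67 | 5142 | 285809 | 290951 | 98.233 | 98.611 | 98.422 | 71 | 0 | 0 |
| 2005 | 105 | 105 | 102 | 102 | 96 | 99 | 8673 | 430069 | 438742 | 98.023 | 97.143 | 97.581 | 102 | 0 | 0 |
| 2007 | 73 | 73 | 73 | 73 | 67 | 71 | 6602 | 386924 | 393526 | 98.322 | 100.000 | 99.154 | 73 | 0 | 0 |
| 2009 | 73 | 73 | 71 | 71 | 58 | 59 | 4583 | 271969 | 276552 | 98.343 | 97.260 | 97.799 | 71 | 0 | 0 |
| 2011 | 73 | 73 | 72 | 72 | 71 | 71 | 5457 | 313206 | 318663 | 98.288 | 98.630 | 98.459 | 72 | 0 | 0 |
| 2013 | 51 | 51 | 51 | 51 | 46 | 47 | 4183 | 215942 | 220125 | 98.100 | 100.000 | 99.041 | 50 | 0 | 1 |
| 2015 | 40 | 40 | 40 | 40 | 34 | 33 | 3024 | 186010 | 189034 | 98.400 | 100.000 | 99.194 | 39 | 0 | 1 |
| total | 795 | 795 | 782 | 782 | 615 | 654 | 50995 | 3033717 | 3084712 | 98.347 | 98.365 | 98.356 | 777 | 0 | 5 |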
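
The two per-year scores can be recomputed from the raw counts. The following is a minimal Python sketch, not the code that produced this report: the function names `noise_eval` and `silence_eval` are hypothetical, and the definitions are taken from the column headers (known words over words of the content, and non-empty extracted papers over PDF documents, both as percentages).

```python
def noise_eval(known_words: int, total_words: int) -> float:
    """Percentage of content words that are known to the lexicon."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * known_words / total_words


def silence_eval(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF papers that yielded a non-empty extraction."""
    if pdf_papers <= 0:
        raise ValueError("pdf_papers must be positive")
    return 100.0 * non_empty_papers / pdf_papers


# Counts copied from the 1997 row of the table.
noise_1997 = noise_eval(known_words=111939, total_words=113114)
silence_1997 = silence_eval(non_empty_papers=35, pdf_papers=36)
print(f"noise={noise_1997:.3f} silence={silence_1997:.3f}")
# prints: noise=98.961 silence=97.222
```

Both values agree with the 1997 row, which suggests the table reports the scores rounded to three decimals.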

Note#1: unknown words beginning with a lower-case letter are sometimes genuine unknown words, but often they are fragments of words that were cut because PDFBox misinterpreted a multi-column PDF layout.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is the harmonic mean of the two scores, computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
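
As a worked check of the formula in Note#4, here is a short Python sketch. The function name `combined_eval` is hypothetical, and the sum in the denominator is assumed to be grouped as (EvalNoise+EvalSilence), which is the reading that reproduces the values in the table.

```python
def combined_eval(eval_noise: float, eval_silence: float) -> float:
    """Harmonic mean of the noise and silence scores (both in percent)."""
    total = eval_noise + eval_silence
    if total == 0:
        # Both scores are zero: define the combined score as zero
        # instead of dividing by zero (an assumption of this sketch).
        return 0.0
    return 2 * eval_noise * eval_silence / total


# Scores copied from the 1997 row of the table.
print(f"{combined_eval(98.961, 97.222):.3f}")
# prints: 98.084
```

Without the parentheses, normal operator precedence would divide only by EvalNoise and then add EvalSilence, which gives roughly 291.7 for the same row, not a percentage.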

total elapsed time (read and display included)= 1.6102833333333335 minutes with 8 cores

computed on: Sun Sep 25 19:27:33 CEST 2016 from: METADONNEES_MTS_150802.txt