Estimation of the quality based on absence/presence of words in the TagParser lexicon.

Columns:
  year      = year
  meta      = nb of papers from the metadata
  PDF       = nb of papers in PDF
  XML       = nb of papers in XML (= output of PDFBox)
  non-empty = nb of non-empty papers as extraction result
  abstract  = nb of papers with an abstract (from extraction)
  refs      = nb of papers with references (from extraction)
  unknown   = nb of unknown words
  known     = nb of known words
  words     = nb of words of the content
  noise     = evaluation of noise = percentage of nb of known words / nb of words of the content
  silence   = evaluation of silence = percentage of non-empty papers as extraction result / PDF docs
  combined  = combined evaluation of noise and silence
  en        = nb of English papers
  fr        = nb of French papers
  other     = nb of papers in another language (es+de+ru)

year   meta  PDF  XML  non-empty  abstract  refs  unknown    known    words    noise  silence  combined   en  fr  other
2003     18   18   17         17        16    17     1430    81772    83202   98.281   94.444    96.325   17   0      0
2004     21   21   20         20        20    20     1441    98732   100173   98.561   95.238    96.871   20   0      0
2005     33   33   32         32        30    30     2006   149547   151553   98.676   96.970    97.816   31   0      1
2006     27   27   26         26        22    23     1376   101108   102484   98.657   96.296    97.463   24   0      2
2007     22   22   21         21        17    18     1329    82062    83391   98.406   95.455    96.908   19   0      2
2008     21   21   21         21        20    20     1625   104410   106035   98.467  100.000    99.228   21   0      0
2009     18   18   18         18        17    17     1177    71499    72676   98.380  100.000    99.184   18   0      0
2010     15   15   15         15        13    13      947    60732    61679   98.465  100.000    99.226   15   0      0
2011     21   21   21         21        19    19     1563   100955   102518   98.475  100.000    99.232   21   0      0
2012     20   20   19         19        16    17      906    69630    70536   98.716   95.000    96.822   19   0      0
2013     21   21   21         21        19    19     1270    80127    81397   98.440  100.000    99.214   21   0      0
2014     25   25   25         25        24    24     1527    96494    98021   98.442  100.000    99.215   25   0      0
total   262  262  256        256       233   237    16597  1097068  1113665   98.510   97.710    98.108  251   0      5
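
As a minimal sketch, the noise and silence percentages defined in the column headers can be recomputed from the raw counts. The function names below are hypothetical and are not taken from the original pipeline; the figures in the usage lines are those reported for 2003.

```python
def noise_eval(known_words: int, content_words: int) -> float:
    """Percentage of content words that are known to the lexicon."""
    if content_words <= 0:
        raise ValueError("content_words must be positive")
    return 100.0 * known_words / content_words


def silence_eval(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF documents that yielded a non-empty paper."""
    if pdf_papers <= 0:
        raise ValueError("pdf_papers must be positive")
    return 100.0 * non_empty_papers / pdf_papers


# 2003 figures: 81772 known words out of 83202; 17 non-empty papers out of 18 PDFs.
print(f"{noise_eval(81772, 83202):.3f}")  # 98.281
print(f"{silence_eval(17, 18):.3f}")      # 94.444
```

Note that, as defined here, both scores are "higher is better": the noise evaluation is the share of known words, not the share of unknown ones.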

Note#1: unknown words with an initial lower-case letter are sometimes genuine whole words, but are often fragments of words that were cut because PDFBox misinterpreted multi-column PDF layouts.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content holds an entry in the metadata, and normally, each paper has an entry in the metadata.

Note#4: the combined evaluation is computed as the harmonic mean of the two scores: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
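
A short sketch of the combined evaluation, reading the formula in Note#4 as the harmonic mean 2*N*S / (N+S), with the denominator parenthesised. The function name is hypothetical; the input values are the noise and silence scores reported in the "total" row.

```python
def combined_eval(noise: float, silence: float) -> float:
    """Harmonic mean of the noise and silence percentages."""
    if noise + silence == 0:
        # Both scores are zero: define the combined score as zero
        # instead of dividing by zero.
        return 0.0
    return 2.0 * noise * silence / (noise + silence)


# "total" row: noise = 98.510, silence = 97.710
print(f"{combined_eval(98.510, 97.710):.3f}")  # 98.108
```

The parentheses matter: under standard operator precedence, the unparenthesised expression would divide by EvalNoise alone and then add EvalSilence, which does not reproduce the reported values.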

total elapsed time (read and display included) = 0.6162333333333333 minutes with 8 cores

computed on: Sun Sep 25 17:49:24 CEST 2016 from: METADONNEES_ALTA_140426.txt