Estimation of extraction quality based on the presence or absence of words in the TagParser lexicon.

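| year | nb of papers from the metadata | nb of papers in PDF | nb of papers in XML (= output of PDFBox) | nb of non empty papers as extraction result | nb of papers with an abstract (from extraction) | nb of papers with references (from extraction) | nb of unknown words | nb of known words | nb of words of the content | evaluation of noise = percentage of nb of known words / nb of words of the content | evaluation of silence = percentage of non empty papers as extraction result / PDF docs | combined evaluation of noise and silence | nb of English papers | nb of French papers | nb of papers in another language (es+de+ru) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2004 | 1 | 1 | 1 | 1 | 0 | 1 | 166 | 8361 | 8527 | 98.053 | 100.000 | 99.017 | 1 | 0 | 0 |
| 2005 | 5 | 5 | 5 | 5 | 0 | 5 | 819 | 56537 | 57356 | 98.572 | 100.000 | 99.281 | 5 | 0 | 0 |
| 2006 | 7 | 7 | 7 | 7 | 0 | 7 | 1247 | 80964 | 82211 | 98.483 | 100.000 | 99.236 | 7 | 0 | 0 |
| 2007 | 12 | 12 | 12 | 12 | 0 | 12 | 2775 | 146159 | 148934 | 98.137 | 100.000 | 99.060 | 12 | 0 | 0 |
| 2008 | 3 | 3 | 3 | 3 | 0 | 3 | 1043 | 38501 | 39544 | 97.362 | 100.000 | 98.664 | 3 | 0 | 0 |
| 2009 | 2 | 2 | 2 | 2 | 0 | 2 | 284 | 24987 | 25271 | 98.876 | 100.000 | 99.435 | 2 | 0 | 0 |
| 2010 | 3 | 3 | 3 | 3 | 0 | 3 | 636 | 29316 | 29952 | 97.877 | 100.000 | 98.927 | 3 | 0 | 0 |
| 2011 | 20 | 20 | 20 | 20 | 0 | 20 | 3803 | 224326 | 228129 | 98.333 | 100.000 | 99.159 | 20 | 0 | 0 |
| 2012 | 8 | 8 | 8 | 8 | 0 | 8 | 1853 | 103243 | 105096 | 98.237 | 100.000 | 99.111 | 8 | 0 | 0 |
| 2013 | 21 | 21 | 21 | 21 | 0 | 21 | 4995 | 302824 | 307819 | 98.377 | 100.000 | 99.182 | 21 | 0 | 0 |
| total | 82 | 82 | 82 | 82 | 0 | 82 | 17621 | 1015218 | 1032839 | 98.294 | 100.000 | 99.140 | 82 | 0 | 0 |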

Note#1: unknown words with an initial lower-case letter are sometimes genuine full-size unknown words, but often they are words which have been cut because PDFBox misinterpreted a multi-column PDF layout.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is computed as the harmonic mean of the two scores: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
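
As an illustration, the three evaluation columns can be recomputed from the raw counts. The sketch below is not the code that produced this report; the function names are hypothetical, and it reads the denominator of Note#4 as the sum (EvalNoise + EvalSilence), which is the reading that reproduces the values of the 2004 row.

```python
# Minimal sketch (hypothetical function names, not the original pipeline code).

def noise_eval(known_words: int, total_words: int) -> float:
    """Percentage of content words that are present in the lexicon."""
    return 100.0 * known_words / total_words


def silence_eval(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF papers that yield a non empty extraction result."""
    return 100.0 * non_empty_papers / pdf_papers


def combined_eval(noise: float, silence: float) -> float:
    """Harmonic mean of the two scores, with the denominator taken as a sum."""
    if noise + silence == 0:
        return 0.0  # assumption: avoid a division by zero when both scores are 0
    return 2.0 * noise * silence / (noise + silence)


# Counts of the 2004 row: 8361 known words out of 8527, 1 non empty paper out of 1 PDF.
noise = noise_eval(8361, 8527)
silence = silence_eval(1, 1)
combined = combined_eval(noise, silence)
print(f"{noise:.3f} {silence:.3f} {combined:.3f}")  # prints: 98.053 100.000 99.017
```

Because the silence score is 100.000 for every year, the combined evaluation only reflects the variations of the noise score here.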

total elapsed time (read and display included)= 0.5135 minutes with 8 cores

computed on: Sun Sep 25 17:48:48 CEST 2016 from: METADONNEES_ACMTSLP_150730.txt