Estimation of the quality based on absence/presence of words in the TagParser lexicon.

Columns:
  year       year
  meta       nb of papers from the metadata
  PDF        nb of papers in PDF
  XML        nb of papers in XML (= output of PDFBox)
  non-empty  nb of non-empty papers as extraction result
  abstr      nb of papers with an abstract (from extraction)
  refs       nb of papers with references (from extraction)
  unknown    nb of unknown words
  known      nb of known words
  words      nb of words of the content
  noise      evaluation of noise = percentage of nb of known words / nb of words of the content
  silence    evaluation of silence = percentage of non-empty papers as extraction result / PDF docs
  combined   combined evaluation of noise and silence
  en         nb of English papers
  fr         nb of French papers
  other      nb of papers in another language (es+de+ru)

year   meta   PDF   XML  non-empty  abstr  refs  unknown    known    words   noise  silence  combined    en  fr  other
2000     43    43    43         43     11    36     4763   226697   231460  97.942  100.000    98.960    43   0      0
2001     31    31    18         18     17    18     1731    98039    99770  98.265   58.065    72.996    18   0      0
2003     75    75    70         70     65    69     3856   259838   263694  98.538   93.333    95.865    70   0      0
2004     82    82    78         78     78    78     4513   342620   347133  98.700   95.122    96.878    78   0      0
2006    113   113   112        112    111   111     6935   431982   438917  98.420   99.115    98.766   112   0      0
2007    127   127   127        127    126   126     7775   527300   535075  98.547  100.000    99.268   127   0      0
2009    146   146   143        143    142   142    10258   591331   601589  98.295   97.945    98.120   143   0      0
2010    146   146   146        146    143   145    10837   648725   659562  98.357  100.000    99.172   146   0      0
2012     97    97    97         97     95    97     8669   492478   501147  98.270  100.000    99.128    97   0      0
2013    140   140   140        140    140   140    12894   704289   717183  98.202  100.000    99.093   140   0      0
2015    186   186   186        186    185   186    18903   940314   959217  98.029  100.000    99.005   186   0      0
total  1186  1186  1160       1160   1113  1148    91134  5263613  5354747  98.298   97.808    98.052  1160   0      0
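As an illustration of how the two percentage columns follow from the raw counts, here is a minimal Python sketch. The function names are hypothetical and are not taken from the tool that produced this report. It only applies the two definitions given in the column headers, using the counts of the 2001 row.

```python
def noise_evaluation(known_words: int, content_words: int) -> float:
    """Percentage of content words that are known to the lexicon."""
    return 100.0 * known_words / content_words


def silence_evaluation(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF documents that yield a non-empty extraction."""
    return 100.0 * non_empty_papers / pdf_papers


# Counts from the 2001 row: 98039 known words out of 99770 content words,
# and 18 non-empty papers out of 31 PDF documents.
noise_2001 = noise_evaluation(98039, 99770)
silence_2001 = silence_evaluation(18, 31)
print(f"{noise_2001:.3f}")    # 98.265
print(f"{silence_2001:.3f}")  # 58.065
```

Both scores are "higher is better": a high noise evaluation means few unknown words, and a high silence evaluation means few papers lost during extraction.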

Note#1: unknown words with an initial lower-case letter are sometimes genuine complete words missing from the lexicon, but often they are fragments of words that were cut because PDFBox misinterpreted the multi-column PDF layout.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is the harmonic mean of the two scores, computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
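The formula in Note#4 can be sketched in Python as follows. This assumes the denominator is the sum EvalNoise+EvalSilence, which is the reading that reproduces the values in the table. The function name is hypothetical, and the zero-denominator guard is an added assumption rather than documented behaviour of the original tool.

```python
def combined_evaluation(eval_noise: float, eval_silence: float) -> float:
    """Harmonic mean of the noise and silence evaluations (both in percent)."""
    denominator = eval_noise + eval_silence
    if denominator == 0:
        # Assumption: return 0 when both scores are 0 to avoid dividing by zero.
        return 0.0
    return 2 * eval_noise * eval_silence / denominator


# 2001 row: noise = 98.265, silence = 58.065
print(f"{combined_evaluation(98.265, 58.065):.3f}")  # 72.996
```

Because the harmonic mean is pulled toward the lower of the two scores, a year with many lost papers (such as 2001) gets a low combined evaluation even when almost all extracted words are known.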

total elapsed time (read and display included) = 2.66 minutes with 8 cores

computed on: Sun Sep 25 19:30:34 CEST 2016 from: METADONNEES_NAACL_140425.txt