Estimation of extraction quality based on the presence or absence of words in the TagParser lexicon.

Column key:
  year      year
  meta      nb of papers from the metadata
  PDF       nb of papers in PDF
  XML       nb of papers in XML (= output of PDFBox)
  nonempty  nb of non-empty papers as extraction result
  abstract  nb of papers with an abstract (from extraction)
  refs      nb of papers with references (from extraction)
  unknown   nb of unknown words
  known     nb of known words
  words     nb of words of the content
  noise     evaluation of noise = percentage of nb of known words / nb of words of the content
  silence   evaluation of silence = percentage of non-empty papers as extraction result / PDF docs
  combined  combined evaluation of noise and silence
  en        nb of English papers
  fr        nb of French papers
  other     nb of papers in another language (es+de+ru)

year   meta   PDF   XML  nonempty  abstract  refs  unknown    known    words   noise  silence  combined   en   fr  other
2006     25    25    25        25        25    25     4058   256029   260087  98.440  100.000    99.214    4   21      0
2007     21    21    21        21        21    21     2908   204679   207587  98.599  100.000    99.295    4   17      0
2008     24    24    24        24        24    24     5451   270852   276303  98.027  100.000    99.004    3   21      0
2009     27    27    27        27        27    27     6216   302966   309182  97.990  100.000    98.985    8   19      0
2010     14    14    14        14        14    14     2828   154452   157280  98.202  100.000    99.093    2   12      0
2011     20    20    20        20        20    20     5947   241144   247091  97.593  100.000    98.782    4   16      0
2012     14    14    14        14        14    14     2903   164980   167883  98.271  100.000    99.128    7    7      0
2013     12    12    12        12        12    12     2390   143461   145851  98.361  100.000    99.174    1   11      0
2014     11    11    11        11        11    11     2344   126715   129059  98.184  100.000    99.084    0   11      0
2015      9     9     9         9         9     9     2077    95176    97253  97.864  100.000    98.921    1    8      0
total   177   177   177       177       177   177    37122  1960454  1997576  98.142  100.000    99.062   34  143      0

Note#1: unknown words with an initial lower-case letter are sometimes genuine whole words missing from the lexicon, but they are often fragments of words that were cut because PDFBox misinterpreted a multi-column PDF layout.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is computed as the harmonic mean of the two scores: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
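
As an illustration, the three evaluation figures can be recomputed from the raw counts. The following is a minimal sketch in Python, not the code that produced this report; the function names (noise_score, silence_score, combined_score) are hypothetical, and the input values are the ones reported for 2006.

```python
# Hypothetical helper names; this is a sketch, not the original pipeline code.

def noise_score(known_words: int, total_words: int) -> float:
    """Percentage of content words that are found in the lexicon."""
    return 100.0 * known_words / total_words


def silence_score(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF papers that yielded a non-empty extraction."""
    return 100.0 * non_empty_papers / pdf_papers


def combined_score(noise: float, silence: float) -> float:
    """Harmonic mean of the two scores, as described in Note#4."""
    if noise + silence == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * noise * silence / (noise + silence)


# Counts reported for the year 2006.
noise = noise_score(known_words=256029, total_words=260087)
silence = silence_score(non_empty_papers=25, pdf_papers=25)
combined = combined_score(noise, silence)

print(f"{noise:.3f} {silence:.3f} {combined:.3f}")  # 98.440 100.000 99.214
```

The parentheses around the denominator matter: without them, the usual operator precedence would divide by EvalNoise alone and then add EvalSilence.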

total elapsed time (reading and display included) = 1.0819 minutes with 8 cores

computed on: Sun Sep 25 19:39:35 CEST 2016 from: METADONNEES_TAL_140604.txt