Estimation of the quality based on absence/presence of words in the TagParser lexicon.

Columns:
  year      year
  meta      nb of papers from the metadata
  PDF       nb of papers in PDF
  XML       nb of papers in XML (= output of PDFBox)
  nonEmpty  nb of non-empty papers as extraction result
  abstr     nb of papers with an abstract (from extraction)
  refs      nb of papers with references (from extraction)
  unknown   nb of unknown words
  known     nb of known words
  words     nb of words of the content
  noise     evaluation of noise = percentage of nb of known words / nb of words of the content
  silence   evaluation of silence = percentage of non-empty papers as extraction result / PDF docs
  combined  combined evaluation of noise and silence
  en        nb of English papers
  fr        nb of French papers
  other     nb of papers in another language (es+de+ru)

year   meta  PDF  XML  nonEmpty  abstr  refs  unknown    known    words   noise  silence  combined   en  fr  other
2001     40   40   39        39     32    38     1842    98989   100831  98.173   97.500    97.835   39   0      0
2004     64   64   56        56     49    55     2507   151230   153737  98.369   87.500    92.617   56   0      0
2007    110  110  109       109    100   109     4171   292425   296596  98.594   99.091    98.842  109   0      0
2010    100  100  100       100     98    99     4720   274846   279566  98.312  100.000    99.149  100   0      0
2012    109  109  109       109    107   109     7766   411751   419517  98.149  100.000    99.066  109   0      0
2013    158  158  158       158    153   157    11163   547981   559144  98.004  100.000    98.992  158   0      0
2014    170  170  170       170    168   170    10117   576323   586440  98.275  100.000    99.130  170   0      0
2015    198  198  198       198    197   198    12411   684548   696959  98.219  100.000    99.102  198   0      0
total   949  949  939       939    904   935    54697  3038093  3092790  98.231   98.946    98.588  939   0      0

Note#1: unknown words with an initial lower-case letter are sometimes genuine full-size unknown words, but often they are fragments of words that were cut because PDFBox misinterpreted multi-column PDF layouts.

Note#2: a paper without any content (i.e. without any body) is not passed on to the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is kept.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence), i.e. the harmonic mean of the two evaluations.
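
The computation in Note#4 can be checked with a short sketch. It assumes the denominator is the whole sum EvalNoise+EvalSilence (harmonic-mean form), which is the reading that reproduces the reported values. The function names are illustrative and are not taken from the original tool; the input figures are those of year 2001.

```python
def eval_noise(known_words: int, content_words: int) -> float:
    """Percentage of content words that are known to the lexicon."""
    return 100.0 * known_words / content_words


def eval_silence(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF papers that produced a non-empty extraction."""
    return 100.0 * non_empty_papers / pdf_papers


def combined_eval(noise: float, silence: float) -> float:
    """Harmonic mean of the two evaluations (assumed reading of Note#4)."""
    return 2.0 * noise * silence / (noise + silence)


# Figures for year 2001: 98989 known words out of 100831 content words,
# 39 non-empty papers out of 40 PDF documents.
noise_2001 = eval_noise(98989, 100831)
silence_2001 = eval_silence(39, 40)
combined_2001 = combined_eval(noise_2001, silence_2001)

print(f"{noise_2001:.3f} {silence_2001:.3f} {combined_2001:.3f}")
# prints: 98.173 97.500 97.835
```

Reading the formula without parentheses, as (2*EvalNoise*EvalSilence / EvalNoise) + EvalSilence, would give three times EvalSilence, which does not match any reported value.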

total elapsed time (read and display included)= 1.5348666666666666 minutes with 8 cores

computed on: Sun Sep 25 19:35:26 CEST 2016 from: METADONNEES_SEM_140514.txt