Estimation of extraction quality based on the presence or absence of words in the TagParser lexicon.

Columns:
  meta      nb of papers from the metadata
  pdf       nb of papers in PDF
  xml       nb of papers in XML (= output of PDFBox)
  nonempty  nb of non-empty papers as extraction result
  abstr     nb of papers with an abstract (from extraction)
  refs      nb of papers with references (from extraction)
  unknown   nb of unknown words
  known     nb of known words
  words     nb of words of the content
  noise     evaluation of noise = percentage of nb of known words / nb of words of the content
  silence   evaluation of silence = percentage of non-empty papers as extraction result / PDF docs
  combined  combined evaluation of noise and silence
  en        nb of English papers
  fr        nb of French papers
  other     nb of papers in another language (es+de+ru)

year   meta   pdf   xml  nonempty  abstr  refs  unknown    known    words   noise  silence  combined    en  fr  other
1995     37    37    37        37     23    36     2680   126688   129368  97.928  100.000    98.953    37   0      0
1996     51    51    51        51     41    49     5978   216092   222070  97.308  100.000    98.636    51   0      0
1998     37    37    36        36      9    36     4140   200457   204597  97.977   97.297    97.636    36   0      0
1999     36    36    36        36     31    33     4162   159323   163485  97.454  100.000    98.711    36   0      0
2000     41    41    41        41     28    39     5778   210885   216663  97.333  100.000    98.649    40   0      1
2001     32    32    32        32     19    32     3882   171680   175562  97.789  100.000    98.882    32   0      0
2002     44    44    44        44     13    43     6077   211680   217757  97.209  100.000    98.585    44   0      0
2003     51    51    50        50     22    49     6966   221075   228041  96.945   98.039    97.489    50   0      0
2004     32    32    30        30     27    29     3871   131553   135424  97.142   93.750    95.416    30   0      0
2005     35    35    34        34     29    33     3465   148204   151669  97.715   97.143    97.428    34   0      0
2006     71    71    70        70     68    70     3717   210821   214538  98.267   98.592    98.429    70   0      0
2007     56    56    55        55     55    55     6016   225039   231055  97.396   98.214    97.804    55   0      0
2008     51    51    51        51     49    50     5072   242867   247939  97.954  100.000    98.967    51   0      0
2009     98    98    98        98     97    98     8151   443372   451523  98.195  100.000    99.089    98   0      0
2010    102   102   102       102     98   100    10328   418437   428765  97.591  100.000    98.781   100   0      2
2011     67    67    67        67     66    67     7146   307174   314320  97.727  100.000    98.850    67   0      0
2012     70    70    70        70     64    68     9373   363898   373271  97.489  100.000    98.729    70   0      0
2013     54    54    54        54     51    54     6792   281404   288196  97.643  100.000    98.808    54   0      0
2014     74    74    74        74     71    72     8610   385529   394139  97.815  100.000    98.896    74   0      0
total  1039  1039  1032      1032    861  1013   112204  4676178  4788382  97.657   99.326    98.484  1029   0      3
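
The two percentage columns for noise and silence can be re-derived from the raw counts. The following is a minimal sketch in Python, assuming the definitions given in the column headers; the function names noise_score and silence_score are hypothetical and are not taken from the pipeline itself.

```python
def noise_score(known_words: int, total_words: int) -> float:
    """Percentage of content words that are found in the lexicon."""
    return 100.0 * known_words / total_words


def silence_score(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF papers that yield a non-empty extraction."""
    return 100.0 * non_empty_papers / pdf_papers


# Counts taken from the 1998 row of the table.
noise_1998 = noise_score(known_words=200457, total_words=204597)
silence_1998 = silence_score(non_empty_papers=36, pdf_papers=37)

# The table reports 97.977 and 97.297 for these two values.
print(f"noise   = {noise_1998:.3f}")
print(f"silence = {silence_1998:.3f}")
```

Under these definitions a higher value is better for both scores: 100 means that every word is known, or that every PDF produced some content.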

Note#1: unknown words with an initial lower-case letter are sometimes genuine unknown words, but more often they are fragments of words that were cut because PDFBox misinterpreted the PDF multi-column layout.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still holds an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
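
The combined evaluation of Note#4 is a harmonic mean of the two scores, so the sum EvalNoise+EvalSilence as a whole is the denominator. A minimal sketch in Python, checked against two rows of the table; the function name combined_score and the zero guard are assumptions added for illustration.

```python
def combined_score(eval_noise: float, eval_silence: float) -> float:
    """Harmonic mean of the noise and silence evaluations (both in percent)."""
    denominator = eval_noise + eval_silence
    if denominator == 0:
        # Assumption: a year with both scores at zero gets a combined score of zero.
        return 0.0
    return 2.0 * eval_noise * eval_silence / denominator


# Rows 1995 and 1998; the table reports 98.953 and 97.636.
print(f"{combined_score(97.928, 100.000):.3f}")
print(f"{combined_score(97.977, 97.297):.3f}")
```

The harmonic mean stays close to the lower of the two scores, so a year with many empty extractions (for instance 2004, silence 93.750) is penalised even when its noise score is high.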

total elapsed time (read and display included)= 2.5151 minutes with 8 cores

computed on: Sun Sep 25 19:33:05 CEST 2016 from: METADONNEES_PACLIC_140429.txt