Estimation of the quality based on absence/presence of words in the TagParser lexicon.

Columns:
  year      year of the papers
  meta      nb of papers from the metadata
  pdf       nb of papers in PDF
  xml       nb of papers in XML (= output of PDFBox)
  non-empty nb of non empty papers as extraction result
  abstract  nb of papers with an abstract (from extraction)
  refs      nb of papers with references (from extraction)
  unknown   nb of unknown words
  known     nb of known words
  words     nb of words of the content
  noise     evaluation of noise = percentage of nb of known words / nb of words of the content
  silence   evaluation of silence = percentage of non empty papers as extraction result / PDF docs
  combined  combined evaluation of noise and silence
  en        nb of English papers
  fr        nb of French papers
  other     nb of papers in another language (es+de+ru)

year   meta  pdf  xml  non-empty  abstract  refs  unknown    known    words   noise  silence  combined   en  fr  other
1986      9    9    9          9         8     9     1553    75536    77089  97.985  100.000    98.982    9   0      0
1987     19   19   19         19        19    18     2720   144921   147641  98.158  100.000    99.070   19   0      0
1989     21   21   21         21        21    19     2477   158208   160685  98.458  100.000    99.223   21   0      0
1990     18   18   18         18        17    18     2231   160976   163207  98.633  100.000    99.312   18   0      0
1991     21   21   21         21        18    17     2556   157221   159777  98.400  100.000    99.194   21   0      0
1992     18   18   18         18        15    18     2268   142529   144797  98.434  100.000    99.211   18   0      0
1993     18   18   18         18        18    18     2674   166519   169193  98.420  100.000    99.203   18   0      0
1994     20   20   20         20        20    20     2712   179841   182553  98.514  100.000    99.252   20   0      0
1995     17   17   17         17        16    17     2393   168063   170456  98.596  100.000    99.293   17   0      0
1996     15   15   15         15        15    15     2141   129915   132056  98.379  100.000    99.183   15   0      0
1997     15   15   15         15        14    15     2715   139419   142134  98.090  100.000    99.036   15   0      0
1998     22   22   22         22        20    22     3442   203560   207002  98.337  100.000    99.162   22   0      0
1999     18   18   18         18        18    18     3545   192719   196264  98.194  100.000    99.089   18   0      0
2000     18   18   18         18        17    18     2825   180861   183686  98.462  100.000    99.225   18   0      0
2001     18   18   18         18        18    18     3209   196772   199981  98.395  100.000    99.191   18   0      0
2002     21   21   21         21        20    21     3585   253613   257198  98.606  100.000    99.298   21   0      0
2003     19   19   19         19        14    14     4083   176022   180105  97.733  100.000    98.853   19   0      0
2004     20   20   20         20         1     2     5062   198347   203409  97.511  100.000    98.740   20   0      0
2005     25   25   25         25         0     0     6707   264697   271404  97.529  100.000    98.749   25   0      0
2006     27   27   27         27         1     7     6929   296558   303487  97.717  100.000    98.845   27   0      0
2007     35   35   35         35         0    20     9280   401808   411088  97.743  100.000    98.858   35   0      0
2008     23   23   23         23         0    13     5438   237824   243262  97.765  100.000    98.870   23   0      0
2009     28   28   28         28         0    17     7153   309203   316356  97.739  100.000    98.857   28   0      0
2010     41   41   41         41         0    27    10925   444629   455554  97.602  100.000    98.786   41   0      0
2011     40   40   40         40        34    39     9819   487445   497264  98.025  100.000    99.003   40   0      0
2012     23   23   23         23        12    13     6563   254188   260751  97.483  100.000    98.725   23   0      0
2013     65   65   65         65        30    30    20785   775315   796100  97.389  100.000    98.677   65   0      0
2014     71   71   71         71        31    36    21781   818474   840255  97.408  100.000    98.687   71   0      0
2015     57   57   57         57        21    24    17395   693617   711012  97.553  100.000    98.762   57   0      0
total   762  762  762        762       418   523   174966  8008800  8183766  97.862  100.000    98.919  762   0      0

Note#1: unknown words with an initial lower-case letter are sometimes genuine full words missing from the lexicon, but often they are fragments of words that were cut because PDFBox misinterpreted multi-column PDF layouts.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

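Note#4: the combined evaluation is computed as the harmonic mean of the two evaluations: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence). For example, for 1986: 2*97.985*100.000 / (97.985+100.000) = 98.982.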
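
The three evaluations can be recomputed from the raw counts. The following is a minimal Python sketch, not the code that produced this report: the function names are hypothetical, and it assumes that the noise evaluation is 100 * known words / words of the content, that the silence evaluation is 100 * non empty papers / PDF docs, and that the combined evaluation divides by the whole sum EvalNoise+EvalSilence. Under these assumptions it reproduces the 1986 values of the table (97.985, 100.000, 98.982).

```python
def noise_score(known_words: int, content_words: int) -> float:
    """Percentage of the content words that are known to the lexicon."""
    return 100.0 * known_words / content_words


def silence_score(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of the PDF papers that gave a non empty extraction."""
    return 100.0 * non_empty_papers / pdf_papers


def combined_score(noise: float, silence: float) -> float:
    """Harmonic mean of the noise and silence evaluations (assumed grouping)."""
    return 2.0 * noise * silence / (noise + silence)


# Counts copied from the 1986 row of the table.
noise = noise_score(known_words=75536, content_words=77089)
silence = silence_score(non_empty_papers=9, pdf_papers=9)
combined = combined_score(noise, silence)
print(f"{noise:.3f} {silence:.3f} {combined:.3f}")
```

Because the combined value is a harmonic mean, it is pulled toward the lower of the two evaluations, so a year with good extraction coverage but many unknown words still gets a lower combined value.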

total elapsed time (read and display included)= 4.151116666666667 minutes with 8 cores

computed on: Sun Sep 25 18:13:39 CEST 2016 from: METADONNEES_CSAL_141118.txt