Estimate of extraction quality, based on the presence or absence of words in the TagParser lexicon.

Columns:
  year
  meta      nb of papers from the metadata
  pdf       nb of papers in PDF
  xml       nb of papers in XML (= output of PDFBox)
  nonempty  nb of non-empty papers as extraction result
  abstract  nb of papers with an abstract (from extraction)
  refs      nb of papers with references (from extraction)
  unknown   nb of unknown words
  known     nb of known words
  words     nb of words of the content
  noise     evaluation of noise = percentage of nb of known words / nb of words of the content
  silence   evaluation of silence = percentage of non-empty papers as extraction result / PDF docs
  combined  combined evaluation of noise and silence
  en        nb of English papers
  fr        nb of French papers
  other     nb of papers in another language (es+de+ru)

year   meta   pdf   xml  nonempty  abstract  refs  unknown     known     words   noise  silence  combined    en  fr  other
1990    236   236   227       227       227     0      166     27880     28046  99.408   96.186    97.771   227   0      0
1991    292   292   284       284       284     0      231     35815     36046  99.359   97.260    98.299   284   0      0
1992    249   249   242       242       242     0      249     31637     31886  99.219   97.189    98.193   242   0      0
1993    217   217   217       217       217     0      215     30497     30712  99.300  100.000    99.649   217   0      0
1994    255   255   249       249       249     0      248     33263     33511  99.260   97.647    98.447   249   0      0
1995    297   297   297       297       277   256    11843    764633    776476  98.475  100.000    99.232   297   0      0
1996    297   297   297       297       283   272    11349    751323    762672  98.512  100.000    99.250   297   0      0
1997    374   374   372       372       363   273    16964    770233    787197  97.845   99.465    98.648   372   0      0
1998    328   328   327       327       321   255    14158    756497    770655  98.163   99.695    98.923   327   0      0
1999    281   281   279       279       274   247    13501    735525    749026  98.198   99.288    98.740   279   0      0
2000    318   318   316       316       311   262    14803    791146    805949  98.163   99.371    98.763   316   0      0
2001    299   299   297       297       291   241    11860    710483    722343  98.358   99.331    98.842   297   0      0
2002    354   354   348       348       334   274    16827    778583    795410  97.884   98.305    98.094   336   0     12
2003    393   393   392       392       389   353    17383   1047746   1065129  98.368   99.746    99.052   392   0      0
2004    395   395   395       395       391   370    19190   1120434   1139624  98.316  100.000    99.151   395   0      0
2005    381   381   372       372       371   361    18923   1114738   1133661  98.331   97.638    97.983   372   0      0
2006    435   435   435       435       432   420    24356   1339456   1363812  98.214  100.000    99.099   435   0      0
2007    369   369   369       369       368   360    24952   1176460   1201412  97.923  100.000    98.951   369   0      0
2008    414   414   414       414       413   389    24338   1280999   1305337  98.136  100.000    99.059   414   0      0
2009    394   394   394       394       393   357    22323   1177342   1199665  98.139  100.000    99.061   394   0      0
2010    423   423   422       422       422   412    27936   1370261   1398197  98.002   99.764    98.875   422   0      0
2011    458   458   458       458       457   444    31586   1477835   1509421  97.907  100.000    98.943   458   0      0
2012    409   409   408       408       408   392    26960   1336534   1363494  98.023   99.756    98.882   408   0      0
2013    577   577   577       577       577   571    46522   2049257   2095779  97.780  100.000    98.878   577   0      0
2014    577   577   577       577       577   563    47552   2054353   2101905  97.738  100.000    98.856   577   0      0
2015    796   796   793       793       788   784    84524   2937565   3022089  97.203   99.623    98.398   793   0      0
total  9818  9818  9758      9758      9659  7856   528959  25700495  26229454  97.983   99.389    98.681  9746   0     12

Note#1: unknown words with an initial lower-case letter are sometimes genuine full-size words missing from the lexicon, but are more often fragments of words that were cut because PDFBox misinterpreted multi-column PDF layouts.
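
The known/unknown split behind these counts can be illustrated with a minimal sketch. It assumes the lexicon is available as a plain set of lower-cased word forms and that lookup is case-insensitive; the function name, the tokenizing rule, and the toy lexicon are hypothetical and do not reflect the actual TagParser interface.

```python
import re

def split_known_unknown(text, lexicon):
    """Split the word tokens of `text` into (known, unknown) lists.

    Assumption: `lexicon` is a set of lower-cased word forms and a token
    is any maximal run of letters. The real TagParser lookup may differ.
    """
    tokens = re.findall(r"[^\W\d_]+", text)
    known = [t for t in tokens if t.lower() in lexicon]
    unknown = [t for t in tokens if t.lower() not in lexicon]
    return known, unknown

# Toy lexicon (hypothetical) and a sentence in which "filtered" was cut
# in two, as can happen when a column break is misread.
toy_lexicon = {"the", "signal", "is", "filtered"}
known, unknown = split_known_unknown("The signal is fil tered", toy_lexicon)
print(known)    # words found in the lexicon
print(unknown)  # lower-case fragments left over from the cut word
```

In this toy case both unknown tokens start with a lower-case letter, which is the pattern the note describes.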

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may result from a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is kept.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is the harmonic mean of the two scores, computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
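
As a worked check, the three scores can be recomputed from the raw counts. The sketch below assumes the column definitions given in the table (noise = known words / words of the content, silence = non-empty papers / PDF docs, both as percentages) and reads the combined formula with the sum in the denominator; the function name and argument names are illustrative only.

```python
def evaluation_scores(known_words, content_words, non_empty_papers, pdf_papers):
    """Return (noise, silence, combined) as percentages.

    Assumption: the combined score is the harmonic mean
    2*noise*silence / (noise + silence).
    """
    noise = 100.0 * known_words / content_words
    silence = 100.0 * non_empty_papers / pdf_papers
    combined = 2.0 * noise * silence / (noise + silence)
    return noise, silence, combined

# Counts from the 1990 row: 27880 known words out of 28046,
# 227 non-empty papers out of 236 PDF documents.
noise, silence, combined = evaluation_scores(27880, 28046, 227, 236)
print(f"{noise:.3f} {silence:.3f} {combined:.3f}")
# The table reports 99.408, 96.186 and 97.771 for that year.
```

Because the harmonic mean is pulled toward the lower of the two values, a year with many empty extractions is penalized even when its word-level score is high.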

total elapsed time (read and display included)= 13.1149 minutes with 8 cores

computed on: Sun Sep 25 18:39:51 CEST 2016 from: METADONNEES_ICASSPS_141030.txt