Estimation of the quality based on absence/presence of words in the TagParser lexicon.

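| year | nb of papers from the metadata | nb of papers in PDF | nb of papers in XML (= output of PDFBox) | nb of non-empty papers as extraction result | nb of papers with an abstract (from extraction) | nb of papers with references (from extraction) | nb of unknown words | nb of known words | nb of words of the content | evaluation of noise = percentage of nb of known words / nb of words of the content | evaluation of silence = percentage of non-empty papers as extraction result / PDF docs | combined evaluation of noise and silence | nb of English papers | nb of French papers | nb of papers in another language (es+de+ru) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1975 | 110 | 110 | 109 | 109 | 26 | 89 | 5147 | 291792 | 296939 | 98.267 | 99.091 | 98.677 | 100 | 0 | 9 |
| 1976 | 110 | 110 | 110 | 110 | 17 | 99 | 4808 | 312567 | 317375 | 98.485 | 100.000 | 99.237 | 103 | 0 | 7 |
| 1977 | 113 | 113 | 113 | 113 | 24 | 100 | 4546 | 287754 | 292300 | 98.445 | 100.000 | 99.216 | 103 | 0 | 10 |
| 1978 | 123 | 122 | 122 | 122 | 42 | 107 | 5158 | 357184 | 362342 | 98.576 | 100.000 | 99.283 | 113 | 0 | 9 |
| 1979 | 137 | 137 | 137 | 137 | 43 | 115 | 7021 | 440349 | 447370 | 98.431 | 100.000 | 99.209 | 131 | 0 | 6 |
| 1980 | 139 | 139 | 139 | 139 | 51 | 123 | 7593 | 455638 | 463231 | 98.361 | 100.000 | 99.174 | 133 | 0 | 6 |
| 1981 | 206 | 206 | 206 | 206 | 63 | 181 | 11886 | 707923 | 719809 | 98.349 | 100.000 | 99.167 | 184 | 0 | 22 |
| 1982 | 145 | 145 | 145 | 145 | 38 | 136 | 9360 | 499055 | 508415 | 98.159 | 100.000 | 99.071 | 133 | 0 | 12 |
| 1983 | 230 | 230 | 230 | 230 | 70 | 206 | 15612 | 879640 | 895252 | 98.256 | 100.000 | 99.120 | 218 | 0 | 12 |
| 1984 | 194 | 194 | 194 | 194 | 69 | 161 | 14180 | 771638 | 785818 | 98.196 | 100.000 | 99.090 | 182 | 0 | 12 |
| 1985 | 257 | 257 | 257 | 257 | 101 | 224 | 20369 | 1095181 | 1115550 | 98.174 | 100.000 | 99.079 | 240 | 0 | 17 |
| 1986 | 219 | 219 | 219 | 219 | 141 | 205 | 20380 | 1185488 | 1205868 | 98.310 | 100.000 | 99.148 | 214 | 0 | 5 |
| 1987 | 251 | 251 | 251 | 251 | 142 | 243 | 17562 | 1239452 | 1257014 | 98.603 | 100.000 | 99.297 | 243 | 0 | 8 |
| 1988 | 243 | 243 | 243 | 243 | 150 | 231 | 19886 | 1390118 | 1410004 | 98.590 | 100.000 | 99.290 | 237 | 0 | 6 |
| 1989 | 271 | 271 | 271 | 271 | 128 | 258 | 23922 | 1585311 | 1609233 | 98.513 | 100.000 | 99.251 | 263 | 0 | 8 |
| 1990 | 264 | 264 | 264 | 264 | 162 | 252 | 23556 | 1608761 | 1632317 | 98.557 | 100.000 | 99.273 | 253 | 0 | 11 |
| 1991 | 382 | 382 | 382 | 382 | 160 | 373 | 42829 | 2204895 | 2247724 | 98.095 | 100.000 | 99.038 | 358 | 0 | 24 |
| 1992 | 369 | 369 | 369 | 369 | 209 | 360 | 46838 | 2377275 | 2424113 | 98.068 | 100.000 | 99.024 | 357 | 0 | 12 |
| 1993 | 48 | 48 | 48 | 48 | 26 | 46 | 5616 | 357293 | 362909 | 98.453 | 100.000 | 99.220 | 48 | 0 | 0 |
| 1994 | 73 | 73 | 73 | 73 | 36 | 70 | 6309 | 488777 | 495086 | 98.726 | 100.000 | 99.359 | 68 | 0 | 5 |
| 1995 | 53 | 53 | 53 | 53 | 27 | 50 | 5202 | 368677 | 373879 | 98.609 | 100.000 | 99.299 | 51 | 0 | 2 |
| 1996 | 51 | 51 | 51 | 51 | 27 | 48 | 4656 | 302653 | 307309 | 98.485 | 100.000 | 99.237 | 50 | 0 | 1 |
| 1997 | 60 | 60 | 60 | 60 | 50 | 60 | 3156 | 346463 | 349619 | 99.097 | 100.000 | 99.547 | 60 | 0 | 0 |
| 1998 | 57 | 57 | 57 | 57 | 46 | 57 | 2720 | 348929 | 351649 | 99.227 | 100.000 | 99.612 | 57 | 0 | 0 |
| 1999 | 74 | 74 | 74 | 74 | 67 | 74 | 3559 | 451691 | 455250 | 99.218 | 100.000 | 99.608 | 74 | 0 | 0 |
| 2000 | 80 | 80 | 80 | 80 | 74 | 79 | 6004 | 508060 | 514064 | 98.832 | 100.000 | 99.413 | 80 | 0 | 0 |
| 2001 | 90 | 90 | 90 | 90 | 88 | 89 | 7346 | 604839 | 612185 | 98.800 | 100.000 | 99.396 | 90 | 0 | 0 |
| 2002 | 61 | 61 | 61 | 61 | 59 | 59 | 4053 | 406532 | 410585 | 99.013 | 100.000 | 99.504 | 60 | 0 | 1 |
| 2003 | 76 | 76 | 76 | 76 | 75 | 75 | 5343 | 525068 | 530411 | 98.993 | 100.000 | 99.494 | 76 | 0 | 0 |
| 2004 | 55 | 55 | 55 | 55 | 52 | 52 | 3657 | 368257 | 371914 | 99.017 | 100.000 | 99.506 | 55 | 0 | 0 |
| 2005 | 112 | 112 | 112 | 112 | 110 | 110 | 8895 | 784394 | 793289 | 98.879 | 100.000 | 99.436 | 112 | 0 | 0 |
| 2006 | 205 | 205 | 205 | 205 | 199 | 200 | 16909 | 1535014 | 1551923 | 98.910 | 100.000 | 99.452 | 204 | 0 | 1 |
| 2007 | 224 | 224 | 224 | 224 | 221 | 222 | 18991 | 1716351 | 1735342 | 98.906 | 100.000 | 99.450 | 224 | 0 | 0 |
| 2008 | 145 | 145 | 145 | 145 | 144 | 144 | 13270 | 1167313 | 1180583 | 98.876 | 100.000 | 99.435 | 145 | 0 | 0 |
| 2009 | 137 | 137 | 137 | 137 | 134 | 134 | 12780 | 1084131 | 1096911 | 98.835 | 100.000 | 99.414 | 137 | 0 | 0 |
| 2010 | 186 | 186 | 186 | 186 | 176 | 181 | 15694 | 1473423 | 1489117 | 98.946 | 100.000 | 99.470 | 186 | 0 | 0 |
| 2011 | 222 | 222 | 222 | 222 | 221 | 222 | 22643 | 1818437 | 1841080 | 98.770 | 100.000 | 99.381 | 222 | 0 | 0 |
| 2012 | 230 | 230 | 230 | 230 | 224 | 225 | 20619 | 1860660 | 1881279 | 98.904 | 100.000 | 99.449 | 229 | 0 | 1 |
| 2013 | 222 | 222 | 222 | 222 | 221 | 221 | 20638 | 1795624 | 1816262 | 98.864 | 100.000 | 99.429 | 222 | 0 | 0 |
| 2014 | 182 | 182 | 182 | 182 | 179 | 181 | 20554 | 1508607 | 1529161 | 98.656 | 100.000 | 99.323 | 181 | 0 | 1 |
| 2015 | 198 | 198 | 198 | 198 | 196 | 198 | 21175 | 1688405 | 1709580 | 98.761 | 100.000 | 99.377 | 198 | 0 | 0 |
| total | 6604 | 6603 | 6602 | 6602 | 4288 | 6260 | 550442 | 39199619 | 39750061 | 98.615 | 99.985 | 99.295 | 6394 | 0 | 208 |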

Note#1: unknown words with an initial lower-case letter are sometimes genuine whole words missing from the lexicon, but more often they are fragments of words that were cut because PDFBox misinterpreted a multi-column PDF layout.

Note#2: a paper without any content (i.e. without any body) is excluded from the pipeline. This situation may result from a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is kept.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is the harmonic mean of the two scores, computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
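
The three scores can be recomputed from the raw counts. The following is a minimal Python sketch, not the code that produced this report; the function name `evaluate` and its parameter names are hypothetical, and the definitions follow the column headers and Note#4. The input figures are those of the 1975 row.

```python
def evaluate(known_words, total_words, non_empty_papers, pdf_papers):
    """Return (noise, silence, combined) as percentages.

    Hypothetical helper: the names are illustrative, the formulas follow
    the column headers of the table and Note#4.
    """
    # Noise evaluation: share of content words found in the lexicon.
    noise = 100.0 * known_words / total_words
    # Silence evaluation: share of PDF documents yielding a non-empty paper.
    silence = 100.0 * non_empty_papers / pdf_papers
    # Combined evaluation: harmonic mean of the two scores (Note#4).
    combined = 2 * noise * silence / (noise + silence)
    return noise, silence, combined


# Counts taken from the 1975 row of the table.
noise, silence, combined = evaluate(291792, 296939, 109, 110)
print(f"{noise:.3f} {silence:.3f} {combined:.3f}")  # prints "98.267 99.091 98.677"
```

The printed values match the 1975 row, which suggests the combined score is the harmonic mean of the noise and silence percentages.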

total elapsed time (read and display included)= 20.656616666666665 minutes with 8 cores

computed on: Sun Sep 25 20:03:00 CEST 2016 from: METADONNEES_TASLP_20150209.txt