Estimation of the quality based on absence/presence of words in the TagParser lexicon.

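| year | nb of papers from the metadata | nb of papers in PDF | nb of papers in XML (= output of PDFBox) | nb of non-empty papers as extraction result | nb of papers with an abstract (from extraction) | nb of papers with references (from extraction) | nb of unknown words | nb of known words | nb of words of the content | evaluation of noise = percentage of nb of known words / nb of words of the content | evaluation of silence = percentage of non-empty papers as extraction result / PDF docs | combined evaluation of noise and silence | nb of English papers | nb of French papers | nb of papers in another language (es+de+ru) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1997 | 4 | 4 | 3 | 3 | 2 | 3 | 372 | 14056 | 14428 | 97.422 | 75.000 | 84.753 | 0 | 3 | 0 |
| 1998 | 14 | 14 | 13 | 13 | 8 | 10 | 1614 | 56297 | 57911 | 97.213 | 92.857 | 94.985 | 0 | 13 | 0 |
| 1999 | 44 | 44 | 42 | 42 | 25 | 28 | 6109 | 187822 | 193931 | 96.850 | 95.455 | 96.147 | 10 | 32 | 0 |
| 2000 | 21 | 20 | 20 | 20 | 14 | 17 | 1766 | 89582 | 91348 | 98.067 | 100.000 | 99.024 | 10 | 10 | 0 |
| 2001 | 42 | 42 | 40 | 40 | 13 | 14 | 6099 | 147088 | 153187 | 96.019 | 95.238 | 95.627 | 14 | 25 | 1 |
| 2002 | 41 | 41 | 41 | 41 | 26 | 35 | 4575 | 167078 | 171653 | 97.335 | 100.000 | 98.649 | 3 | 38 | 0 |
| 2003 | 46 | 46 | 45 | 45 | 22 | 43 | 3784 | 171449 | 175233 | 97.841 | 97.826 | 97.833 | 5 | 40 | 0 |
| 2004 | 60 | 60 | 58 | 58 | 53 | 55 | 3689 | 226974 | 230663 | 98.401 | 96.667 | 97.526 | 7 | 51 | 0 |
| 2005 | 61 | 61 | 60 | 60 | 2 | 57 | 3998 | 232857 | 236855 | 98.312 | 98.361 | 98.336 | 7 | 53 | 0 |
| 2006 | 62 | 62 | 62 | 62 | 62 | 62 | 5467 | 286769 | 292236 | 98.129 | 100.000 | 99.056 | 7 | 55 | 0 |
| 2007 | 68 | 68 | 68 | 68 | 68 | 67 | 6460 | 319073 | 325533 | 98.016 | 100.000 | 98.998 | 4 | 64 | 0 |
| 2008 | 52 | 52 | 52 | 52 | 52 | 51 | 4943 | 251768 | 256711 | 98.074 | 100.000 | 99.028 | 2 | 50 | 0 |
| 2009 | 78 | 78 | 78 | 78 | 78 | 78 | 5990 | 349453 | 355443 | 98.315 | 100.000 | 99.150 | 2 | 76 | 0 |
| 2010 | 76 | 76 | 76 | 76 | 75 | 72 | 6083 | 306536 | 312619 | 98.054 | 100.000 | 99.018 | 8 | 68 | 0 |
| 2011 | 85 | 85 | 85 | 85 | 85 | 83 | 7355 | 442415 | 449770 | 98.365 | 100.000 | 99.176 | 6 | 79 | 0 |
| 2012 | 53 | 53 | 53 | 53 | 53 | 48 | 4708 | 251960 | 256668 | 98.166 | 100.000 | 99.074 | 0 | 53 | 0 |
| 2013 | 71 | 71 | 61 | 61 | 52 | 52 | 7883 | 353567 | 361450 | 97.819 | 85.915 | 91.482 | 9 | 50 | 2 |
| 2014 | 69 | 69 | 69 | 69 | 69 | 68 | 7907 | 379691 | 387598 | 97.960 | 100.000 | 98.969 | 4 | 65 | 0 |
| 2015 | 72 | 72 | 72 | 72 | 72 | 72 | 7416 | 377913 | 385329 | 98.075 | 100.000 | 99.028 | 4 | 68 | 0 |
| total | 1019 | 1018 | 998 | 998 | 831 | 915 | 96218 | 4612348 | 4708566 | 97.957 | 98.035 | 97.996 | 102 | 893 | 3 |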
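
As an illustration of how the noise and silence columns follow from the raw counts, here is a minimal Python sketch. It is not the code that produced this report; the function names `noise_score` and `silence_score` are hypothetical, and returning 0.0 for an empty denominator is an assumption. The sample figures are the 1997 counts from the table.

```python
def noise_score(known_words: int, total_words: int) -> float:
    """Percentage of content words found in the lexicon (hypothetical helper)."""
    if total_words == 0:
        return 0.0  # assumption: an empty content scores 0
    return 100.0 * known_words / total_words


def silence_score(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF documents that yielded a non-empty extraction (hypothetical helper)."""
    if pdf_papers == 0:
        return 0.0  # assumption: no PDF input scores 0
    return 100.0 * non_empty_papers / pdf_papers


# 1997 row: 14056 known words out of 14428, 3 non-empty papers out of 4 PDFs.
print(f"noise   = {noise_score(14056, 14428):.3f}")
print(f"silence = {silence_score(3, 4):.3f}")
```

Despite the column names, both values are "higher is better" scores: a high noise evaluation means few unknown words, and a high silence evaluation means few PDF documents were lost.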

Note#1: unknown words with an initial lower-case letter are sometimes genuine full-size unknown words, but often they are words that were cut because PDFBox misinterpreted multi-column PDF layouts.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is computed as the harmonic mean of the two scores: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
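
The combined evaluation of Note#4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `combined_score` is hypothetical, and returning 0.0 when both inputs are zero is an assumption, since the report does not say how that case is handled. The denominator is the sum of both scores, which is the reading that reproduces the values in the table.

```python
def combined_score(eval_noise: float, eval_silence: float) -> float:
    """Harmonic mean of the noise and silence evaluations (hypothetical helper)."""
    denominator = eval_noise + eval_silence
    if denominator == 0:
        return 0.0  # assumption: both scores at zero give a combined score of 0
    return 2.0 * eval_noise * eval_silence / denominator


# 1997 row: noise 97.422 and silence 75.000.
print(f"{combined_score(97.422, 75.000):.3f}")
# Total row: noise 97.957 and silence 98.035.
print(f"{combined_score(97.957, 98.035):.3f}")
```

Because it is a harmonic mean, the combined value is pulled toward the lower of the two scores, which is why years with many empty extractions (1997, 2013) score noticeably lower even though their noise evaluation is close to the others.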

total elapsed time (read and display included)= 2.768333333333333 minutes with 8 cores

computed on: Sun Sep 25 19:42:21 CEST 2016 from: METADONNEES_TALN_140402.txt