Estimation of the extraction quality based on the absence/presence of words in the TagParser lexicon.

Columns:
  year       year (last row: total over all years)
  meta       nb of papers from the metadata
  pdf        nb of papers in PDF
  xml        nb of papers in XML (= output of PDFBox)
  nonempty   nb of non-empty papers as extraction result
  abstract   nb of papers with an abstract (from extraction)
  refs       nb of papers with references (from extraction)
  unknown    nb of unknown words
  known      nb of known words
  words      nb of words of the content
  noise%     evaluation of noise = percentage of nb of known words / nb of words of the content
  silence%   evaluation of silence = percentage of non-empty papers as extraction result / PDF docs
  combined%  combined evaluation of noise and silence
  en         nb of English papers
  fr         nb of French papers
  other      nb of papers in another language (es+de+ru)

year   meta   pdf   xml  nonempty  abstract  refs  unknown     known     words  noise%  silence%  combined%    en  fr  other
1996     14    14    14        14         3    10     2477    102293    104770  97.636   100.000     98.804    14   0      0
1997     23    23    23        23         5    18     2797    144666    147463  98.103   100.000     99.043    23   0      0
1998     12    12    11        11         8    11     2158     69395     71553  96.984    91.667     94.250    11   0      0
1999     35    35    35        35        11    27     4194    180676    184870  97.731   100.000     98.853    35   0      0
2000     27    27    26        26         9    22     3705    136158    139863  97.351    96.296     96.821    26   0      0
2001     21    21    10        10        10    10     1092     49353     50445  97.835    47.619     64.059    10   0      0
2002     41    41    33        33        31    32     3816    181547    185363  97.941    80.488     88.361    33   0      0
2003     28    28    25        25        25    25     1724    123060    124784  98.618    89.286     93.720    25   0      0
2004     56    56    53        53        53    53     4008    283776    287784  98.607    94.643     96.584    53   0      0
2005    127   127   126       126       123   126     9624    681205    690829  98.607    99.213     98.909   126   0      0
2006     73    73    72        72        71    72     7024    403859    410883  98.291    98.630     98.460    72   0      0
2007    131   131   130       130       128   129    12664    709398    722062  98.246    99.237     98.739   130   0      0
2008    114   114   114       114       114   114    11829    679748    691577  98.290   100.000     99.137   114   0      0
2009    163   163   163       163       161   163    18006   1032556   1050562  98.286   100.000     99.136   163   0      0
2010    125   125   124       124       122   124    14139    800005    814144  98.263    99.200     98.729   124   0      0
2011    149   149   148       148       148   148    18244    990581   1008825  98.192    99.329     98.757   148   0      0
2012    139   139   139       139       137   139    16654    936413    953067  98.253   100.000     99.119   139   0      0
2013    205   205   205       205       203   205    23926   1187966   1211892  98.026   100.000     99.003   205   0      0
2014    225   225   225       225       223   224    26534   1296916   1323450  97.995   100.000     98.987   225   0      0
2015    312   312   312       312       311   312    33278   1547331   1580609  97.895   100.000     98.936   312   0      0
total  2020  2020  1988      1988      1896  1964   217893  11536902  11754795  98.146    98.416     98.281  1988   0      0
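
The noise and silence percentages can be re-derived from the raw counts of any row. The sketch below is a minimal illustration in Python, not the code that produced this report: the function names are hypothetical, and returning 0 for an empty denominator is an assumption. The figures are those of the 1998 row.

```python
def noise_score(known_words: int, content_words: int) -> float:
    """Percentage of content words found in the lexicon (higher means less noise)."""
    if content_words == 0:
        return 0.0  # assumption: an empty corpus scores 0 instead of raising
    return 100.0 * known_words / content_words


def silence_score(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF documents that yielded a non-empty extraction."""
    if pdf_papers == 0:
        return 0.0  # assumption: same convention as above
    return 100.0 * non_empty_papers / pdf_papers


# 1998 row: 69395 known words out of 71553, 11 non-empty papers out of 12 PDFs.
noise_1998 = noise_score(69395, 71553)
silence_1998 = silence_score(11, 12)
print(f"noise = {noise_1998:.3f}, silence = {silence_1998:.3f}")
```

Rounded to three decimals, these values agree with the 96.984 and 91.667 reported for 1998.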

Note#1: unknown words with an initial lower-case letter are sometimes genuine full words missing from the lexicon, but more often they are fragments of words that were cut because PDFBox misinterpreted the multi-column layout of the PDF.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may result from a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.
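
The selection rule of Note#2 can be sketched as follows. This is only an illustration in Python under assumed names: the `Paper` class, its fields and the `is_taken` function are hypothetical and are not taken from the actual pipeline. A paper is kept whenever its body is non-empty, whether or not an abstract was extracted.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical record; the field names are assumptions for illustration only.
    paper_id: str
    abstract: str
    body: str


def is_taken(paper: Paper) -> bool:
    """Keep a paper if it has some body text; a missing abstract is tolerated."""
    return bool(paper.body.strip())


papers = [
    Paper("p1", abstract="", body="Some extracted text."),  # no abstract: kept
    Paper("p2", abstract="An abstract.", body="   "),        # empty body: dropped
]
kept = [p.paper_id for p in papers if is_taken(p)]
print(kept)
```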

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is the harmonic mean of the two scores, computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
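
The combined evaluation divides 2*EvalNoise*EvalSilence by the sum EvalNoise+EvalSilence, which is the harmonic mean of the two scores. A minimal sketch in Python, assuming both scores are percentages; the function name is hypothetical, and returning 0 when both scores are 0 is an assumption:

```python
def combined_evaluation(eval_noise: float, eval_silence: float) -> float:
    """Harmonic mean of the noise and silence scores (both given as percentages)."""
    total = eval_noise + eval_silence
    if total == 0:
        return 0.0  # assumption: avoid a division by zero when both scores are 0
    return 2 * eval_noise * eval_silence / total


# 2001 row: noise 97.835 and silence 47.619.
combined_2001 = combined_evaluation(97.835, 47.619)
print(f"combined = {combined_2001:.2f}")
```

For the 2001 row this gives about 64.06, in line with the 64.059 reported in the table. Because the harmonic mean is pulled toward the lower of the two scores, a year with many empty extractions is penalised even when its noise score is high.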

total elapsed time (reading and display included) = 5.9311 minutes with 8 cores

computed on: Sun Sep 25 18:22:02 CEST 2016 from: METADONNEES_EMNLP_140410.txt