Estimation of the quality based on absence/presence of words in the TagParser lexicon.

Columns:
  year      year
  meta      nb of papers from the metadata
  pdf       nb of papers in PDF
  xml       nb of papers in XML (= output of PDFBox)
  nonempty  nb of non-empty papers as extraction result
  abstr     nb of papers with an abstract (from extraction)
  refs      nb of papers with references (from extraction)
  unknown   nb of unknown words
  known     nb of known words
  words     nb of words of the content
  noise     evaluation of noise = percentage of nb of known words / nb of words of the content
  silence   evaluation of silence = percentage of non-empty papers as extraction result / PDF docs
  combined  combined evaluation of noise and silence
  en        nb of English papers
  fr        nb of French papers
  other     nb of papers in another language (es+de+ru)

year   meta   pdf   xml  nonempty  abstr  refs  unknown    known    words   noise  silence  combined   en  fr  other
1997     16    16    16        16      9    13     2348   102049   104397  97.751  100.000    98.863   16   0      0
1999      8     8     8         8      2     8     1563    47644    49207  96.824  100.000    98.386    8   0      0
2000     36    36     0         0      0     0        0        0        0   0.000    0.000     0.000    0   0      0
2001     26    26    13        13     13    13      869    61488    62357  98.606   50.000    66.354   13   0      0
2002     35    35    19        19     13    18     1034    66669    67703  98.473   54.286    69.988   18   0      1
2003     35    35    35        35     30    35     2122   145917   148039  98.567  100.000    99.278   35   0      0
2004     24    24    24        24     19    23     1231   100324   101555  98.788  100.000    99.390   24   0      0
2005     39    39    39        39     34    39     2117   157512   159629  98.674  100.000    99.332   39   0      0
2006     38    38    37        37     32    36     2700   159297   161997  98.333   97.368    97.848   37   0      0
2007    131   131   130       130    128   129    12664   709398   722062  98.246   99.237    98.739  130   0      0
2008     40    40    40        40     35    40     3258   183196   186454  98.253  100.000    99.119   40   0      0
2009     46    46    46        46     41    45     3583   219844   223427  98.396  100.000    99.192   46   0      0
2010     50    50    50        50     46    48     4300   250715   255015  98.314  100.000    99.150   48   0      2
2011     51    51    51        51     49    49     3639   229413   233052  98.439  100.000    99.213   49   0      2
2012    139   139   139       139    137   139    16654   936413   953067  98.253  100.000    99.119  139   0      0
2013     42    42    42        42     42    42     4011   227486   231497  98.267  100.000    99.126   42   0      0
2014     33    33    33        33     32    33     3057   174320   177377  98.277  100.000    99.131   33   0      0
2015     53    53    53        53     51    53     4925   263006   267931  98.162  100.000    99.072   53   0      0
total   842   842   775       775    713   763    70075  4034691  4104766  98.293   92.043    95.065  770   0      5

Note#1: unknown words with an initial lower-case letter are sometimes genuine whole words missing from the lexicon, but often they are fragments of words that were cut because PDFBox misinterpreted the multi-column layout of the PDF.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is the harmonic mean of the two scores, computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
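
As an illustration only, the three scores can be recomputed from the raw counts with a short Python sketch. The function names `percentage` and `combined` are hypothetical and are not taken from the tool that produced this report. Returning 0 when a denominator is 0 is an assumption, chosen because it matches the all-zero row for 2000. The example uses the counts of the 2001 row.

```python
def percentage(numerator: int, denominator: int) -> float:
    """Return 100 * numerator / denominator.

    Assumption: an empty denominator yields 0.0, as in the 2000 row.
    """
    return 100.0 * numerator / denominator if denominator else 0.0


def combined(eval_noise: float, eval_silence: float) -> float:
    """Harmonic mean of the two scores, as given in Note#4.

    Assumption: the result is 0.0 when both scores are 0.
    """
    total = eval_noise + eval_silence
    return 2.0 * eval_noise * eval_silence / total if total else 0.0


# Counts of the 2001 row: 61488 known words out of 62357 words,
# 13 non-empty papers out of 26 PDF documents.
noise = percentage(61488, 62357)
silence = percentage(13, 26)
print(f"{noise:.3f} {silence:.3f} {combined(noise, silence):.3f}")
# prints: 98.606 50.000 66.354
```

The harmonic mean stays low as soon as one of the two scores is low, which is why 2001 and 2002 have a combined evaluation under 70 despite a noise evaluation above 98.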

total elapsed time (read and display included)= 2.0761333333333334 minutes with 8 cores

computed on: Sun Sep 25 18:09:30 CEST 2016 from: METADONNEES_CONLL_140407.txt