Estimate of the extraction quality, based on the presence or absence of words in the TagParser lexicon.

Columns:
year      = year
meta      = nb of papers from the metadata
PDF       = nb of papers in PDF
XML       = nb of papers in XML (= output of PDFBox)
nonempty  = nb of non empty papers as extraction result
abstract  = nb of papers with an abstract (from extraction)
refs      = nb of papers with references (from extraction)
unknown   = nb of unknown words
known     = nb of known words
words     = nb of words of the content
noise     = evaluation of noise = percentage of nb of known words / nb of words of the content
silence   = evaluation of silence = percentage of non empty papers as extraction result / PDF docs
combined  = combined evaluation of noise and silence
en        = nb of English papers
fr        = nb of French papers
other     = nb of papers in another language (es+de+ru)

year    meta    PDF    XML  nonempty  abstract   refs  unknown     known     words    noise  silence  combined     en  fr  other
1987     252    252    250       250       159    190     7599    419933    427532   98.223   99.206    98.712    245   0      5
1989     360    360    360       360       257    304    16136    951674    967810   98.333  100.000    99.159    360   0      0
1990     349    349    349       349       303    315    17633    998509   1016142   98.265  100.000    99.125    349   0      0
1991     335    335    335       335       301    318    15764    972030    987794   98.404  100.000    99.196    335   0      0
1992     413    413    412       412       370    387    22751   1305298   1328049   98.287   99.758    99.017    412   0      0
1993     527    527    526       526       510    498    22476   1491057   1513533   98.515   99.810    99.158    526   0      0
1994     560    560    559       559       509    526    25561   1650082   1675643   98.475   99.821    99.143    559   0      0
1995     518    518    514       514       474    493    21807   1400293   1422100   98.467   99.228    98.846    514   0      0
1996     635    635    628       628       589    605    28787   1724620   1753407   98.358   98.898    98.627    627   0      1
1997     722    722    715       715       669    692    29247   1927472   1956719   98.505   99.030    98.767    715   0      0
1998     850    850    836       836       784    822    34821   2375046   2409867   98.555   98.353    98.454    836   0      0
1999     722    722    686       686       651    666    32326   1896937   1929263   98.324   95.014    96.641    686   0      0
2000     922    922    879       879       811    829    37771   2487327   2525098   98.504   95.336    96.894    879   0      0
2001     672    672    653       653       607    613    27886   1984101   2011987   98.614   97.173    97.888    653   0      0
2002     674    674    672       672       649    630    72559   2078547   2151106   96.627   99.703    98.141    672   0      0
2003     798    798    797       797       783    762    87765   2534563   2622328   96.653   99.875    98.238    797   0      0
2004     775    775    772       772       755    760    33555   2335583   2369138   98.584   99.613    99.096    772   0      0
2005     869    869    869       869       858    835    97872   2864680   2962552   96.696  100.000    98.320    869   0      0
2006     659    659    659       659       659    632    61783   2161072   2222855   97.221  100.000    98.591    659   0      0
2007     751    751    747       747       746    746    36886   2498362   2535248   98.545   99.467    99.004    747   0      0
2008     762    762    760       760       755    753    36407   2400777   2437184   98.506   99.738    99.118    760   0      0
2009     765    765    765       765       765    765    39100   2638698   2677798   98.540  100.000    99.265    765   0      0
2010     781    781    781       781       780    779    39309   2699078   2738387   98.565  100.000    99.277    781   0      0
2011     846    846    846       846       845    846    45264   2923421   2968685   98.475  100.000    99.232    846   0      0
2012     679    679    678       678       676    677    36500   2215552   2252052   98.379   99.853    99.111    678   0      0
2013     756    756    756       756       756    756    46488   2809819   2856307   98.372  100.000    99.180    756   0      0
2014     634    634    634       634       629    631    38986   2334683   2373669   98.358  100.000    99.172    634   0      0
2015     777    777    777       777       775    776    50303   2867558   2917861   98.276  100.000    99.131    777   0      0
total  18363  18363  18215     18215     17425  17606  1063342  56946772  58010114   98.167   99.194    98.678  18209   0      6

Note#1: unknown words with an initial lower-case letter are sometimes genuine unknown words, but often they are fragments of words that were cut because PDFBox misinterpreted multi-column PDF layouts.

Note#2: a paper without any content (i.e. without any body) is not included in the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is included.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is computed as the harmonic mean of the two scores: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
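
As an illustration, the three scores can be recomputed from the raw counts of a single row. The sketch below is a minimal one, not the code that produced this report: the function names are hypothetical, and it assumes that the combined score divides by the sum (EvalNoise+EvalSilence), i.e. that it is a harmonic mean, which is the reading that matches the table values. The counts used are those of the 1987 row.

```python
def noise_score(known_words: int, total_words: int) -> float:
    """Evaluation of noise: percentage of content words known to the lexicon."""
    return 100.0 * known_words / total_words


def silence_score(non_empty_papers: int, pdf_papers: int) -> float:
    """Evaluation of silence: percentage of PDF papers with a non-empty extraction."""
    return 100.0 * non_empty_papers / pdf_papers


def combined_score(noise: float, silence: float) -> float:
    """Combined evaluation, assumed to be the harmonic mean of the two scores."""
    return 2.0 * noise * silence / (noise + silence)


# Counts taken from the 1987 row of the table.
noise = noise_score(known_words=419933, total_words=427532)
silence = silence_score(non_empty_papers=250, pdf_papers=252)
combined = combined_score(noise, silence)

print(f"noise={noise:.3f} silence={silence:.3f} combined={combined:.3f}")
```

To three decimals, the results should agree with the 1987 row (98.223, 99.206, 98.712), up to the rounding used by the original program.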

total elapsed time (read and display included)= 29.562483333333333 minutes with 8 cores

computed on: Sun Sep 25 19:12:47 CEST 2016 from: METADONNEES_ISCA_140509.txt