Estimation of the quality based on absence/presence of words in the TagParser lexicon.

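| Year | Papers in metadata | Papers in PDF | Papers in XML (= output of PDFBox) | Non-empty papers after extraction | Papers with an abstract (from extraction) | Papers with references (from extraction) | Unknown words | Known words | Words in content | Noise evaluation (%) = known words / words in content | Silence evaluation (%) = non-empty papers after extraction / PDF docs | Combined evaluation of noise and silence (%) | English papers | French papers | Papers in another language (es+de+ru) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1966 | 7 | 7 | 7 | 7 | 0 | 1 | 299 | 23051 | 23350 | 98.719 | 100.000 | 99.356 | 7 | 0 | 0 |
| 1967 | 17 | 17 | 17 | 17 | 0 | 3 | 713 | 41732 | 42445 | 98.320 | 100.000 | 99.153 | 17 | 0 | 0 |
| 1968 | 17 | 17 | 15 | 15 | 0 | 2 | 852 | 48489 | 49341 | 98.273 | 88.235 | 92.984 | 15 | 0 | 0 |
| 1969 | 24 | 24 | 24 | 24 | 1 | 3 | 1055 | 81388 | 82443 | 98.720 | 100.000 | 99.356 | 24 | 0 | 0 |
| 1970 | 18 | 18 | 18 | 18 | 0 | 8 | 1429 | 61749 | 63178 | 97.738 | 100.000 | 98.856 | 17 | 0 | 1 |
| 1971 | 20 | 20 | 20 | 20 | 0 | 9 | 1496 | 68196 | 69692 | 97.853 | 100.000 | 98.915 | 20 | 0 | 0 |
| 1972 | 19 | 19 | 19 | 19 | 0 | 9 | 1838 | 89971 | 91809 | 97.998 | 100.000 | 98.989 | 19 | 0 | 0 |
| 1973 | 18 | 18 | 18 | 18 | 0 | 8 | 3255 | 89410 | 92665 | 96.487 | 100.000 | 98.212 | 18 | 0 | 0 |
| 1974 | 25 | 25 | 23 | 23 | 0 | 11 | 2663 | 122048 | 124711 | 97.865 | 92.000 | 94.842 | 23 | 0 | 0 |
| 1975 | 21 | 21 | 21 | 21 | 0 | 11 | 1292 | 100792 | 102084 | 98.734 | 100.000 | 99.363 | 21 | 0 | 0 |
| 1976 | 25 | 25 | 24 | 24 | 0 | 11 | 2469 | 126046 | 128515 | 98.079 | 96.000 | 97.028 | 24 | 0 | 0 |
| 1977 | 26 | 26 | 26 | 26 | 0 | 14 | 5013 | 157860 | 162873 | 96.922 | 100.000 | 98.437 | 25 | 1 | 0 |
| 1978 | 32 | 32 | 30 | 30 | 1 | 6 | 5872 | 159514 | 165386 | 96.450 | 93.750 | 95.081 | 28 | 2 | 0 |
| 1979 | 14 | 14 | 13 | 13 | 0 | 5 | 1472 | 81376 | 82848 | 98.223 | 92.857 | 95.465 | 13 | 0 | 0 |
| 1980 | 19 | 19 | 17 | 17 | 0 | 9 | 2373 | 95776 | 98149 | 97.582 | 89.474 | 93.352 | 17 | 0 | 0 |
| 1981 | 15 | 15 | 15 | 15 | 0 | 6 | 3954 | 122551 | 126505 | 96.874 | 100.000 | 98.412 | 15 | 0 | 0 |
| 1982 | 17 | 17 | 17 | 17 | 0 | 4 | 3552 | 109367 | 112919 | 96.854 | 100.000 | 98.402 | 15 | 2 | 0 |
| 1983 | 12 | 12 | 12 | 12 | 0 | 6 | 2949 | 90575 | 93524 | 96.847 | 100.000 | 98.398 | 11 | 1 | 0 |
| 1984 | 22 | 22 | 21 | 21 | 0 | 9 | 2749 | 118987 | 121736 | 97.742 | 95.455 | 96.585 | 21 | 0 | 0 |
| 1985 | 24 | 24 | 21 | 21 | 0 | 15 | 1858 | 105432 | 107290 | 98.268 | 87.500 | 92.572 | 21 | 0 | 0 |
| 1986 | 37 | 37 | 28 | 28 | 0 | 13 | 2743 | 78956 | 81699 | 96.643 | 75.676 | 84.884 | 22 | 6 | 0 |
| 1987 | 14 | 14 | 14 | 14 | 0 | 8 | 1261 | 85789 | 87050 | 98.551 | 100.000 | 99.270 | 14 | 0 | 0 |
| 1988 | 22 | 22 | 22 | 22 | 21 | 17 | 3237 | 144852 | 148089 | 97.814 | 100.000 | 98.895 | 21 | 1 | 0 |
| 1989 | 34 | 34 | 34 | 34 | 33 | 20 | 5529 | 210901 | 216430 | 97.445 | 100.000 | 98.706 | 34 | 0 | 0 |
| 1990 | 34 | 34 | 33 | 33 | 30 | 13 | 5967 | 214406 | 220373 | 97.292 | 97.059 | 97.175 | 33 | 0 | 0 |
| 1991 | 29 | 29 | 29 | 29 | 29 | 17 | 4412 | 215435 | 219847 | 97.993 | 100.000 | 98.986 | 29 | 0 | 0 |
| 1992 | 29 | 29 | 29 | 29 | 27 | 21 | 4889 | 227507 | 232396 | 97.896 | 100.000 | 98.937 | 29 | 0 | 0 |
| 1993 | 35 | 35 | 35 | 35 | 27 | 26 | 4607 | 188856 | 193463 | 97.619 | 100.000 | 98.795 | 35 | 0 | 0 |
| 1994 | 31 | 31 | 31 | 31 | 8 | 20 | 4539 | 213410 | 217949 | 97.917 | 100.000 | 98.948 | 31 | 0 | 0 |
| 1995 | 30 | 30 | 30 | 30 | 0 | 23 | 4880 | 259187 | 264067 | 98.152 | 100.000 | 99.067 | 30 | 0 | 0 |
| 1996 | 34 | 34 | 34 | 34 | 0 | 28 | 4838 | 252965 | 257803 | 98.123 | 100.000 | 99.053 | 33 | 0 | 1 |
| 1997 | 27 | 27 | 27 | 27 | 4 | 26 | 3555 | 192171 | 195726 | 98.184 | 100.000 | 99.084 | 27 | 0 | 0 |
| 1998 | 23 | 23 | 23 | 23 | 1 | 18 | 2862 | 204604 | 207466 | 98.620 | 100.000 | 99.305 | 23 | 0 | 0 |
| 1999 | 29 | 29 | 29 | 29 | 24 | 21 | 3073 | 163164 | 166237 | 98.151 | 100.000 | 99.067 | 29 | 0 | 0 |
| 2000 | 29 | 29 | 29 | 29 | 21 | 27 | 1462 | 128599 | 130061 | 98.876 | 100.000 | 99.435 | 29 | 0 | 0 |
| 2001 | 24 | 24 | 23 | 23 | 21 | 23 | 2062 | 150850 | 152912 | 98.652 | 95.833 | 97.222 | 23 | 0 | 0 |
| 2002 | 24 | 24 | 24 | 24 | 17 | 19 | 3118 | 179814 | 182932 | 98.296 | 100.000 | 99.140 | 24 | 0 | 0 |
| 2003 | 27 | 27 | 27 | 27 | 22 | 24 | 2197 | 174169 | 176366 | 98.754 | 100.000 | 99.373 | 27 | 0 | 0 |
| 2004 | 23 | 23 | 23 | 23 | 22 | 23 | 3209 | 173753 | 176962 | 98.187 | 100.000 | 99.085 | 23 | 0 | 0 |
| total | 927 | 927 | 902 | 902 | 309 | 537 | 115593 | 5353698 | 5469291 | 97.887 | 97.303 | 97.594 | 887 | 13 | 2 |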

Note#1: unknown words with an initial lower-case letter are sometimes genuinely unknown whole words, but often they are words that were cut because PDFBox misinterpreted multi-column PDF layouts.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is computed as the harmonic mean of the two scores: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
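
As an illustration, the three scores can be recomputed from the raw counts of a single row. The Python sketch below is not the code that produced this report; the function names are hypothetical. It reads the denominator of the Note#4 formula as the sum EvalNoise+EvalSilence, and uses the 1968 counts (48489 known words out of 49341 content words, 15 non-empty papers out of 17 PDF documents).

```python
def noise_eval(known_words: int, content_words: int) -> float:
    """Percentage of content words that are known to the lexicon."""
    return 100.0 * known_words / content_words


def silence_eval(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF documents that yield a non-empty extraction."""
    return 100.0 * non_empty_papers / pdf_papers


def combined_eval(noise: float, silence: float) -> float:
    """Harmonic mean of the two scores (Note#4, denominator read as a sum)."""
    return 2.0 * noise * silence / (noise + silence)


# Counts taken from the 1968 row of the table.
noise = noise_eval(48489, 49341)
silence = silence_eval(15, 17)
combined = combined_eval(noise, silence)

print(f"noise={noise:.3f} silence={silence:.3f} combined={combined:.3f}")
```

Rounded to three decimals, these values agree with the 1968 row (98.273, 88.235, 92.984), which supports reading the formula as a harmonic mean.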

total elapsed time (read and display included)= 3.4494 minutes with 8 cores

computed on: Sun Sep 25 17:53:35 CEST 2016 from: METADONNEES_CATH_150112.txt