Estimation of the quality based on the absence or presence of words in the TagParser lexicon.

Columns:
  year     = year
  meta     = nb of papers from the metadata
  pdf      = nb of papers in PDF
  xml      = nb of papers in XML (= output of PDFBox)
  nonempty = nb of non-empty papers as extraction result
  abstract = nb of papers with an abstract (from extraction)
  refs     = nb of papers with references (from extraction)
  unknown  = nb of unknown words
  known    = nb of known words
  words    = nb of words of the content
  noise    = evaluation of noise = percentage of nb of known words / nb of words of the content
  silence  = evaluation of silence = percentage of non-empty papers as extraction result / PDF docs
  combined = combined evaluation of noise and silence
  en       = nb of English papers
  fr       = nb of French papers
  other    = nb of papers in another language (es+de+ru)

year   meta  pdf  xml  nonempty  abstract  refs  unknown    known    words   noise  silence  combined   en  fr  other
1980     13   13   13        13         0     7     3807   117702   121509  96.867  100.000    98.409   13   0      0
1981     17   17   17        17         0     7     5470   142809   148279  96.311  100.000    98.121   17   0      0
1982     14   14   14        14         0     6     4066   102144   106210  96.172  100.000    98.049   14   0      0
1983     11   11   11        11         0     7     4511   108658   113169  96.014  100.000    97.966   11   0      0
1984      9    9    9         9         0     8     1665    90899    92564  98.201  100.000    99.092    9   0      0
1985     12   12   12        12         0    12     2351   115416   117767  98.004  100.000    98.992   12   0      0
1986     17   17   17        17         0    11     2252   144337   146589  98.464  100.000    99.226   17   0      0
1987     16   16   16        16         0    16     3362   158750   162112  97.926  100.000    98.952   16   0      0
1988     28   28   28        28         0    26     3824   238881   242705  98.424  100.000    99.206   28   0      0
1989     23   23   23        23         0    21     3913   192428   196341  98.007  100.000    98.993   23   0      0
1990     21   21   21        21         0    20     4003   161638   165641  97.583  100.000    98.777   21   0      0
1991     25   25   25        25         0    23     3320   192467   195787  98.304  100.000    99.145   25   0      0
1992     25   25   25        25         0    24     4106   208209   212315  98.066  100.000    99.024   25   0      0
1993     26   26   26        26         1    26     6194   301308   307502  97.986  100.000    98.983   26   0      0
1994     29   29   29        29         0    24     7291   293934   301225  97.580  100.000    98.775   27   0      2
1995     21   21   21        21         0    21     6221   251334   257555  97.585  100.000    98.778   21   0      0
1996     21   21   21        21         0    20     5372   209572   214944  97.501  100.000    98.735   21   0      0
1997     23   23   23        23         0    22     6021   286443   292464  97.941  100.000    98.960   23   0      0
1998     23   23   23        23         0    22     6815   296585   303400  97.754  100.000    98.864   23   0      0
1999     21   21   21        21         0    20     8164   300269   308433  97.353  100.000    98.659   21   0      0
2000     21   21   21        21         0    21     6626   280528   287154  97.693  100.000    98.833   21   0      0
2001     21   21   21        21         0    21     8017   271166   279183  97.128  100.000    98.543   21   0      0
2002     21   21   21        21         1    20     4621   228532   233153  98.018  100.000    98.999   21   0      0
2003     21   21   21        21         0    21     5024   309756   314780  98.404  100.000    99.196   21   0      0
2004     17   17   17        17         0    17     4152   214244   218396  98.099  100.000    99.040   17   0      0
2005     22   22   22        22         0    21     5006   254560   259566  98.071  100.000    99.026   22   0      0
2006     19   19   19        19         0    19     4343   237848   242191  98.207  100.000    99.095   19   0      0
2007     24   24   24        24         0    22     5393   246743   252136  97.861  100.000    98.919   23   0      1
2008     22   22   22        22         0    21     4780   279781   284561  98.320  100.000    99.153   22   0      0
2009     23   23   23        23         0    23     5498   293849   299347  98.163  100.000    99.073   23   0      0
2010     33   33   33        33         0    30     5860   368521   374381  98.435  100.000    99.211   33   0      0
2011     28   28   28        28         0    28     5975   376717   382692  98.439  100.000    99.213   28   0      0
2012     25   25   25        25         0    25     8168   438475   446643  98.171  100.000    99.077   25   0      0
2013     30   30   30        30         0    30    10099   523905   534004  98.109  100.000    99.045   30   0      0
2014     29   29   29        29         0    29     8531   443910   452441  98.114  100.000    99.048   29   0      0
total   751  751  751       751         2   691   184821  8682318  8867139  97.916  100.000    98.947  748   0      3

Note#1: unknown words that begin with a lower-case letter are sometimes genuine full-size words missing from the lexicon, but are often word fragments that were cut because PDFBox misinterpreted a multi-column PDF layout.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is computed as the harmonic mean of the two scores: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
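
The three evaluation scores can be recomputed from the raw counts. The following is a minimal sketch in Python, not the code of the original pipeline: the function names are hypothetical, and it assumes that the scores are percentages computed exactly as the column definitions and Note#4 describe. The input values are the counts reported for 1980.

```python
def noise_score(known_words: int, total_words: int) -> float:
    """Percentage of content words that are known to the lexicon."""
    return 100.0 * known_words / total_words


def silence_score(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF documents that gave a non-empty extraction."""
    return 100.0 * non_empty_papers / pdf_papers


def combined_score(noise: float, silence: float) -> float:
    """Harmonic mean of the noise and silence scores (Note#4)."""
    return 2.0 * noise * silence / (noise + silence)


# Counts reported for 1980: 117702 known words out of 121509,
# and 13 non-empty papers out of 13 PDF documents.
noise = noise_score(117702, 121509)
silence = silence_score(13, 13)
combined = combined_score(noise, silence)

print(f"{noise:.3f} {silence:.3f} {combined:.3f}")  # prints "96.867 100.000 98.409"
```

The parentheses around the sum matter: without them, the expression would divide by EvalNoise only and then add EvalSilence, which does not reproduce the reported values.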

total elapsed time (read and display included)= 4.610716666666666 minutes with 8 cores

computed on: Sun Sep 25 17:58:12 CEST 2016 from: METADONNEES_CL_140523.txt