Estimate of extraction quality, based on the presence or absence of words in the TagParser lexicon.

year	nb of papers from the metadata	nb of papers in PDF	nb of papers in XML (= output of PDFBox)	nb of non-empty papers as extraction result	nb of papers with an abstract (from extraction)	nb of papers with references (from extraction)	nb of unknown words	nb of known words	nb of words of the content	evaluation of noise = percentage of nb of known words / nb of words of the content	evaluation of silence = percentage of non-empty papers as extraction result / PDF docs	combined evaluation of noise and silence	nb of English papers	nb of French papers	nb of papers in another language (es+de+ru)
2002	96	96	96	96	85	86	6391	279006	285397	97.761	100.000	98.868	5	91	0
2004	120	120	120	120	112	101	8235	365315	373550	97.795	100.000	98.885	2	116	2
2008	104	104	102	102	101	66	12035	328749	340784	96.468	98.077	97.266	0	102	0
2012	108	108	108	108	107	85	7486	369330	376816	98.013	100.000	98.997	2	106	0
2014	78	78	78	78	78	74	5905	292258	298163	98.020	100.000	99.000	0	78	0
total	506	506	504	504	483	412	40052	1634658	1674710	97.608	99.605	98.596	9	493	2
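
The noise and silence columns follow directly from the counts in the same row. The sketch below is a minimal illustration, not the code that produced this report: the function names are hypothetical, and the counts are copied from the 2008 row of the table.

```python
def percentage(part, whole):
    """Return part/whole as a percentage; reject an empty denominator."""
    if whole == 0:
        raise ValueError("denominator must be non-zero")
    return 100.0 * part / whole


def noise_score(known_words, total_words):
    # Hypothetical helper: share of content words found in the lexicon.
    return percentage(known_words, total_words)


def silence_score(non_empty_papers, pdf_papers):
    # Hypothetical helper: share of PDF papers that yielded non-empty content.
    return percentage(non_empty_papers, pdf_papers)


# Counts taken from the 2008 row.
noise_2008 = noise_score(328749, 340784)
silence_2008 = silence_score(102, 104)
print(f"{noise_2008:.3f}\t{silence_2008:.3f}")
```

With these inputs the two values agree with the 2008 row to three decimals (96.468 and 98.077).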

Note#1: unknown words starting with a lower-case letter are sometimes genuine unknown words, but often they are fragments of words that were cut because PDFBox misinterpreted the multi-column PDF layout.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is the harmonic mean of the two scores, computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
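
As a quick check of Note#4, the following sketch recomputes the combined evaluation from the noise and silence percentages of the total row. It is an illustrative assumption of how the value can be obtained, not the original implementation, and `combined_score` is a hypothetical name. The denominator is the sum of the two scores taken as a whole.

```python
def combined_score(eval_noise, eval_silence):
    """Harmonic mean of the noise and silence percentages."""
    denominator = eval_noise + eval_silence
    if denominator == 0:
        # Both scores are zero: define the combined score as zero.
        return 0.0
    return 2 * eval_noise * eval_silence / denominator


# Percentages taken from the total row of the table.
combined_total = combined_score(97.608, 99.605)
print(f"{combined_total:.3f}")
```

The result agrees with the combined value reported in the total row (98.596) to three decimals. Because the harmonic mean is pulled toward the lower of the two scores, a corpus with poor extraction coverage cannot be compensated by a high known-word ratio alone.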

total elapsed time (read and display included)= 1.0761833333333333 minutes with 8 cores

computed on: Sun Sep 25 19:13:52 CEST 2016 from: METADONNEES_JEP_230614.txt