Estimation of the quality based on absence/presence of words in the TagParser lexicon.

year	nb of papers from the metadata	nb of papers in PDF	nb of papers in XML (= output of PDFBox)	nb of non-empty papers as extraction result	nb of papers with an abstract (from extraction)	nb of papers with references (from extraction)	nb of unknown words	nb of known words	nb of words of the content	evaluation of noise = percentage of nb of known words / nb of words of the content	evaluation of silence = percentage of nb of non-empty papers as extraction result / nb of papers in PDF	combined evaluation of noise and silence	nb of English papers	nb of French papers	nb of papers in another language (es+de+ru)
2004	1	1	1	1	0	1	166	8361	8527	98.053	100.000	99.017	1	0	0
2005	5	5	5	5	0	5	819	56537	57356	98.572	100.000	99.281	5	0	0
2006	7	7	7	7	0	7	1247	80964	82211	98.483	100.000	99.236	7	0	0
2007	12	12	12	12	0	12	2775	146159	148934	98.137	100.000	99.060	12	0	0
2008	3	3	3	3	0	3	1043	38501	39544	97.362	100.000	98.664	3	0	0
2009	2	2	2	2	0	2	284	24987	25271	98.876	100.000	99.435	2	0	0
2010	3	3	3	3	0	3	636	29316	29952	97.877	100.000	98.927	3	0	0
2011	20	20	20	20	0	20	3803	224326	228129	98.333	100.000	99.159	20	0	0
2012	8	8	8	8	0	8	1853	103243	105096	98.237	100.000	99.111	8	0	0
2013	21	21	21	21	0	21	4995	302824	307819	98.377	100.000	99.182	21	0	0
total	82	82	82	82	0	82	17621	1015218	1032839	98.294	100.000	99.140	82	0	0

Note#1: the unknown words with an initial lower-case letter are sometimes genuine full-length unknown words, but often they are words which have been cut due to bad interpretation of PDF multi-column layouts by PDFBox.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content holds an entry in the metadata, and normally, each paper has an entry in the metadata.

Note#4: the combined evaluation is computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
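
The three evaluation columns can be recomputed from the counts in the table. The following is a minimal sketch in Python, not the code that produced this report: the function names are hypothetical, and the formulas are taken from the column headers and from Note#4, with the denominator of the combined evaluation read as the sum EvalNoise+EvalSilence (i.e. a harmonic mean), which is the reading that reproduces the figures above.

```python
def noise_score(known_words: int, total_words: int) -> float:
    """Percentage of content words that are known to the lexicon."""
    return 100.0 * known_words / total_words


def silence_score(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF papers that yield a non-empty extraction."""
    return 100.0 * non_empty_papers / pdf_papers


def combined_score(noise: float, silence: float) -> float:
    """Harmonic mean of the two scores (assumed reading of Note#4)."""
    if noise + silence == 0:
        return 0.0  # avoid dividing by zero when both scores are zero
    return 2 * noise * silence / (noise + silence)


# Counts copied from the "total" row of the table.
noise = noise_score(known_words=1015218, total_words=1032839)
silence = silence_score(non_empty_papers=82, pdf_papers=82)
combined = combined_score(noise, silence)

# Print the three values tab-separated, with three decimals as in the table.
print(f"{noise:.3f}\t{silence:.3f}\t{combined:.3f}")
```

With the counts of the "total" row, the three values round to 98.294, 100.000 and 99.140, matching the table.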

total elapsed time (read and display included) = 0.5135 minutes with 8 cores

computed on: Sun Sep 25 17:48:48 CEST 2016 from: METADONNEES_ACMTSLP_150730.txt