Estimate of extraction quality, based on the presence or absence of words in the TagParser lexicon.

year	nb of papers from the metadata	nb of papers in PDF	nb of papers in XML (= output of PDFBox)	nb of non-empty papers as extraction result	nb of papers with an abstract (from extraction)	nb of papers with references (from extraction)	nb of unknown words	nb of known words	nb of words of the content	evaluation of noise = percentage of nb of known words / nb of words of the content	evaluation of silence = percentage of non-empty papers as extraction result / PDF docs	combined evaluation of noise and silence	nb of English papers	nb of French papers	nb of papers in another language (es+de+ru)
1983	33	33	29	29	10	17	6315	157073	163388	96.135	87.879	91.822	28	0	1
1988	34	34	32	32	19	28	3927	164755	168682	97.672	94.118	95.862	32	0	0
1992	45	45	44	44	10	26	4963	190261	195224	97.458	97.778	97.618	44	0	0
1994	46	46	46	46	21	33	3664	159923	163587	97.760	100.000	98.867	46	0	0
1997	74	74	74	74	10	47	7401	293470	300871	97.540	100.000	98.755	72	0	2
2000	46	46	46	46	20	39	4442	215134	219576	97.977	100.000	98.978	46	0	0
total	278	278	271	271	90	190	30712	1180616	1211328	97.465	97.482	97.473	268	0	3

Note#1: unknown words with an initial lower-case letter are sometimes genuine full words missing from the lexicon, but are often fragments of words that were cut because PDFBox misinterpreted a multi-column PDF layout.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is the harmonic mean of the two scores, computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
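
The three evaluation columns can be recomputed from the raw counts of any row. The following is a minimal Python sketch, not the code of the pipeline that produced this report; the function names are hypothetical, and the formulas are the ones given in the column headers and in Note#4. It uses the counts of the 1983 row.

```python
def noise_score(known_words: int, total_words: int) -> float:
    """Percentage of content words that are known to the lexicon."""
    return 100.0 * known_words / total_words


def silence_score(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF documents that yielded a non-empty paper."""
    return 100.0 * non_empty_papers / pdf_papers


def combined_score(noise: float, silence: float) -> float:
    """Harmonic mean of the noise and silence scores (Note#4)."""
    if noise + silence == 0:
        return 0.0  # assumption: avoid a division by zero when both scores are 0
    return 2 * noise * silence / (noise + silence)


# Counts taken from the 1983 row of the table.
noise = noise_score(known_words=157073, total_words=163388)
silence = silence_score(non_empty_papers=29, pdf_papers=33)
combined = combined_score(noise, silence)

print(f"{noise:.3f}\t{silence:.3f}\t{combined:.3f}")
```

The parentheses around the denominator matter: without them, the usual operator precedence would divide by EvalNoise alone and then add EvalSilence, which does not match the values in the table.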

total elapsed time (read and display included)= 0.7241166666666667 minutes with 8 cores

computed on: Sun Sep 25 17:50:08 CEST 2016 from: METADONNEES_ANLP_140426.txt