Quality estimate based on the presence or absence of words in the TagParser lexicon.

year	nb of papers from the metadata	nb of papers in PDF	nb of papers in XML (= output of PDFBox)	nb of non-empty papers as extraction result	nb of papers with an abstract (from extraction)	nb of papers with references (from extraction)	nb of unknown words	nb of known words	nb of words of the content	evaluation of noise = percentage of nb of known words / nb of words of the content	evaluation of silence = percentage of non-empty papers as extraction result / PDF docs	combined evaluation of noise and silence	nb of English papers	nb of French papers	nb of papers in another language (es+de+ru)
1991	36	36	36	36	1	21	2575	130164	132739	98.060	100.000	99.021	36	0	0
1992	40	40	40	40	5	29	3543	133190	136733	97.409	100.000	98.687	40	0	0
1993	31	31	31	31	2	18	4196	147842	152038	97.240	100.000	98.601	30	0	1
1995	20	20	20	20	0	16	3922	130499	134421	97.082	100.000	98.520	20	0	0
1998	22	22	22	22	3	18	2609	101786	104395	97.501	100.000	98.735	22	0	0
total	149	149	149	149	11	102	16845	643481	660326	97.449	100.000	98.708	148	0	1

Note#1: unknown words with an initial lower-case letter are sometimes genuine full-size unknown words, but are often fragments of words that were cut because PDFBox misinterpreted multi-column PDF layouts.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, each paper has an entry in the metadata.

Note#4: the combined evaluation is the harmonic mean of the two scores, computed as: 2 * EvalNoise * EvalSilence / (EvalNoise + EvalSilence)
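
As an illustration, the three evaluation columns can be recomputed from the raw counts of a row. The following is only a minimal sketch in Python, not the code that produced this report; the function name `evaluate` and its parameter names are hypothetical, and the formulas are taken from the column headers and from Note#4.

```python
def evaluate(known_words, content_words, non_empty_papers, pdf_papers):
    """Recompute the noise, silence and combined scores of one table row.

    Hypothetical helper (not part of the original pipeline); all three
    scores are percentages, as in the table above.
    """
    # Noise score: share of content words found in the lexicon.
    noise = 100.0 * known_words / content_words
    # Silence score: share of PDF papers with a non-empty extraction.
    silence = 100.0 * non_empty_papers / pdf_papers
    # Combined score: harmonic mean of the two scores (Note#4).
    combined = 2 * noise * silence / (noise + silence)
    return noise, silence, combined


# Counts taken from the 1991 row of the table.
noise, silence, combined = evaluate(130164, 132739, 36, 36)
print(f"{noise:.3f}\t{silence:.3f}\t{combined:.3f}")
```

With the counts of the 1991 row, the printed values should agree with the table (98.060, 100.000, 99.021). The sketch assumes non-zero word and paper counts; a year with no PDF papers or no extracted words would need a guard against division by zero.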

total elapsed time (read and display included)= 0.36013333333333336 minutes with 8 cores

computed on: Sun Sep 25 19:27:54 CEST 2016 from: METADONNEES_MUC_140424.txt