Estimation of the quality based on absence/presence of words in the TagParser lexicon.

Columns:
  year        year
  meta        nb of papers from the metadata
  PDF         nb of papers in PDF
  XML         nb of papers in XML (= output of PDFBox)
  non-empty   nb of non-empty papers as extraction result
  abstract    nb of papers with an abstract (from extraction)
  refs        nb of papers with references (from extraction)
  unknown     nb of unknown words
  known       nb of known words
  words       nb of words of the content
  noise %     evaluation of noise = percentage: nb of known words / nb of words of the content
  silence %   evaluation of silence = percentage: nb of non-empty papers as extraction result / nb of PDF docs
  combined %  combined evaluation of noise and silence
  en          nb of English papers
  fr          nb of French papers
  other       nb of papers in another language (es+de+ru)

year   meta  PDF  XML  non-empty  abstract  refs  unknown    known    words  noise %  silence %  combined %   en  fr  other
1982      7    7    7          7         7     7     1715    59069    60784   97.179    100.000      98.569    7   0      0
1983      7    7    7          7         7     1     1770    70365    72135   97.546    100.000      98.758    7   0      0
1984      7    7    7          7         7     7     1406    62767    64173   97.809    100.000      98.892    7   0      0
1985      5    5    5          5         5     5     1029    42145    43174   97.617    100.000      98.794    5   0      0
1986      7    7    7          7         7     7     1510    62718    64228   97.649    100.000      98.811    7   0      0
1987     10   10   10         10        10     9      857    45293    46150   98.143    100.000      99.063   10   0      0
1988     11   11   11         11         8    10     1046    46344    47390   97.793    100.000      98.884   11   0      0
1989      7    7    7          7         7     7     1135    41039    42174   97.309    100.000      98.636    7   0      0
1990     16   16   16         16        16    14     1789   109694   111483   98.395    100.000      99.191   16   0      0
1991     11   11   11         11        11    10      640    50390    51030   98.746    100.000      99.369   11   0      0
1992      5    5    5          5         5     5      931    60543    61474   98.486    100.000      99.237    5   0      0
1993     24   24   23         23        23    22     2987   145992   148979   97.995     95.833      96.902   23   0      0
1994     20   20   20         20        20    20     1519   126097   127616   98.810    100.000      99.401   20   0      0
1995     16   16   16         16        16    16     1423   115910   117333   98.787    100.000      99.390   16   0      0
1996     14   14   14         14        14    14     1985   110028   112013   98.228    100.000      99.106   14   0      0
1997     16   16   16         16        15    16     2977   120008   122985   97.579    100.000      98.775   16   0      0
1998     12   12   12         12        12    12     3237   113973   117210   97.238    100.000      98.600   12   0      0
1999     25   25   25         25        25    25     5325   217781   223106   97.613    100.000      98.792   25   0      0
2000     17   17   17         17        17    17     3057   131088   134145   97.721    100.000      98.847   17   0      0
2001     16   16   11         11         6     6     1312    68461    69773   98.120     68.750      80.850   11   0      0
2002     37   37   37         37        37    37     7047   305269   312316   97.744    100.000      98.859   37   0      0
2003     28   28   28         28        19    21     4387   263662   268049   98.363    100.000      99.175   28   0      0
2004     34   34   34         34        16     0     7477   338976   346453   97.842    100.000      98.909   34   0      0
2005     28   28   28         28         0     1     5608   275970   281578   98.008    100.000      98.994   28   0      0
2006      8    8    8          8         0     3     1298    58097    59395   97.815    100.000      98.895    8   0      0
2007      6    6    6          6         0     3      986    63175    64161   98.463    100.000      99.226    6   0      0
2008     12   12   12         12         0     8     2801   123947   126748   97.790    100.000      98.883   12   0      0
2009      9    9    9          9         0     4     1699    78205    79904   97.874    100.000      98.925    9   0      0
2010     14   14   14         14         0     6     2586   174262   176848   98.538    100.000      99.263   14   0      0
2011     11   11   11         11         0     8     2041   125462   127503   98.399    100.000      99.193   11   0      0
2012      4    4    4          4         0     3     1306    64320    65626   98.010    100.000      98.995    4   0      0
2013     12   12   12         12         0     9     3170   111945   115115   97.246    100.000      98.604   12   0      0
2014     77   77   77         77         2    51    19409   802592   822001   97.639    100.000      98.805   77   0      0
2015     60   60   60         60         0    35    18307   652438   670745   97.271    100.000      98.616   60   0      0
total   593  593  587        587       312   419   115772  5238025  5353797   97.838     98.988      98.410  587   0      0

Note#1: unknown words starting with a lower-case letter are sometimes genuine full words missing from the lexicon, but more often fragments of words that were cut because PDFBox misinterpreted a multi-column PDF layout.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence), i.e. the harmonic mean of the two evaluations.
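
The three evaluations can be recomputed from the raw counts of any row. The following is a minimal sketch in Python, not the code that produced this report: the function names are hypothetical, and it assumes the denominator of the combined evaluation is the sum EvalNoise+EvalSilence, which is what the tabulated values suggest.

```python
def noise_score(known_words: int, total_words: int) -> float:
    """Evaluation of noise: percentage of content words known to the lexicon."""
    return 100.0 * known_words / total_words


def silence_score(non_empty_papers: int, pdf_papers: int) -> float:
    """Evaluation of silence: percentage of PDF papers yielding a non-empty extraction."""
    return 100.0 * non_empty_papers / pdf_papers


def combined_score(noise: float, silence: float) -> float:
    """Combined evaluation: 2*noise*silence / (noise + silence)."""
    if noise + silence == 0:
        return 0.0  # assumption: define the combined score as 0 when both inputs are 0
    return 2.0 * noise * silence / (noise + silence)


# Counts taken from the 1993 row: 145992 known words out of 148979,
# 23 non-empty papers out of 24 PDF documents.
noise = noise_score(145992, 148979)
silence = silence_score(23, 24)
combined = combined_score(noise, silence)
print(f"{noise:.3f} {silence:.3f} {combined:.3f}")
```

For the 1993 row this gives 97.995, 95.833 and 96.902, matching the reported values. Because the combination is a harmonic mean, a low silence score pulls the combined figure down sharply, as in 2001 (68.750 silence, 80.850 combined).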

total elapsed time (read and display included) = 2.6958 minutes with 8 cores

computed on: Sun Sep 25 19:38:07 CEST 2016 from: METADONNEES_SPEECH_141120.txt