Estimation of the quality based on absence/presence of words in the TagParser lexicon.

Columns:
  year      year
  meta      nb of papers from the metadata
  pdf       nb of papers in PDF
  xml       nb of papers in XML (= output of PDFBox)
  nonempty  nb of non-empty papers as extraction result
  abstr     nb of papers with an abstract (from extraction)
  refs      nb of papers with references (from extraction)
  unknown   nb of unknown words
  known     nb of known words
  words     nb of words of the content
  noise     evaluation of noise = percentage of nb of known words / nb of words of the content
  silence   evaluation of silence = percentage of non-empty papers as extraction result / PDF docs
  combined  combined evaluation of noise and silence
  en        nb of English papers
  fr        nb of French papers
  other     nb of papers in another language (es+de+ru)

year   meta   pdf   xml  nonempty  abstr  refs  unknown    known    words   noise  silence  combined    en  fr  other
1992     28    28    28        28     10    24     3996   145213   149209  97.322  100.000    98.643    28   0      0
1993     29    29    29        29     13    22     4491   166263   170754  97.370  100.000    98.667    29   0      0
1994     28    28    27        27     14    25     2677   132112   134789  98.014   96.429    97.215    26   0      1
1995     30    30    28        28     16    25     3696   144981   148677  97.514   93.333    95.378    28   0      0
1996     45    45    45        45     17    39     4101   167910   172011  97.616  100.000    98.794    44   0      1
1997     58    58    57        57     28    49     5498   220283   225781  97.565   98.276    97.919    57   0      0
1998     61    61    61        61     28    55     4837   266532   271369  98.218  100.000    99.101    61   0      0
1999     75    75    72        72     38    65     5199   312887   318086  98.366   96.000    97.168    72   0      0
2000     73    73    39        39     27    38     2378   169374   171752  98.615   53.425    69.304    39   0      0
2001     85    85    60        60     32    58     3059   249852   252911  98.790   70.588    82.341    60   0      0
2002    100   100    86        86     46    83     4461   364615   369076  98.791   86.000    91.953    86   0      0
2003     99    99    94        94     45    86     6406   411949   418355  98.469   94.949    96.677    94   0      0
2004    105   105   101       101     53    94     6078   449794   455872  98.667   96.190    97.413   101   0      0
2005    131   131   128       128     78   123     7775   511465   519240  98.503   97.710    98.105   128   0      0
2006    121   121   121       121     83   118     6553   488075   494628  98.675  100.000    99.333   121   0      0
2007    106   106   106       106     71    98     7037   416907   423944  98.340  100.000    99.163   106   0      0
2008     61    61    61        61     41    55     5293   266548   271841  98.053  100.000    99.017    60   0      1
2009     77    77    77        77     58    69     5479   320302   325781  98.318  100.000    99.152    76   0      1
2010     67    67    67        67     55    63     3881   259469   263350  98.526  100.000    99.258    67   0      0
2011    125   125   124       124     91   116     7470   404551   412021  98.187   99.200    98.691   124   0      0
2012     96    96    96        96     64    83     6415   318333   324748  98.025  100.000    99.002    96   0      0
2013     72    72    72        72     59    67     4065   216240   220305  98.155  100.000    99.069    72   0      0
2014     84    84    84        84     64    81     5055   258386   263441  98.081  100.000    99.031    84   0      0
2015     91    91    91        91     66    83     4711   268169   272880  98.274  100.000    99.129    90   0      1
total  1847  1847  1754      1754   1097  1619   120611  6930210  7050821  98.289   94.965    96.599  1749   0      5
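
As an illustration only (this is not the code that produced the report), the noise and silence figures can be recomputed from the raw counts of a row. The function names below are hypothetical; the input figures are those of the year 2000 row, where silence is at its lowest.

```python
def eval_noise(known_words: int, content_words: int) -> float:
    """Percentage of the content words that are known to the lexicon."""
    if content_words <= 0:
        raise ValueError("content_words must be positive")
    return 100.0 * known_words / content_words


def eval_silence(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of the PDF papers that gave a non-empty extraction result."""
    if pdf_papers <= 0:
        raise ValueError("pdf_papers must be positive")
    return 100.0 * non_empty_papers / pdf_papers


if __name__ == "__main__":
    # Counts taken from the year 2000 row of the table.
    print(f"noise   = {eval_noise(169374, 171752):.3f}")   # noise   = 98.615
    print(f"silence = {eval_silence(39, 73):.3f}")         # silence = 53.425
```

Both values match the report to three decimals, which suggests the percentages are computed directly from the counts shown in the same row.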

Note#1: unknown words with an initial lower-case letter are sometimes genuine unknown words, but often they are fragments of words that were cut because PDFBox misinterpreted the multi-column layout of the PDF.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may result from a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, each paper has an entry in the metadata.

Note#4: the combined evaluation is computed as: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence), i.e. the harmonic mean of the two evaluations.
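
A minimal sketch of the computation in Note#4, assuming the denominator is the sum EvalNoise+EvalSilence taken as a whole (the reported figures are consistent with that reading). The function name is hypothetical, and returning 0.0 when both inputs are zero is an assumption of this sketch, not something the report states.

```python
def combined_evaluation(eval_noise: float, eval_silence: float) -> float:
    """Combine two percentages as 2*N*S / (N+S)."""
    total = eval_noise + eval_silence
    if total == 0:
        # Assumption: treat the degenerate case as 0 instead of dividing by zero.
        return 0.0
    return 2 * eval_noise * eval_silence / total


if __name__ == "__main__":
    # Noise and silence values taken from the 1992 and 2000 rows of the table.
    print(f"{combined_evaluation(97.322, 100.000):.3f}")  # 98.643
    print(f"{combined_evaluation(98.615, 53.425):.3f}")   # 69.304
```

The 2000 row shows why this combination is useful: a high noise evaluation (98.615) cannot mask a low silence evaluation (53.425), since the result is pulled toward the smaller of the two values.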

total elapsed time (read and display included)= 3.453733333333333 minutes with 8 cores

computed on: Sun Sep 25 20:06:43 CEST 2016 from: METADONNEES_TREC_140901.txt