Quality estimation based on the presence or absence of words in the TagParser lexicon.

Column key:
year: year
meta: nb of papers from the metadata
pdf: nb of papers in PDF
xml: nb of papers in XML (= output of PDFBox)
non-empty: nb of non-empty papers as extraction result
abstract: nb of papers with an abstract (from extraction)
refs: nb of papers with references (from extraction)
unknown: nb of unknown words
known: nb of known words
words: nb of words of the content
noise: evaluation of noise = percentage of nb of known words / nb of words of the content
silence: evaluation of silence = percentage of non-empty papers as extraction result / PDF docs
combined: combined evaluation of noise and silence
en: nb of English papers
fr: nb of French papers
other: nb of papers in another language (es+de+ru)

year    meta    pdf    xml  non-empty  abstract   refs  unknown     known     words    noise  silence  combined     en   fr  other
1979      28     28     26         26         0     10     5290    105235    110525   95.214   92.857    94.021     26    0      0
1980      44     44     41         41         0     26     4401    117779    122180   96.398   93.182    94.763     41    0      0
1981      36     36     34         34         4     21     6069    168426    174495   96.522   94.444    95.472     34    0      0
1982      39     39     38         38         9     24     2828    128108    130936   97.840   97.436    97.638     38    0      0
1983      25     25     25         25         8     21     4440    143701    148141   97.003  100.000    98.479     25    0      0
1984     116    116    112        112        41     79    15638    458816    474454   96.704   96.552    96.628    111    0      1
1985      40     40     35         35        18     27    11597    251287    262884   95.589   87.500    91.366     35    0      0
1986      41     41     41         41        17     30     6077    224163    230240   97.361  100.000    98.663     41    0      0
1987      34     34     34         34        21     30     5422    192219    197641   97.257  100.000    98.609     34    0      0
1988      35     35     34         34        18     27     6944    204037    210981   96.709   97.143    96.925     34    0      0
1989      34     34     34         34        17     32     5690    213650    219340   97.406  100.000    98.686     34    0      0
1990      39     39     39         39        22     37     5994    234950    240944   97.512  100.000    98.740     39    0      0
1991      56     56     56         56        20     41     6050    249948    255998   97.637  100.000    98.804     56    0      0
1992      54     54     54         54        23     46     5188    246216    251404   97.936  100.000    98.957     54    0      0
1993      47     47     47         47        12     38     5727    221023    226750   97.474  100.000    98.721     47    0      0
1994      52     52     52         52        12     45     5544    257042    262586   97.889  100.000    98.933     52    0      0
1995      56     56     56         56        11     44     5863    245263    251126   97.665  100.000    98.819     56    0      0
1996      58     58     58         58        14     43     6686    291638    298324   97.759  100.000    98.867     58    0      0
1997      73     73     73         73        11     55     8903    378401    387304   97.701  100.000    98.837     73    0      0
1998     245    245    243        243        74    191    24592   1009137   1033729   97.621   99.184    98.396    242    1      0
1999      83     83     83         83        25     64     8560    387753    396313   97.840  100.000    98.908     83    0      0
2000      79     79     50         50        42     48     3565    199113    202678   98.241   63.291    76.985     49    0      1
2001      70     70     66         66        61     65     7379    327019    334398   97.793   94.286    96.008     66    0      0
2002      65     65     58         58        57     58     5215    298351    303566   98.282   89.231    93.538     58    0      0
2003     110    110    104        104        99    103     8677    461852    470529   98.156   94.545    96.317    104    0      0
2004     100    100     98         98        98     98     9449    515182    524631   98.199   98.000    98.099     98    0      0
2005     103    103    103        103       103    103     8397    503098    511495   98.358  100.000    99.172    103    0      0
2006     272    272    270        270       267    269    24118   1398064   1422182   98.304   99.265    98.782    270    0      0
2007     188    188    188        188       184    188    13021    829549    842570   98.455  100.000    99.221    188    0      0
2008     187    187    187        187       183    187    12516    839433    851949   98.531  100.000    99.260    187    0      0
2009     214    214    214        214       212    214    17484    963890    981374   98.218  100.000    99.101    214    0      0
2010     230    230    230        230       229    230    23073   1256932   1280005   98.197  100.000    99.091    230    0      0
2011     292    292    292        292       288    292    25457   1404595   1430052   98.220  100.000    99.102    292    0      0
2012     187    187    187        187       185    187    16685    865685    882370   98.109  100.000    99.046    187    0      0
2013     328    328    328        328       324    328    31145   1618391   1649536   98.112  100.000    99.047    328    0      0
2014     286    286    286        286       285    286    29436   1472744   1502180   98.040  100.000    99.011    286    0      0
2015     318    318    317        317       314    317    34273   1587453   1621726   97.887   99.686    98.778    317    0      0
total   4264   4264   4193       4193      3308   3904   427393  20270143  20697536   97.935   98.335    98.135   4190    1      2

Note#1: unknown words with an initial lower-case letter are sometimes genuine whole words missing from the lexicon, but are often word fragments produced when PDFBox misinterprets multi-column PDF layouts.

Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.

Note#3: a paper without any content still has an entry in the metadata; normally, every paper has an entry in the metadata.

Note#4: the combined evaluation is the harmonic mean of the two scores, computed as: (2*EvalNoise*EvalSilence) / (EvalNoise+EvalSilence)
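
The three evaluation measures can be sketched in Python as follows. This is a minimal illustration, not the code that produced this report: the function names are hypothetical, and it assumes that the division in Note#4 applies to the whole sum EvalNoise+EvalSilence (a harmonic mean), which is the reading that matches the reported figures.

```python
def noise_score(known_words: int, total_words: int) -> float:
    """Percentage of content words that are known to the lexicon."""
    return 100.0 * known_words / total_words


def silence_score(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF documents that gave a non-empty extraction."""
    return 100.0 * non_empty_papers / pdf_papers


def combined_score(noise: float, silence: float) -> float:
    """Harmonic mean of the two percentages (assumed reading of Note#4)."""
    return 2.0 * noise * silence / (noise + silence)


# Figures taken from the 1979 row of the table.
noise = noise_score(known_words=105235, total_words=110525)
silence = silence_score(non_empty_papers=26, pdf_papers=28)
combined = combined_score(noise, silence)
print(f"{noise:.3f} {silence:.3f} {combined:.3f}")
```

With the 1979 figures, the printed values should agree with the table entries for that year (95.214, 92.857 and 94.021).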

total elapsed time (read and display included)= 11.982516666666667 minutes with 8 cores

computed on: Sun Sep 25 17:48:17 CEST 2016 from: METADONNEES_ACL_140330.txt