year | nb of papers from the metadata | nb of papers in PDF | nb of papers in XML (= output of PDFBox) | nb of non-empty papers as extraction result | nb of papers with an abstract (from extraction) | nb of papers with references (from extraction) | nb of unknown words | nb of known words | nb of words of the content | evaluation of noise = percentage of nb of known words / nb of words of the content | evaluation of silence = percentage of non-empty papers as extraction result / PDF docs | combined evaluation of noise and silence | nb of English papers | nb of French papers | nb of papers in another language (es+de+ru) |
1998 | 212 | 212 | 207 | 207 | 165 | 193 | 26197 | 854662 | 880859 | 97.026 | 97.642 | 97.333 | 207 | 0 | 0 |
2000 | 280 | 280 | 270 | 270 | 223 | 245 | 22078 | 1093881 | 1115959 | 98.022 | 96.429 | 97.219 | 267 | 2 | 1 |
2002 | 354 | 354 | 346 | 346 | 316 | 332 | 28214 | 1486964 | 1515178 | 98.138 | 97.740 | 97.939 | 346 | 0 | 0 |
2004 | 517 | 517 | 508 | 508 | 460 | 493 | 27487 | 1556574 | 1584061 | 98.265 | 98.259 | 98.262 | 508 | 0 | 0 |
2006 | 514 | 514 | 512 | 512 | 480 | 506 | 34353 | 1917685 | 1952038 | 98.240 | 99.611 | 98.921 | 512 | 0 | 0 |
2008 | 620 | 620 | 616 | 616 | 585 | 612 | 47506 | 2446429 | 2493935 | 98.095 | 99.355 | 98.721 | 616 | 0 | 0 |
2010 | 641 | 641 | 638 | 638 | 601 | 634 | 52393 | 2625293 | 2677686 | 98.043 | 99.532 | 98.782 | 638 | 0 | 0 |
2012 | 670 | 670 | 670 | 670 | 664 | 670 | 55268 | 2825887 | 2881155 | 98.082 | 100.000 | 99.032 | 670 | 0 | 0 |
2014 | 744 | 744 | 739 | 739 | 738 | 736 | 64797 | 3081978 | 3146775 | 97.941 | 99.328 | 98.630 | 739 | 0 | 0 |
total | 4552 | 4552 | 4506 | 4506 | 4232 | 4421 | 358293 | 17889353 | 18247646 | 98.036 | 98.989 | 98.511 | 4503 | 2 | 1 |
Note#1: unknown words with an initial lower-case letter are sometimes genuine unknown words, but are often words that were cut because PDFBox misinterpreted multi-column PDF layouts.
Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.
Note#3: a paper without any content still has an entry in the metadata; normally, each paper has an entry in the metadata.
Note#4: the combined evaluation is computed as the harmonic mean of the two scores: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
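The three evaluation columns can be recomputed from the raw counts. The following is a minimal Python sketch, not the pipeline's own code: the function names are hypothetical, and it assumes the noise and silence scores are plain percentages as described in the column headers, combined with parentheses around the sum in the denominator.

```python
def noise_score(known_words: int, total_words: int) -> float:
    """Percentage of known words among all words of the content."""
    return 100.0 * known_words / total_words


def silence_score(non_empty_papers: int, pdf_papers: int) -> float:
    """Percentage of PDF papers that yielded a non-empty extraction."""
    return 100.0 * non_empty_papers / pdf_papers


def combined_score(noise: float, silence: float) -> float:
    """Harmonic mean of the two scores (same form as an F-measure)."""
    if noise + silence == 0:
        return 0.0  # avoid a division by zero when both scores are zero
    return 2.0 * noise * silence / (noise + silence)


# Raw counts taken from the 1998 row of the table.
noise_1998 = noise_score(known_words=854662, total_words=880859)
silence_1998 = silence_score(non_empty_papers=207, pdf_papers=212)
combined_1998 = combined_score(noise_1998, silence_1998)

print(f"{noise_1998:.3f} | {silence_1998:.3f} | {combined_1998:.3f}")
```

With the 1998 counts as input, the results can be compared against the values reported in that row (97.026, 97.642 and 97.333).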
total elapsed time (read and display included)= 8.83805 minutes with 8 cores
computed on: Sun Sep 25 19:24:04 CEST 2016 from: METADONNEES_LREC_140319.txt