|year||nb of papers from the metadata||nb of papers in PDF||nb of papers in XML (= output of PDFBox)||nb of non-empty papers as extraction result||nb of papers with an abstract (from extraction)||nb of papers with references (from extraction)||nb of unknown words||nb of known words||nb of words of the content||evaluation of noise = percentage of nb of known words / nb of words of the content||evaluation of silence = percentage of non-empty papers as extraction result / PDF docs||combined evaluation of noise and silence||nb of English papers||nb of French papers||nb of papers in another language (es+de+ru)|
Note#1: the unknown words with an initial lower-case letter are sometimes genuine full-size unknown words, but often they are words that have been cut because PDFBox misinterpreted multi-column PDF layouts.
Note#2: a paper without any content (i.e. without any body) is not taken into the pipeline. This situation may be the consequence of a processing problem, or the paper may be an invited presentation without any text. In contrast, a paper without an abstract is taken.
Note#3: a paper without any content holds an entry in the metadata, and normally, each paper has an entry in the metadata.
Note#4: the combined evaluation is computed as the harmonic mean of the two scores: 2*EvalNoise*EvalSilence / (EvalNoise+EvalSilence)
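The three evaluation columns can be sketched as follows. This is only an illustrative sketch in Python, not the code that produced the table: the function names and the counts are hypothetical, and the handling of a zero denominator is an assumption.

```python
def percentage(part: float, whole: float) -> float:
    """Return part/whole as a percentage (assumed 0.0 when whole is zero)."""
    return 100.0 * part / whole if whole else 0.0


def combined_evaluation(eval_noise: float, eval_silence: float) -> float:
    """Harmonic mean of the two scores: 2*N*S / (N+S)."""
    total = eval_noise + eval_silence
    return 2.0 * eval_noise * eval_silence / total if total else 0.0


# Hypothetical counts for one year (not taken from the actual table).
eval_noise = percentage(part=900, whole=1000)   # known words / words of the content
eval_silence = percentage(part=45, whole=50)    # non-empty extracted papers / PDF papers
combined = combined_evaluation(eval_noise, eval_silence)
```

Because the harmonic mean is pulled toward the lower of the two scores, a year with clean text but many empty extractions (or the reverse) still gets a low combined value.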
total elapsed time (read and display included)= 0.7241166666666667 minutes with 8 cores
computed on: Sun Sep 25 17:50:08 CEST 2016 from: METADONNEES_ANLP_140426.txt