Benchmarking German and English OCR Systems
              -------------------------------------------
                             Stefan AGNE
              (presented by Markus JUNKER)
      German Research Center for Artificial Intelligence (DFKI) GmbH
          P.O. Box 2080, D-67608 Kaiserslautern, GERMANY
   stefan.agne@dfki.de  Tel. (+49) 631-205-3584  FAX (+49) 631-205-3210
EXTENDED ABSTRACT
In the field of OCR system benchmarking, DFKI works in two research
areas:
 - text-based evaluation of document page segmentation systems and
 - character- and word-based evaluation of text recognition systems.
DOCUMENT PAGE SEGMENTATION
The decomposition of a document into segments such as text regions and
graphics is a significant step in the document analysis process. The
basic requirement for rating and improving page segmentation
algorithms is a systematic evaluation. The approaches known from the
literature share the disadvantage that manually generated reference
data (zoning ground truth) is needed for the evaluation task. The
effort and cost of creating appropriate ground truth are high.
 
At DFKI, the evaluation system SEE has been developed. The system
requires the OCR-generated text and the original text of the document
in correct reading order (text ground truth) as input. The implicit
structure information contained in the text ground truth is used for
the evaluation of the automatic zoning. To this end, a mapping of text
regions in the text ground truth to the corresponding regions in the
OCR-generated text is computed (matches). A fault-tolerant string
matching algorithm is used so that OCR errors in the text are
tolerated.
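To illustrate the idea of fault-tolerant matching, the following
minimal sketch locates a ground-truth text region inside OCR output
using the classic approximate string matching dynamic program. The
function name and the unit edit costs are assumptions for
illustration; this is not the SEE implementation.
---------- Sketch: fault-tolerant string matching (Python) -----------
# Locate a ground-truth text region inside OCR output while tolerating
# OCR errors.  Row 0 of the dynamic program is all zeros, so a match
# may start at any position of the OCR text.

def approximate_find(region, ocr_text):
    """Return (end position, edit distance) of the best occurrence of
    `region` in `ocr_text`."""
    previous = [0] * (len(ocr_text) + 1)
    for i in range(1, len(region) + 1):
        current = [i]                               # empty OCR prefix
        for j in range(1, len(ocr_text) + 1):
            cost = 0 if region[i - 1] == ocr_text[j - 1] else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # (mis)match
        previous = current
    end = min(range(len(ocr_text) + 1), key=lambda j: previous[j])
    return end, previous[end]

print(approximate_find("page segmentation",
                       "... pagc segrnentation is a ..."))
--------------------------- End of example ---------------------------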
The segmentation errors that occur are determined by evaluating this
matching. Subsequently, the edit operations (insertion, substitution,
and deletion of characters) necessary to correct the recognized
segmentation errors are computed in order to estimate the correction
costs. First tests have shown promising results regarding the quality
of the evaluation.
TEXT RECOGNITION
To evaluate text recognition systems, we compare the generated text
(OCR text) with the correct text (text ground truth) by computing the
edit distance (Levenshtein distance) between the two texts. The result
of this comparison is the minimum number of edit operations necessary
to correct the OCR text.
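The following minimal sketch shows the standard dynamic program for
the Levenshtein distance; it is illustrative only and not the DFKI
tool itself.
--------------- Sketch: Levenshtein distance (Python) ----------------
# Edit distance between text ground truth and OCR text: the minimum
# number of insertions, substitutions, and deletions needed to correct
# the OCR text.

def levenshtein(ground_truth, ocr_text):
    prev = list(range(len(ocr_text) + 1))
    for i, g in enumerate(ground_truth, start=1):
        curr = [i]
        for j, o in enumerate(ocr_text, start=1):
            cost = 0 if g == o else 1
            curr.append(min(prev[j] + 1,       # deletion
                            curr[j - 1] + 1,   # insertion
                            prev[j - 1] + cost))
        prev = curr
    return prev[-1]

print(levenshtein("computer", "cornputer"))  # 2: "rn" instead of "m"
--------------------------- End of example ---------------------------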
Based on these techniques, we have developed a tool to compute the
following character-based evaluation measures (a small illustrative
sketch follows the list):
 - character accuracy
 - number of errors (insertions, substitutions, and deletions)
 - accuracy by character class
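The following sketch illustrates the character accuracy definition and
a rough per-class variant. The function names, the difflib-based
alignment, and the example class are assumptions for illustration, not
the actual tool.
------------- Sketch: character-based measures (Python) --------------
# Character accuracy follows the usual ISRI-style definition:
# (#characters - #errors) / #characters, where #errors is the edit
# distance between ground truth and OCR text.
import difflib

def character_accuracy(num_chars, num_errors):
    return (num_chars - num_errors) / num_chars

def accuracy_by_class(ground_truth, ocr_text, classes):
    """Rough per-class accuracy: align both texts with difflib and
    count, for each ground-truth character, whether it was preserved."""
    matcher = difflib.SequenceMatcher(None, ground_truth, ocr_text)
    matched = set()
    for block in matcher.get_matching_blocks():
        matched.update(range(block.a, block.a + block.size))
    result = {}
    for name, members in classes.items():
        hits = [i in matched for i, ch in enumerate(ground_truth)
                if ch in members]
        if hits:
            result[name] = sum(hits) / len(hits)
    return result

# 917 ground-truth characters and 44 errors, as in the tool output
# shown further below:
print(character_accuracy(917, 44))                        # 0.9520...
print(accuracy_by_class("computer 42", "cornputer 4Z",
                        {"digits": "0123456789"}))        # {'digits': 0.5}
--------------------------- End of example ---------------------------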
Furthermore, we have developed a tool to compute several word-based
evaluation measures (again, a small sketch follows the list):
 - word accuracy
 - number of misrecognized words
 - stopword accuracy and non-stopword accuracy
 - distinct non-stopword accuracy
 - phrase accuracy
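The following sketch approximates some of these word-based measures
using a simple whitespace tokenization and a difflib word alignment.
The real definitions follow ISRI, so this is only an illustration and
all names are assumptions.
--------------- Sketch: word-based measures (Python) -----------------
# Approximate word-level measures based on a word alignment between
# ground truth and OCR text.
import difflib

def word_measures(ground_truth, ocr_text, stopwords):
    gt = ground_truth.lower().split()
    ocr = ocr_text.lower().split()
    correct = set()                       # indices of correct gt words
    for b in difflib.SequenceMatcher(None, gt, ocr).get_matching_blocks():
        correct.update(range(b.a, b.a + b.size))

    def acc(indices):
        return sum(i in correct for i in indices) / max(len(indices), 1)

    stop = [i for i, w in enumerate(gt) if w in stopwords]
    nonstop = [i for i, w in enumerate(gt) if w not in stopwords]
    distinct = {}                         # distinct non-stopwords
    for i in nonstop:
        distinct.setdefault(gt[i], []).append(i)
    return {
        "word accuracy": acc(range(len(gt))),
        "misrecognized words": len(gt) - len(correct),
        "stopword accuracy": acc(stop),
        "non-stopword accuracy": acc(nonstop),
        "distinct non-stopword accuracy":
            sum(any(i in correct for i in occ)
                for occ in distinct.values()) / max(len(distinct), 1),
    }

print(word_measures("the quick brown fox", "the qulck brown fox", {"the"}))
--------------------------- End of example ---------------------------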
The exact definitions of the listed measures are given, for example,
in the description of the fifth and last annual test of OCR accuracy
at ISRI in 1996.
Currently, commercial recognizers can provide information about the
confidence of a recognized word (e.g. the Xerox ScanWorX XDOC format),
but they usually provide hardly any character alternatives. Some
systems do, however: Recore from NewSoft, Inc., USA outputs
alternatives, and EasyReader from Mimetics, France provides up to
three alternatives.
The structures used to represent alternatives at a single character
position are called character hypothesis lattices (CHL). An example of
a CHL is shown below (a small parsing sketch follows the example):
---------- Example for character hypothesis lattices (CHL) -----------
(c 999)
(0 456) (o 198)
(m 517)
(q 500) (p 500)
(n 334) (u 333) (h 247)
(t 818)
(e 1000)
(r 734)
.
.
.
--------------------------- End of example ---------------------------
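The following sketch shows one possible way to parse and represent
such a CHL. The format details are inferred from the example above and
are not a specification.
------------------ Sketch: parsing a CHL (Python) --------------------
# Represent the CHL as one list per character position, each entry a
# (character, evidence) pair sorted by descending evidence.
import re

def parse_chl(lines):
    lattice = []
    for line in lines:
        choices = [(ch, int(score))
                   for ch, score in re.findall(r"\((\S) (\d+)\)", line)]
        lattice.append(sorted(choices, key=lambda c: -c[1]))
    return lattice

chl = parse_chl(["(c 999)", "(0 456) (o 198)", "(m 517)", "(q 500) (p 500)",
                 "(n 334) (u 333) (h 247)", "(t 818)", "(e 1000)", "(r 734)"])
print("".join(pos[0][0] for pos in chl))   # first choices: "c0mqnter"
--------------------------- End of example ---------------------------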
Each position in the CHL denotes a possibly recognized character
augmented with an evidence measure. In general, an OCR engine takes
the alternative with the maximum evidence and presents it as the
recognized character (assumption: no contextual postprocessing). A
common observation is that, when the real character is not recognized
as the first choice, it is likely to appear as the second or third
best alternative. Under certain circumstances, however, a single
character is not recognized as one character but as several, or vice
versa. This is due to an "incorrect" character segmentation procedure
on noisy data.
With regard to the CHLs, we have extended the functionality of our
evaluation tools. We now compute the character accuracy for a given
depth, where the depth determines the number of character alternatives
taken into account. For example, if we choose a depth of 3, we say the
character has been correctly recognized if one of the first three
alternatives is equal to the ground truth character.
In the following, the first part of the output of our character-based
evaluation tool is shown; a sketch of the depth-based accuracy
computation follows the example.
------------------- Output of the evaluation tool --------------------
 DFKI Votes Accuracy Tables Version 1.0
 --------------------------------------
     917        Characters total in Ground Truth
    1.16        Votes-alternatives per Character (for 801 Non-space characters)
  Total Accuracy Table:
 ============================================================
| Depth         |      1 |      2 |      3 |      4 |     >4 |
|============================================================|
| Errors        |     44 |     32 |     31 |     30 |     29 |
|---------------|--------|--------|--------|--------|--------|
| Accuracy      |  95.20 |  96.51 |  96.62 |  96.73 |  96.84 |
 ============================================================
.
.
.
--------------------------- End of example ---------------------------
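The following sketch shows how accuracy at a given depth could be
computed, under the simplifying assumption that the CHL positions are
already aligned one-to-one with the ground-truth characters (the real
tool additionally has to cope with segmentation errors). Names and
data are illustrative.
----------- Sketch: character accuracy per depth (Python) ------------
def accuracy_at_depth(chl, ground_truth, depth):
    """A ground-truth character counts as correct if it appears among
    the first `depth` alternatives of the corresponding CHL position."""
    correct = sum(truth in [ch for ch, _ in position[:depth]]
                  for position, truth in zip(chl, ground_truth))
    return correct / len(ground_truth)

# The CHL from the example above (presumably the word "computer"):
chl = [[("c", 999)], [("0", 456), ("o", 198)], [("m", 517)],
       [("q", 500), ("p", 500)], [("n", 334), ("u", 333), ("h", 247)],
       [("t", 818)], [("e", 1000)], [("r", 734)]]
for depth in (1, 2, 3):
    print(depth, accuracy_at_depth(chl, "computer", depth))
# depth 1 -> 0.625; depths 2 and 3 -> 1.0
--------------------------- End of example ---------------------------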
For the evaluation of multilingual OCR, we can think of a series of
further extensions. For example, the measure "accuracy by character
class" can be extended with additional language-specific character
classes, such as a German character class containing the German
umlauts "äÄöÖüÜß". Similar extensions are possible for the word-based
evaluation by using language-specific stopword lists. The suggested
extensions are easy to realize (a small sketch follows).
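As an illustration of the suggested extensions, the following sketch
defines a German umlaut character class and a tiny example stopword
list; both lists are hypothetical examples, not the ones used at DFKI.
----------- Sketch: language-specific extensions (Python) ------------
# An additional character class for the German umlauts and a small
# German stopword list, as inputs to the measures sketched above.

GERMAN_UMLAUTS = set("äÄöÖüÜß")
GERMAN_STOPWORDS = {"der", "die", "das", "und", "ist", "ein", "eine"}

character_classes = {
    "letters": set("abcdefghijklmnopqrstuvwxyz"
                   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"),
    "digits": set("0123456789"),
    "german umlauts": GERMAN_UMLAUTS,     # the additional class
}

# These tables would then be passed to the measures sketched above, e.g.
#   accuracy_by_class(ground_truth, ocr_text, character_classes)
#   word_measures(ground_truth, ocr_text, GERMAN_STOPWORDS)
--------------------------- End of example ---------------------------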
LANGUAGES OF INTEREST
Primarily, DFKI deals with German and English documents. Usually we
have one language per document.
DATASETS
For our tests we use several datasets:
 - English Business Letter Sample (ISRI)
 - German Business Letter Sample (ISRI)
 - several internal datasets:
    + German Facsimile Sample
    + German Magazine Sample
    + Further German Business Letter Samples