Multilingual OCR Activities and Interest at ScanSoft
     ----------------------------------------------------
                   Yang HE and Ben WITTNER
	ScanSoft, Inc., 9 Centennial Dr, Peabody, MA 01960  USA
	          {benw,yangh}@scansoft.com
   At ScanSoft, Inc., formerly Xerox Imaging Systems, we develop and
market a commercial OCR software product called TextBridge. It currently
supports 56 languages, organized into six groups: American/European,
Baltic, Central European, Cyrillic, Greek, and Turkish. The current
system loads one language group at a time and can recognize all
languages in that group on the same page.
  We have collected large ground truth data sets for our own development
purposes. The ground truth files are in either Latin 1 codepage or
Unicode format. They contain mark-up for certain formatting information,
but for accuracy evaluation the mark-up is filtered out.
   We have developed two different string matchers. The first is based
on 8-bit codepages. It tries to align OCR output lines with ground truth
lines and counts both character and word errors. The second is Unicode
based and aligns the whole page of OCR output to the ground truth as a
single string. At this point it counts character errors only, and we
have not yet tested it with many languages. For daily development we
currently convert the Unicode ground truth to its corresponding codepage
and use the first matcher.
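
   As an illustration of the kind of error counting described above,
the following is a minimal sketch that scores OCR output against ground
truth by edit distance. It is not the TextBridge matcher: the function
names are hypothetical, the alignment is a plain Levenshtein distance,
and word errors are counted by splitting on whitespace, which is only
reasonable for languages that separate words with spaces.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the ref prefix seen so far
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # item missing from hyp
                            curr[j - 1] + 1,      # extra item in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_errors(truth, ocr):
    """Character-level errors, treating each text as one string."""
    return edit_distance(truth, ocr)


def word_errors(truth, ocr):
    """Word-level errors (assumption: words are whitespace separated)."""
    return edit_distance(truth.split(), ocr.split())


if __name__ == "__main__":
    truth = "the quick fox"
    ocr = "the qu1ck fox"
    print(character_errors(truth, ocr), word_errors(truth, ocr))
```

Because Python strings are sequences of Unicode code points, the same
routine applies to codepage text once it has been decoded. Treating a
whole page as one string makes the cost grow with the product of the two
lengths, so a practical matcher would likely need a banded or
line-anchored alignment.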
   We are interested in all the topics listed in the workshop's
"Technical Focus". In addition, based on our experience developing
and using our data and tools, we would like to learn about other
people's opinions and work on the following specific issues:
* Issues associated with a Unicode based matcher that can check 
  character and word errors across all languages.
* Punctuation "normalization", i.e., how to treat as equivalent
  punctuation marks that serve the same function but have different
  shapes and/or codes in different languages.
* The same word written with and without diacritics, for example
  "resume" and "résumé".
* Unusual text line orientations and running directions.
* Non-unique conversion between Big5 and GB codes for Traditional
  and Simplified Chinese. Can the conversion be made unique via
  Unicode? Are there non-unique Unicode/codepage conversions in
  other languages?
* Furigana (the small Japanese characters printed alongside the
  main text to indicate pronunciation).
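
   To make the punctuation and diacritics questions above concrete, here
is a small sketch of one possible normalization step applied to both OCR
output and ground truth before matching. The mapping table and function
names are hypothetical and deliberately incomplete; which marks should
count as equivalent is exactly the open question. The diacritic folding
relies on Unicode canonical decomposition from Python's standard
unicodedata module.

```python
import unicodedata

# Hypothetical, incomplete table: map typographic variants to one
# ASCII representative per function.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u201e": '"',                  # low double quote
    "\u00ab": '"', "\u00bb": '"',   # guillemets
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})


def normalize_punctuation(text):
    """Replace each mapped punctuation variant with its representative."""
    return text.translate(PUNCT_MAP)


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD).

    Caution: this also alters letters whose mark is not optional,
    e.g. Cyrillic short i decomposes to i plus a breve.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    print(normalize_punctuation("\u00abOui\u00bb"))
    print(strip_diacritics("r\u00e9sum\u00e9"))
```

A matcher could report errors both before and after such folding, so
that a missed accent is distinguished from a wrong base letter rather
than silently ignored.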