Evaluation and Improvement of OCR Performance for Japanese-English Mixed Text
-----------------------------------------------------------------------------

Masahiko HATA, Tetsushi WAKABAYASHI, Fumitaka KIMURA, and Yasuji MIYAKE

Faculty of Engineering, Mie University, 1515 Kamihama, Tsu, 514-8507 JAPAN
kimura@hi.info.mie-u.ac.jp TEL/FAX +81-59-231-9457


1. Introduction

The performance of existing commercial Japanese OCR software deteriorates
when the input Japanese text includes English words, English sentences,
computer programs, or commands. The deterioration for such
Japanese-English mixed text is mainly caused by problems in character
segmentation and recognition in the English region.

Japanese OCR software has two reading modes: Japanese mode and
English mode. The English mode is designed to recognize the characters used
in English text (alphanumerals and symbols), while the Japanese mode is
designed to recognize all characters used in Japanese text (alphanumerals,
symbols, Kanji, Hiragana, and Katakana). Because the English mode is
specialized for segmentation and recognition of English characters, it
performs better on the English region than the Japanese mode does. However,
the English mode cannot be applied to Japanese-English mixed text, so the
recognition accuracy of the English region is relatively low.

Section 2 evaluates the accuracy of character segmentation and recognition
for Japanese-English mixed text to reveal the problems. Section 3 describes
a procedure for detecting fixed pitch regions to improve character
segmentation in the English region. Section 4 describes a procedure to
merge and correct the OCR output of the Japanese mode and the English mode.

2. Evaluation of OCR performance for Japanese-English mixed text

Eight test sheets are used in the performance evaluation. Table 1 shows
the number of characters in each region of the test sheets.

Table 1. Number of characters in each region of Japanese-English
mixed text sheets

Test sheet              English region  Japanese region  Total
Windows manual                     215              528    743
Magazine(ASCII)                    157              893   1050
Advertisement                      263              972   1235
Magazine(Nikkei Byte)              643             1065   1708
Magazine(Interface)                253              295    548
Magazine(ASCII)                     98             1133   1231
Magazine(Interface)                349             1317   1666
Magazine(Nikkei Byte)              190              796    986
Total                             2466             8838  11304

Four typical Japanese OCR software packages (A, B, C, and D) are used in
the evaluation test.

Tables 2 and 3 show the accuracy of character segmentation and recognition,
respectively, of each OCR software package for the test sheets. On average,
the error rates of character segmentation and recognition in the English
region are reduced by about 40% by the use of the English mode (from 9.88%
to 5.68% and from 18.10% to 11.15%, respectively), although OCR C performs
worse in the English mode than in the Japanese mode. While most errors in
the Japanese region are recognition errors, about half of the errors in the
English region are segmentation errors. These results show that improving
the accuracy of character segmentation and recognition in the English
region is necessary to improve the total OCR performance for
Japanese-English mixed text.

Table 2. Accuracy of character segmentation for Japanese-English
mixed text (%)

         Japanese region  English region
OCR      Japanese mode    Japanese mode  English mode
A        98.97            86.70          94.93
B        98.65            88.77          96.57
C        99.12            94.93          90.54
D        99.13            90.06          95.23
Average  98.97            90.12          94.32

Table 3. Accuracy of character recognition for Japanese-English
mixed text (%)

         Japanese region  English region
OCR      Japanese mode    Japanese mode  English mode
A        92.88            78.85          90.44
B        91.23            80.20          92.51
C        95.89            89.38          84.73
D        88.41            79.19          87.73
Average  92.10            81.90          88.85
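
The effect of the English mode on the error rates in the English region can
be checked directly from the averages in Tables 2 and 3. The short Python
sketch below converts accuracy to error rate (100 minus accuracy) and
computes the relative reduction. The function name `error_reduction` is
only illustrative and does not come from the evaluated software.

```python
# Average accuracies (%) in the English region, copied from Tables 2 and 3.
averages = {
    "segmentation": {"japanese_mode": 90.12, "english_mode": 94.32},
    "recognition": {"japanese_mode": 81.90, "english_mode": 88.85},
}


def error_reduction(acc_before, acc_after):
    """Relative reduction (%) of the error rate between two accuracies (%)."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before


for task, acc in averages.items():
    reduction = error_reduction(acc["japanese_mode"], acc["english_mode"])
    print(f"{task}: {reduction:.1f}% fewer errors in English mode")
```

With these averages the segmentation errors drop by roughly 42.5% and the
recognition errors by roughly 38.4%.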

3. Detection of fixed pitch region

The height and width of printed Japanese characters are correlated, and
the characters are usually aligned at a fixed pitch. This property can be
used to estimate the pitch of the character alignment and to detect the
fixed pitch regions. Once the fixed pitch regions are detected, the
Japanese region (with fixed pitch) and the English region (with variable
pitch) can be separated.

3.1 Estimation of character pitch

The pitch of character alignment in each line is estimated by the following
procedure.

(1) For a given width of the rectangular character frame, a ladder of
horizontally aligned frames of that width is shifted from left to right
along the text line. The frame width ranges from 80% to 125% of the
character height, and the horizontal displacement of the ladder ranges
from 0% to 100% of the frame width.
(2) The frame width that minimizes the number of black pixels on the edges
of the ladder over all displacements in (1) is defined as the estimated
pitch.

The number of black pixels on the edges of the ladder is calculated
using horizontal pixel projection of the text line.
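
As an illustration of steps (1) and (2), the following Python sketch
estimates the pitch from a projection profile. It assumes that
`projection[x]` holds the number of black pixels in pixel column x of the
text line, and that the ladder edges are the columns at offset,
offset + width, offset + 2 * width, and so on. The function name, the
integer rounding of the 80-125% range, and the unnormalized cost are
assumptions made for this sketch, not details given in this paper.

```python
def estimate_pitch(projection, char_height):
    """Return (pitch, offset) minimizing black pixels on the ladder edges.

    projection[x] is assumed to be the black-pixel count of column x.
    """
    w_min = (80 * char_height + 99) // 100   # 80% of the height, rounded up
    w_max = 125 * char_height // 100         # 125% of the height, rounded down
    best = None                              # (cost, width, offset)
    for width in range(w_min, w_max + 1):
        for offset in range(width):          # displacement: 0..100% of width
            # Sum the projection on every edge column of the ladder.
            cost = sum(projection[x]
                       for x in range(offset, len(projection), width))
            if best is None or cost < best[0]:
                best = (cost, width, offset)
    return best[1], best[2]


# Synthetic line: six characters 8 pixels wide, 2-pixel gaps, height 10.
line = ([10] * 8 + [0] * 2) * 6
print(estimate_pitch(line, 10))  # (10, 8)
```

Because wider frames produce fewer edges, dividing the cost by the number
of edges may be a useful refinement in practice.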

3.2 Detection of fixed pitch region

The ladder with the estimated frame width is shifted from left to right
along the text line, and a region in which characters are enclosed in five
or more successive frames without intersecting the frame edges is detected
as a fixed pitch region. At either end of the text line, a region of
characters enclosed in three or more successive frames is detected as a
fixed pitch region.
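
The detection rule can be sketched in Python as follows. This is one
possible reading of the rule, under stated assumptions: `projection[x]` is
the black-pixel count of column x, a frame counts as enclosing a character
when both of its edge columns are blank and its interior contains ink, and
the ladder position is given by a `pitch` and `offset` estimated
beforehand. All names are hypothetical.

```python
def detect_fixed_pitch_regions(projection, pitch, offset,
                               min_frames=5, min_frames_at_ends=3):
    """Return (start, end) column ranges judged to be fixed pitch."""
    n = len(projection)

    def ink(x):
        # Columns outside the text line are treated as blank.
        return projection[x] if 0 <= x < n else 0

    # Ladder edges, starting one frame before the line so that the first
    # character is also enclosed in a frame.
    edges = list(range(offset - pitch, n + pitch, pitch))
    clean = [
        ink(left) == 0 and ink(right) == 0
        and any(ink(x) for x in range(left + 1, right))
        for left, right in zip(edges, edges[1:])
    ]

    regions = []
    i = 0
    while i < len(clean):
        if not clean[i]:
            i += 1
            continue
        j = i
        while j < len(clean) and clean[j]:
            j += 1
        # Runs touching either end of the line need only three frames.
        at_line_end = i == 0 or j == len(clean)
        needed = min_frames_at_ends if at_line_end else min_frames
        if j - i >= needed:
            regions.append((max(edges[i], 0), min(edges[j], n)))
        i = j
    return regions


# Six fixed pitch characters (pitch 10) followed by six narrow characters
# (pitch 5), as a stand-in for a Japanese region followed by English text.
line = ([10] * 8 + [0] * 2) * 6 + ([10] * 4 + [0]) * 6
print(detect_fixed_pitch_regions(line, 10, 8))  # [(0, 58)]
```

In this sketch a blank frame, such as a full-width space, ends a run; a
practical implementation might tolerate such frames.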

3.3 Character segmentation of Japanese-English mixed text

Characters in the fixed pitch regions are segmented synchronously at
the estimated pitch. This synchronous segmentation avoids the incorrect
separation of Kanji or Hiragana characters whose left and right parts are
disconnected. Characters in the variable pitch regions are segmented
asynchronously, which is suitable for alphanumerals aligned at a narrow,
variable pitch.
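
The difference between the two segmentation styles can be illustrated with
a small Python sketch. It assumes that fixed pitch regions are given as
(start, end) column ranges, that synchronous segmentation cuts a region
into frames of the estimated pitch, and that asynchronous segmentation
cuts at blank columns of the projection. The function names are
hypothetical.

```python
def segment_by_pitch(start, end, pitch):
    """Synchronous segmentation: one box per frame of the estimated pitch."""
    return [(x, min(x + pitch, end)) for x in range(start, end, pitch)]


def segment_by_gaps(projection, start, end):
    """Asynchronous segmentation: one box per run of inked columns."""
    boxes, run_start = [], None
    for x in range(start, end):
        if projection[x] > 0 and run_start is None:
            run_start = x
        elif projection[x] == 0 and run_start is not None:
            boxes.append((run_start, x))
            run_start = None
    if run_start is not None:
        boxes.append((run_start, end))
    return boxes


def segment_line(projection, fixed_regions, pitch):
    """Segment fixed pitch regions by pitch and the rest by gaps."""
    boxes, cursor = [], 0
    for start, end in sorted(fixed_regions):
        boxes += segment_by_gaps(projection, cursor, start)
        boxes += segment_by_pitch(start, end, pitch)
        cursor = end
    boxes += segment_by_gaps(projection, cursor, len(projection))
    return boxes


# A character with disconnected left and right parts, a solid character,
# and two narrow characters standing in for alphanumerals.
line = ([5, 5, 5, 0, 5, 5, 5, 5, 0, 0] + [5] * 8 + [0, 0]
        + [5] * 4 + [0] + [5] * 4 + [0])
print(segment_by_gaps(line, 0, 10))       # [(0, 3), (4, 8)]
print(segment_line(line, [(0, 20)], 10))  # [(0, 10), (10, 20), (20, 24), (25, 29)]
```

Gap-based cutting splits the first character into two boxes, while
pitch-based cutting inside the fixed pitch region keeps it whole.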

Table 4 shows the accuracy of character segmentation for Japanese-English
mixed text. In region-independent character segmentation, the entire text
was assumed to be fixed pitch and was segmented synchronously. In this
experiment, character boundaries were detected simply from the horizontal
pixel projection of the text lines in both fixed and variable pitch
regions. The accuracy of character segmentation in the English region is
significantly improved by the fixed pitch region detection (from 56.20% to
83.90%).

Table 4. Accuracy of character segmentation of Japanese-English
mixed text (%)

Method              Alphanumeral region  Japanese region  Total region
Region independent  56.20                96.02            87.33
Region detection    83.90                96.16            93.49

4. String matching and correction of the OCR output

The recognition accuracy of the English region can be improved by replacing
the alphanumeral strings output by the Japanese mode with the corresponding
strings output by the English mode. An alphanumeral string output by the
Japanese mode is matched against the output of the English mode by a string
matching algorithm to find the corresponding string with the minimum edit
cost. The algorithm uses deletion, insertion, and substitution of
characters, each with a fixed cost. The edit cost is the total cost of the
operations needed to edit an input string into the reference string, and
it is minimized by dynamic programming. The cost of insertions preceding
and succeeding the reference string is neglected so that the corresponding
substring can be detected.
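
A minimal Python sketch of such a matching step is given here. It uses the
standard dynamic programming recurrence for approximate substring matching,
in which the first row of the cost table is zero (free leading insertions)
and the minimum is taken over the last row (free trailing insertions). The
unit costs, the function name, and the example strings are assumptions;
the paper only states that the costs are fixed.

```python
def find_best_substring(pattern, text, cost_del=1, cost_ins=1, cost_sub=1):
    """Return (cost, start, end) of the substring text[start:end] that
    pattern can be edited into at minimum cost."""
    m, n = len(pattern), len(text)
    # dist[i][j]: minimum cost of editing pattern[:i] into a substring of
    # text that ends at position j. start[i][j]: where that substring begins.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    start = [list(range(n + 1)) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + cost_del
        start[i][0] = 0
        for j in range(1, n + 1):
            same = pattern[i - 1] == text[j - 1]
            sub = dist[i - 1][j - 1] + (0 if same else cost_sub)
            dele = dist[i - 1][j] + cost_del
            ins = dist[i][j - 1] + cost_ins
            best = min(sub, dele, ins)
            dist[i][j] = best
            if best == sub:
                start[i][j] = start[i - 1][j - 1]
            elif best == dele:
                start[i][j] = start[i - 1][j]
            else:
                start[i][j] = start[i][j - 1]
    end = min(range(n + 1), key=lambda j: dist[m][j])
    return dist[m][end], start[m][end], end


# Made-up strings: a misread word matched against a longer reference line.
cost, s, e = find_best_substring("Wlndows", "the Windows manual")
print(cost, "the Windows manual"[s:e])  # 1 Windows
```
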
Table 5 shows the accuracy improvement obtained by string matching and
correction. String matching and correction is applied only to output
alphanumeral strings of length five or more, because it is not effective
for very short strings. OCR software C was used in the Japanese mode
and B in the English mode.

Table 5. Accuracy improvement by string matching and correction (%)

                   Alphanumeral region  Japanese region  Total region
before correction  85.81                96.01            92.72
after correction   91.41                96.04            94.55

5. Conclusions

In this paper, the accuracy of character segmentation and recognition for
Japanese-English mixed text was evaluated, and a procedure for detecting
fixed pitch regions to improve character segmentation in the English region
was described. A procedure to merge and correct the OCR output of the
Japanese mode and the English mode was also described. The experimental
results are summarized as follows.
(1) The performance deterioration for Japanese-English mixed text
recognition is mainly caused by problems in character segmentation and
recognition in the English region.
(2) The detection and separation of fixed pitch regions is effective for
improving character segmentation of Japanese-English mixed text.
(3) The recognition accuracy of the English region can be improved by
replacing the alphanumeral strings output by the Japanese mode with the
corresponding strings output by the English mode.

Regarding the detection of fixed pitch regions, (1) improving the accuracy
of character segmentation in variable pitch regions and (2) evaluating the
performance by character recognition accuracy remain as future research
topics. Regarding string matching and correction, further study is needed
on (1) improving the accuracy of alphanumeral string detection and
(2) string matching and correction of short alphanumeral strings.