      Classification and Identification of Multi-lingual Documents
      ------------------------------------------------------------
              Jie Ding, Louisa Lam, and Ching Suen
   Centre for Pattern Recognition and Machine Intelligence, Suite GM-606
        Concordia University, 1455 de Maisonneuve Blvd West
               Montreal, Quebec H3G 1M8, CANADA
       llam@ied.edu.hk Tel. +852-2948-7808 FAX +852-2948-8014
   Language classification (LC) refers to the categorization of text 
documents into different natural language groups, whereas language 
identification (LI) determines the language used in a document. LC 
and LI play important roles in document processing systems, because 
they can perform initial classifications to reduce the scope for 
subsequent stages of processing [1, 2]. This study addresses these
two topics in the following manner:
 (1) LC of documents written in 24 languages into two language 
     categories (oriental and European), and 
 (2) LI of oriental documents into Chinese, Japanese and Korean.
   Statistical features have been explored to differentiate between 
documents printed in various natural languages. A total of 6 
distinctive features are proposed, of which 3 are used for LC, 
viz. horizontal projection profiles, height distributions of 
connected components (CC) and enclosing structure of connected 
components. Experimental results show that we are able to classify 
the script of a document as either European or oriental based on four 
sets of 50 connected components, and obtain a high recognition rate 
while keeping the rejection rate at a low level.
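As an illustration of two of the LC features named above, the following minimal sketch computes a horizontal projection profile and a height distribution of connected components. The toy image, the bounding boxes and all function names are assumptions made for illustration only; the extraction of connected components and the enclosing-structure feature are not shown, and this is not claimed to be the implementation used in the study.

```python
from collections import Counter


def horizontal_projection(image):
    """Count the ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def component_heights(boxes):
    """Heights of components given inclusive (top_row, bottom_row) bounds."""
    return [bottom - top + 1 for top, bottom in boxes]


def height_distribution(heights):
    """Relative frequency of each component height."""
    counts = Counter(heights)
    total = len(heights)
    return {h: c / total for h, c in sorted(counts.items())}


# Toy 5x6 binary image (1 = ink); hypothetical data, not from the paper.
image = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
profile = horizontal_projection(image)

# Hypothetical row bounds of four connected components.
heights = component_heights([(1, 3), (1, 3), (2, 3), (1, 2)])
dist = height_distribution(heights)

print(profile)  # [0, 4, 2, 4, 0]
print(dist)     # {2: 0.5, 3: 0.5}
```

In such a sketch the profile and the height distribution would be computed over one set of 50 connected components rather than over a whole page.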
   In the LI of oriental documents, the complexity of structure, Korean
'circles' and vertical strokes have been chosen as features to 
distinguish among the three language scripts. The identification 
has been made according to the values of these features, and also by 
a clustering algorithm.
   For experimental studies, seven hundred documents were collected at 
CENPARMI. The recognition rates achieved in LC and LI exceeded 
95% and 94%, with error rates below 2% and 4.5%, respectively.
                                   
Results of language classification using one set of 50 components:
Language   Samples    Not processed    Recognition(%)   Error(%)  Reject(%)
===========================================================================
European   262         0               95.32            4.68      0.00
Chinese    181         0               98.34            1.66      0.00
Japanese    84         0               99.21            0.79      0.00
Korean     154         0               97.62            2.38      0.00      
Results of language classification using three sets of 50 components:
Language   Samples    Not processed    Recognition(%)   Error(%)  Reject(%)
===========================================================================
European   262         2               98.08            1.92      0.00 
Chinese    181         2              100.0             0.00      0.00 
Japanese    84         0              100.0             0.00      0.00
Korean     154         5              100.0             0.00      0.00        
Results of language classification using four sets of 50 components:
Language   Samples    Not processed    Recognition(%)   Error(%)  Reject(%)
===========================================================================
European   262         5               99.22            0.00      0.78
Chinese    181         4              100.0             0.00      0.00
Japanese    84         0              100.0             0.00      0.00
Korean     154        10              100.0             0.00      0.00
The above tables show the results of LC based on one, three and four 
sets of 50 components, respectively. In the last two cases, decisions 
are made by majority voting over the 3 or 4 sets of 50 components [3]. 
To form a set, we randomly select a relatively long text line and, if 
necessary, concatenate several lines to obtain 50 components. As the 
generation of one set of 50 components depends on random selection, we 
do not rely on the outcome of a single trial: when only one set of 50 
components is used, we test the data set three times and average the 
results in order to reduce the element of chance. The results show that 
the error rates are higher when 1 or 3 sets of 50 components are 
considered, while the rejection rates are higher when an even number of 
sets is used.
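The sampling and voting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of [3]: the function names are hypothetical, text lines are modelled as plain lists of components, a document with fewer than 50 components is assumed to be left unprocessed, and a vote without a strict majority is assumed to be rejected.

```python
import random
from collections import Counter

SET_SIZE = 50  # number of connected components per set, as in the text


def sample_component_set(lines, rng, size=SET_SIZE):
    """Concatenate randomly ordered text lines until `size` components exist.

    Returns None when the document has too few components (assumed here to
    correspond to the "Not processed" column).
    """
    order = list(lines)
    rng.shuffle(order)
    pooled = []
    for line in order:
        pooled.extend(line)
        if len(pooled) >= size:
            return pooled[:size]
    return None


def majority_vote(labels):
    """Return the label held by a strict majority of votes, else 'reject'."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else "reject"


rng = random.Random(0)
lines = [list(range(30)), list(range(25)), list(range(10))]  # toy document
one_set = sample_component_set(lines, rng)
print(len(one_set))                                              # 50
print(majority_vote(["European", "European", "oriental"]))       # European
print(majority_vote(["European", "oriental"] * 2))               # reject
```

With two classes, three votes can never tie, whereas four votes can split 2-2. Under the assumed tie rule this is consistent with the observation that rejections appear only when an even number of sets is used.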
   Analyses of the results indicate that European documents tend to be 
misclassified as oriental ones if:
(1) The quality of the document is poor, either because many characters 
    are broken or because some characters touch each other.
(2) The document is written in a font that is closer to handwriting than
    to machine print.
   On the other hand, oriental documents tend to be classified as European 
if more than 20% of the characters in the document belong to a foreign
language.
   In order to take care of those documents not processed in LC, the 
scripts are further processed by an LI stage that makes use of 3 
principal features, viz. 
      (i)   complexity of the character 'C', 
      (ii)  circles/ellipses 'K', and
      (iii) vertical strokes 'V'. 
   Examination of the training data indicates that Korean documents have 
"high" K and V values, while Chinese and Japanese documents have a 
different range of C values. Intuitively, language identification 
can be based on the C, K and V values.
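A decision rule of this kind might look like the sketch below. All threshold values, the normalisation of C, K and V to the range 0 to 1, and the rejection band between the Chinese and Japanese ranges are hypothetical choices for illustration; the abstract does not give the actual values. The only properties taken from the text are that Korean documents have high K and V, and that Chinese documents show more complex structure (higher C) than Japanese ones.

```python
def identify_language(c, k, v,
                      k_min=0.30, v_min=0.40,
                      c_japanese_max=0.45, c_chinese_min=0.55):
    """Rule-based identification from C, K and V (hypothetical thresholds).

    c: structural complexity, k: circle/ellipse measure, v: vertical-stroke
    measure, each assumed to be normalised to [0, 1].
    """
    # Korean: both the circle and the vertical-stroke measures are high.
    if k >= k_min and v >= v_min:
        return "Korean"
    # Otherwise separate Chinese from Japanese by structural complexity.
    if c >= c_chinese_min:
        return "Chinese"
    if c <= c_japanese_max:
        return "Japanese"
    # Complexity falls between the two ranges: no decision is made.
    return "reject"


print(identify_language(0.50, 0.60, 0.70))  # Korean
print(identify_language(0.80, 0.10, 0.20))  # Chinese
print(identify_language(0.20, 0.10, 0.20))  # Japanese
print(identify_language(0.50, 0.10, 0.20))  # reject
```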
Results of oriental language classification by using C, K and V values:
Language  Samples  Not processed   Recognition(%)  Error(%)  Reject(%)
----------------------------------------------------------------------
Chinese    114     1               94.69           4.43      0.88
Japanese    49     0               95.92           0.00      4.08
Korean     106     1               93.33           6.67      0.00       
Confusion matrix when using C, K and V values:
            Chinese       Japanese     Korean      Reject
----------------------------------------------------------
Chinese        107           5            0           1
Japanese       0            47            0           2
Korean         0             7           98           0
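The percentages in the results table follow from the confusion matrix by dividing each row entry by the number of processed documents in that row. The short sketch below reproduces this arithmetic; the function name is an assumption, and the computed values can differ from the printed table in the last digit because of rounding.

```python
# Confusion matrix when using C, K and V values (rows: true language).
confusion = {
    "Chinese":  {"Chinese": 107, "Japanese": 5,  "Korean": 0,  "Reject": 1},
    "Japanese": {"Chinese": 0,   "Japanese": 47, "Korean": 0,  "Reject": 2},
    "Korean":   {"Chinese": 0,   "Japanese": 7,  "Korean": 98, "Reject": 0},
}


def rates(true_label, row):
    """Return (recognition, error, reject) percentages for one row."""
    total = sum(row.values())            # processed documents only
    recognized = row[true_label]
    rejected = row["Reject"]
    errors = total - recognized - rejected
    return tuple(round(100 * n / total, 2)
                 for n in (recognized, errors, rejected))


results = {lang: rates(lang, row) for lang, row in confusion.items()}
for lang, (rec, err, rej) in results.items():
    print(f"{lang:<9} {rec:6.2f} {err:6.2f} {rej:6.2f}")
```

For example, the Chinese row has 113 processed documents (114 samples minus 1 not processed), so the recognition rate is 107/113, or about 94.69%.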
 
   In an effort to improve the results, a K-means clustering algorithm 
was also adopted. Based on the size of the training set in our database 
and on some preliminary results, 4 clusters are generated for each of the
three languages, and hence 12 clusters are used to represent the
training data. For a given test document, a full search through the
12 clusters is made in order to find the best match.
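The full search over the 12 clusters can be sketched as a nearest-centroid lookup. The centroid values below are invented for illustration, Euclidean distance in (C, K, V) space is an assumption, and the optional `reject_distance` parameter is a hypothetical way of producing rejections; the abstract does not specify the distance measure or the rejection criterion.

```python
import math

# Four hypothetical (C, K, V) centroids per language, 12 clusters in total.
CLUSTERS = [
    ("Chinese",  (0.80, 0.05, 0.20)), ("Chinese",  (0.75, 0.10, 0.25)),
    ("Chinese",  (0.70, 0.05, 0.30)), ("Chinese",  (0.85, 0.10, 0.15)),
    ("Japanese", (0.30, 0.05, 0.20)), ("Japanese", (0.35, 0.10, 0.25)),
    ("Japanese", (0.40, 0.05, 0.15)), ("Japanese", (0.25, 0.10, 0.30)),
    ("Korean",   (0.50, 0.60, 0.70)), ("Korean",   (0.55, 0.70, 0.65)),
    ("Korean",   (0.45, 0.65, 0.75)), ("Korean",   (0.50, 0.75, 0.80)),
]


def nearest_cluster(sample, clusters, reject_distance=None):
    """Search every cluster and return (label, distance) of the closest one.

    If `reject_distance` is given and the best match is farther away than
    that, the label 'reject' is returned instead (hypothetical rule).
    """
    best_label, best_dist = None, math.inf
    for label, centroid in clusters:
        d = math.dist(sample, centroid)  # Euclidean distance (assumed)
        if d < best_dist:
            best_label, best_dist = label, d
    if reject_distance is not None and best_dist > reject_distance:
        return "reject", best_dist
    return best_label, best_dist


print(nearest_cluster((0.52, 0.62, 0.70), CLUSTERS)[0])  # Korean
print(nearest_cluster((0.78, 0.06, 0.20), CLUSTERS)[0])  # Chinese
print(nearest_cluster((0.31, 0.06, 0.21), CLUSTERS)[0])  # Japanese
```

Using several clusters per language lets one language cover more than one region of the feature space, for instance different fonts.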
Results from clustering using C, K and V features:
Language   Samples  Not processed   Recognition(%)   Error(%)  Reject(%)
------------------------------------------------------------------------
Chinese    114       1               94.69           4.43      0.88
Japanese    49       0               97.96           0.00      2.04
Korean     106       1               97.14           1.91      0.95              
Confusion matrix from clustering using C, K and V features:
            Chinese       Japanese     Korean      Reject
----------------------------------------------------------
Chinese        107           5            0           1
Japanese       0            48            0           1
Korean         0             2            102         1 
       
   An analysis of the results indicates that Chinese documents tend to be 
recognized as Japanese when they are written in the Kai font, in which 
strokes are smooth and do not touch each other, so that fewer complex 
structures are detected. Some Korean documents are misclassified as 
Japanese when "ellipses" are used to represent "circles", because these 
"ellipses" look more like rectangles than circles.
           
   In summary, our method of LC works well in the processing of documents 
containing a mixture of both language groups (which is quite 
common in technical documents), provided that the non-host language 
content does not exceed about 20% of the whole document. It 
has also been developed to handle documents that might be written in any 
of 24 different languages, and it works well on Cyrillic documents, which
do not possess the same characteristics as documents in Roman languages.
However, our Korean circle detection method cannot perfectly separate
Korean circles from ellipses, and hence the recognition rate decreases 
when ellipses are used in certain Korean fonts. Also, for the Chinese Kai 
font, the complex structure cannot be detected easily, as the strokes of
this font are generally smooth and do not touch each other.
Acknowledgements
This research was supported by grants from the FCAR Program of the
Ministry of Education of Quebec and the Natural Sciences and Engineering
Research Council of Canada.
References
1. J. Ding, L. Lam, and C. Y. Suen, "Classification of Oriental and
   European scripts by using characteristic features," Proc. ICDAR'97,
   pp. 1023-1027.
2. D. S. Lee, C. R. Nohl, and H. S. Baird, "Language identification in
   complex, unoriented, and degraded document images," Proc. IAPR Workshop
   on Document Analysis Systems, Malvern, Pennsylvania, Oct. 1996, pp.
   76-98.
3. L. Lam and C. Y. Suen, "Application of majority voting to pattern
   recognition - an analysis of its behavior and performance," IEEE Trans.
   Syst., Man, and Cybern., vol. 27, pp. 553-568, Sept. 1997.