Short Course on Optical Character Recognition

October 26, 28, and 30, 1998

Optical Character Recognition (OCR) algorithms take as input
a scanned image of a paper document and produce as output a symbolic
text document (e.g. ASCII, Word, or HTML). Text produced by OCR
algorithms can be searched and indexed by information retrieval algorithms.
Although researchers have worked on the problem of OCR for at least
thirty years, there has been renewed interest in OCR
technology in recent years. This is partly due to:

i) the increasing need for efficient information storage and retrieval,
ii) the increasing need for cross-language information access, and
iii) the dramatic drop in scanner prices.

The purpose of this course is to teach the internals of an OCR system. Much of the time will be spent on OCR systems that are based on hidden Markov models (HMMs). The labs will allow you to experiment with sub-modules of OCR systems. No programming experience is necessary for the labs. Reading material will be provided at the course site.

Tentative course outline:

Day 1

Introduction:
- Image formation, scanning
- From pixels to words
Evaluation:
- Scientific methodology: hypothesis and test
- Datasets, groundtruth, string matching
- Limitations of commercial OCR systems
Anatomy of a standard OCR system:
- Top-down vs. bottom-up, noise removal, skew estimation, page segmentation, line detection, blobs and cutting tools, classifiers, lexicons.
LAB: Run commercial OCR systems (e.g. OmniPage and TextBridge) on a small set of documents, evaluate the OCR accuracy, and identify where the products perform poorly.
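
The evaluation topics above (groundtruth, string matching) can be illustrated with a minimal sketch. It assumes OCR accuracy is measured as character accuracy derived from the Levenshtein edit distance between the groundtruth text and the OCR output; this is one common choice, not necessarily the measure used in the lab. The function names edit_distance and character_accuracy are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: the minimum number of character insertions,
    deletions, and substitutions needed to turn ref into hyp."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i chars of ref into an empty string costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_accuracy(groundtruth: str, ocr_output: str) -> float:
    """Assumed metric: 1 - (edit distance / groundtruth length), floored at 0."""
    if not groundtruth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(groundtruth, ocr_output)
    return max(0.0, 1.0 - errors / len(groundtruth))


if __name__ == "__main__":
    # A typical OCR confusion: "rn" read as "m".
    print(edit_distance("modern", "modem"))
    print(character_accuracy("modern", "modem"))
```

Here the groundtruth string stands in for a manually keyed transcription of the scanned page; in practice the comparison would be run over whole pages or datasets.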

Day 2

Line segmentation:
Statistical Pattern Recognition:
- Probability, statistics, Bayes' theorem
- Clustering: K-Means
- Classification: Decision Trees, Neural Nets
Hidden Markov Models (HMM):
- HMM Formulation
- Forward Algorithm
- Viterbi Algorithm
- Baum-Welch Algorithm
LAB: Run programs for training and testing k-means clustering,
decision trees, and HMMs
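
As a taste of the HMM material listed above, here is a minimal sketch of the Viterbi algorithm, which finds the most likely hidden state sequence for a sequence of observations. The two-state model, its state names ("A", "B"), observation symbols ("x", "y", "z"), and all probabilities are made-up toy values for illustration and are not taken from the course labs.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely state path, probability of that path)."""
    if not observations:
        return [], 1.0
    # best[t][s]: probability of the best path that ends in state s at time t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    # back[t][s]: predecessor of s on that best path.
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] * trans_p[p][s])
            scores[s] = best[-1][prev] * trans_p[prev][s] * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model (all values are illustrative assumptions).
STATES = ("A", "B")
START_P = {"A": 0.6, "B": 0.4}
TRANS_P = {"A": {"A": 0.7, "B": 0.3},
           "B": {"A": 0.4, "B": 0.6}}
EMIT_P = {"A": {"x": 0.5, "y": 0.4, "z": 0.1},
          "B": {"x": 0.1, "y": 0.3, "z": 0.6}}

if __name__ == "__main__":
    path, prob = viterbi(["x", "y", "z"], STATES, START_P, TRANS_P, EMIT_P)
    print(path, prob)
```

In an HMM-based OCR system the hidden states would typically correspond to parts of characters and the observations to features extracted from the line image; multiplying raw probabilities as done here underflows on long sequences, so practical implementations usually work with log probabilities.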

Day 3

Hidden Markov Models (contd.):
- Connected words, two-level dynamic programming
- HMM training
OCR algorithm summary:
Topics in OCR:
The following sub-topics in OCR will be discussed briefly.
- Degraded documents
- Logical structure extraction
- Colored/textured background
- Evaluation of segmentation results
- Use of linguistic resources
- Voting OCR
- Tables, maps, line drawings, music
- Language identification, multilingual OCR
- Handwriting recognition: online and offline
- Duplicate documents
- Document categorization and routing
- Sensitive word/document detection and redaction
- OCR from camera/video
- Information retrieval from noisy OCR'd documents
- Retrieval without OCR
LAB: HMM OCR continued.

Last modified October 20, 1998.