Optical Character Recognition (OCR)


See also: the Microsoft Office Document Imaging 2003 (MODI) Object Model and Viewer OCX Control, which ships free with Microsoft Office 2003 and provides programmable image display and OCR functionality.


Overview

Optical character recognition (OCR) is the process of using computer systems to
translate images of typewritten text into machine-editable text (ASCII or Unicode). In
effect, it converts text in image form into its digital equivalent. For example, OCR
enables you to scan a book or a magazine article, feed the text in the scanned image
into an electronic file, and edit the file using a word processor.

Most OCR systems use a combination of hardware (specialized circuit boards) and
software to recognize characters. Some relatively inexpensive systems do it entirely
through software.

OCR typically involves three steps: scanning the document containing the text, analyzing
the scanned-in image, and translating the character images into digital characters. In
the translation process, the scanned-in image or bitmap is analyzed for light and dark
areas to differentiate between images and text and to identify each alphabetic letter
or numeric digit. When a character is recognized, it is converted into an ASCII code.
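
The light/dark analysis described above can be illustrated with a minimal sketch. It
assumes the scanned image is already available as rows of grayscale values from 0
(black) to 255 (white); the function names and the fixed threshold of 128 are
illustrative assumptions, not taken from any particular OCR product.

```python
def binarize(gray, threshold=128):
    """Map grayscale pixels to 1 (dark, likely ink) or 0 (light, likely paper).

    `gray` is a list of rows, each a list of ints in 0..255.
    The fixed threshold is an assumption; real systems often pick it adaptively.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def dark_fraction(bitmap):
    """Return the share of dark pixels in a binarized bitmap (0.0 if empty)."""
    total = sum(len(row) for row in bitmap)
    return sum(sum(row) for row in bitmap) / total if total else 0.0


if __name__ == "__main__":
    page = [[0, 255, 100], [200, 30, 250]]
    bitmap = binarize(page)
    print(bitmap)
    print(dark_fraction(bitmap))
```

A region with a very high dark fraction is more likely to be a picture than text, which
is one simple way such a bitmap could be used to tell the two apart.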

Early systems required "training" (essentially, the provision of known samples of each
character) to read a specific font. Recognition was then performed by matching the
character images against stored bitmaps based on those specific fonts. The "hit-or-miss"
results of such pattern recognition systems are often inaccurate.
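
Matching against stored bitmaps can be sketched as follows. The 3x3 templates, the
TEMPLATES table, and the scoring rule (fraction of agreeing pixels) are hypothetical
simplifications chosen for illustration; real systems use far larger bitmaps and more
robust comparisons.

```python
# Hypothetical "trained" font: one stored 3x3 bitmap per known character.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def match_score(glyph, template):
    """Return the fraction of pixels on which glyph and template agree."""
    pairs = [
        (g, t)
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    ]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognize(glyph, templates=TEMPLATES):
    """Return (best matching character, its score) for a glyph bitmap."""
    best = max(templates, key=lambda ch: match_score(glyph, templates[ch]))
    return best, match_score(glyph, templates[best])


if __name__ == "__main__":
    # An "L" with one missing pixel still matches "L" best, but not perfectly.
    print(recognize(["100", "100", "110"]))
```

Because every comparison is tied to the stored font, a glyph from a different typeface
or a damaged print can score poorly against all templates, which is the source of the
hit-or-miss behavior.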

Advances in technology have produced current "intelligent" systems that use neural
networks and artificial intelligence to recognize most fonts to a high degree of accuracy.
These advanced systems allow for background irregularities of printed ink on paper and
analyze the stroke edge and the line of discontinuity between the text characters. The
system then averages the variables and matches the results to known characters to
make a best guess as to what the character is. Multiple algorithms can be applied, with
their results averaged to obtain a single reading.
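
Averaging the results of several algorithms can be sketched like this. It assumes each
recognizer reports a confidence between 0 and 1 for every candidate character; the
function name combine and the sample confidences are made up for illustration.

```python
from collections import defaultdict


def combine(readings):
    """Average per-character confidences across recognizers and pick the best.

    `readings` is a list of dicts mapping candidate character -> confidence.
    A candidate missing from one reading counts as confidence 0 there.
    """
    if not readings:
        raise ValueError("at least one reading is required")
    totals = defaultdict(float)
    for reading in readings:
        for char, confidence in reading.items():
            totals[char] += confidence
    averages = {char: total / len(readings) for char, total in totals.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]


if __name__ == "__main__":
    # Three hypothetical algorithms disagree on letter "O" versus digit "0".
    readings = [
        {"O": 0.6, "0": 0.4},
        {"O": 0.5, "0": 0.5},
        {"O": 0.9, "0": 0.1},
    ]
    print(combine(readings))
```

Averaging keeps one unsure algorithm from deciding the outcome on its own; a weighted
average would be a natural variation if some algorithms are known to be more reliable.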


Common Problems

OCR is generally accurate when the text is sharply printed. However, when the
characters are broken or not properly printed, OCR typically fails to recognize the text.

Pages with complex formatting, smudges, and unusual fonts may require more
processing power and time. For example, a low-contrast, creased page from a
newspaper will take considerably more time to process than will a clean, crisp,
high-contrast printout on laser-quality paper.

Most software uses white space to try to recognize the text in the appropriate order.
Complex formatting (e.g. a typical multi-column newspaper layout), including cross-column
headings, tables, indented text, footnotes, headers, text wrapped around images, and
margin notes, confuses the order of the text in the OCR process and requires manual
delineation prior to OCR. This delineation process is typically referred to as zoning.
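
The use of white space to separate regions can be sketched as a search for blank
vertical gutters in a binarized page. This is only an illustration of the idea under
simple assumptions (a clean bitmap of 1s for ink and 0s for paper, columns separated by
fully blank pixel columns); the function name find_column_zones and the min_gap
parameter are hypothetical.

```python
def find_column_zones(bitmap, min_gap=2):
    """Split a binary page bitmap into column zones at blank vertical gutters.

    Returns a list of (start, end) pixel-column ranges, end exclusive.
    A gutter must be at least `min_gap` blank pixel columns wide to split zones.
    """
    width = len(bitmap[0]) if bitmap else 0
    # True for every pixel column that contains at least one ink pixel.
    ink = [any(row[x] for row in bitmap) for x in range(width)]

    zones, start, gap = [], None, 0
    for x, has_ink in enumerate(ink):
        if has_ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The zone ended where the current blank run began.
                zones.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        zones.append((start, width - gap))
    return zones


if __name__ == "__main__":
    page = [
        [1, 1, 0, 0, 0, 1, 1],
        [1, 0, 0, 0, 0, 0, 1],
    ]
    print(find_column_zones(page))
```

A cross-column heading would fill the gutter with ink and merge the two zones, which
shows why such layouts defeat purely white-space-based ordering and call for manual
zoning.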

Misapplying lexicons or mixing character sets (e.g., when more than one language
dictionary is loaded) presents additional linguistic complexities to the OCR system.

The character sets of certain languages might not be supported.

Images interspersed throughout the text will usually be ignored by the OCR software;
they will be dropped from simple output formats such as ASCII.

Other common problems are that the source text may itself contain spelling mistakes
and that the text images may be of low quality (e.g. the images were captured at low
resolution or the original physical documents are stained).




 

