Optical Character Recognition (OCR)
See also: The
free Microsoft Office Document Imaging 2003 (MODI)
Object Model and Viewer OCX Control, which ships with
Microsoft Office 2003 and provides programmable image display and OCR
functionality.
Overview
Optical character recognition (OCR) is the process of
using computer systems to
translate images of typewritten text into
machine-editable text (ASCII or Unicode). This
essentially converts text in image form into its
digital equivalent. For example, OCR
enables you to scan a book or a magazine article; feed
the text in the scanned image
into an electronic file; and edit the file using a word
processor.
Most OCR systems use a combination of hardware
(specialized circuit boards) and
software to recognize characters. Some relatively
inexpensive systems do it entirely
through software.
OCR typically involves photo-scanning of the document
containing the text; analysis of
the scanned-in image; and translation of the character
image into digital characters. In
the translation process, the scanned-in image or bitmap
is analyzed for light and dark
areas to differentiate between images and text in order
to identify each alphabetic letter
or numeric digit. When a character is recognized, it is
then converted into an ASCII code.
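The light/dark analysis described above can be illustrated with a simple fixed threshold. This is only a minimal sketch: the 4x4 sample bitmap, the threshold of 128, and the function names `binarize` and `ink_ratio` are assumptions made for illustration, not part of any particular OCR product.

```python
def binarize(gray, threshold=128):
    """Return a bitmap where 1 marks a dark (ink) pixel and 0 a light one."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def ink_ratio(bitmap):
    """Fraction of pixels in the bitmap that are dark."""
    total = sum(len(row) for row in bitmap)
    return sum(sum(row) for row in bitmap) / total if total else 0.0


# Hypothetical grayscale scan of one character: 0 = black ink, 255 = white paper.
scan = [
    [250, 245,  30, 240],
    [248,  25,  20, 251],
    [252, 249,  28, 247],
    [251, 250,  22, 249],
]

bitmap = binarize(scan)
for row in bitmap:
    print("".join("#" if pixel else "." for pixel in row))

# Once a character has been recognized, it is stored as a character code.
print(ord("1"))  # ASCII code of the digit "1"
```

A fixed threshold is the simplest possible choice; it is used here only because it keeps the light/dark separation easy to follow.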
Early systems required "training"
(essentially, the provision of known samples of each
character) to read a specific font. Recognition was performed by
matching the character images
against stored bitmaps based on specific fonts. The
"hit-or-miss" results of such pattern
recognition systems were often inaccurate.
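Matching a character image against stored bitmaps can be sketched as a pixel-by-pixel comparison. The 3x3 templates, the agreement score, and the names `TEMPLATES`, `match_score`, and `recognize` below are hypothetical and far smaller than anything a real system would use; they only show why the approach is tied to one font and sensitive to damaged characters.

```python
# Hypothetical stored bitmaps for a single font ("1" = ink, "0" = paper).
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def match_score(glyph, template):
    """Fraction of pixels that agree between a glyph and one stored template."""
    pairs = [
        (g, t)
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    ]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognize(glyph, templates=TEMPLATES):
    """Return (character, score) for the closest stored template."""
    scored = ((ch, match_score(glyph, tpl)) for ch, tpl in templates.items())
    return max(scored, key=lambda item: item[1])


clean = ["111", "010", "010"]
broken = ["101", "010", "010"]  # one ink pixel lost to poor printing

print(recognize(clean))
print(recognize(broken))
```

The clean glyph matches its template exactly, while the broken one only matches approximately; with more damage, or with a font that was never stored, the closest template can easily be the wrong character.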
Advances in technology have produced current
"intelligent" systems that use neural
networks and artificial intelligence to recognize most
fonts to a high degree of accuracy.
These advanced systems allow for background
irregularities of printed ink on paper and
analyze the stroke edge and the line of discontinuity
between the text characters. The
system then averages the variables and matches the
results to known characters to
make a best guess as to what the character is. Multiple
algorithms can be applied, and
their results are then averaged to
obtain a single reading.
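One way to picture the averaging of several algorithms' results is to average the confidence each one assigns to each candidate character and keep the highest average. The sketch below assumes each recognizer reports a confidence between 0 and 1 per candidate; the function name `combine` and the sample values are made up for illustration.

```python
from collections import defaultdict


def combine(readings):
    """Average per-character confidence across recognizers and pick the best.

    `readings` is a list of dicts mapping a candidate character to a
    confidence in the range 0..1. A recognizer that does not propose a
    character contributes 0 for it.
    """
    if not readings:
        raise ValueError("at least one reading is required")
    totals = defaultdict(float)
    for reading in readings:
        for ch, confidence in reading.items():
            totals[ch] += confidence
    averaged = {ch: total / len(readings) for ch, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged


# Hypothetical outputs of three recognizers for one ambiguous glyph.
readings = [
    {"O": 0.6, "0": 0.4},
    {"O": 0.3, "0": 0.7},
    {"O": 0.5, "0": 0.5},
]

best, averaged = combine(readings)
print(best, averaged)
```

Here no single recognizer is decisive, but the averaged scores favor the digit "0" over the letter "O".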
Common Problems
OCR is generally accurate when the text is sharply printed. However, when the
characters are broken or not properly printed, OCR typically fails to recognize the text.
Pages with complex formatting, smudges, and unusual fonts may require more
processing power and time. For example, a low-contrast, creased page from a
newspaper will take considerably more time to process than will a clean, crisp,
high-contrast printout on laser-quality paper.
Most software uses white space to try to read the text in the appropriate order.
Complex formatting (e.g., a typical multi-column newspaper layout), including cross-column
headings, tables, indented text, footnotes, headers, text wrapped around images, and
margin notes, confuses the order of the text in the OCR process and requires manual
delineation prior to OCR. This delineation process is typically referred to as zoning.
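The use of white space to find reading order can be sketched by looking for ink-free vertical strips (gutters) between columns. This is a simplified illustration under stated assumptions: the page is a small binary bitmap, a gutter is any run of at least `min_gap` empty pixel columns, and `find_columns` is a hypothetical helper, not a function of any real OCR package.

```python
def find_columns(bitmap, min_gap=2):
    """Split a binary page (1 = ink) into column zones separated by white space.

    Returns (start, end) x-ranges, end exclusive, in left-to-right order.
    A run of at least `min_gap` ink-free pixel columns counts as a gutter.
    """
    width = len(bitmap[0])
    has_ink = [any(row[x] for row in bitmap) for x in range(width)]
    zones, start, gap = [], None, 0
    for x, ink in enumerate(has_ink):
        if ink:
            if start is None:
                start = x  # first inked pixel column of a new zone
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                zones.append((start, x - gap + 1))  # close zone before the gutter
                start, gap = None, 0
    if start is not None:
        zones.append((start, width - gap))  # zone that runs to the page edge
    return zones


# Hypothetical two-column page.
page = [
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    [1, 0, 1, 0, 0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 1, 1, 0],
]
print(find_columns(page))

# The same page with a heading that spans both columns.
with_heading = [[1] * 10] + page
print(find_columns(with_heading))
```

With the cross-column heading added, the gutter is no longer empty and the two columns collapse into a single zone, which is the kind of layout where manual zoning is needed.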
Misapplication of lexicons or mixing character sets (e.g., when more than one language
dictionary is loaded) presents additional linguistic complexities
to the OCR system.
The character sets of certain languages might not be supported.
Images interspersed throughout the text will usually be ignored by the OCR software;
they will be dropped from simple output formats such as ASCII.
Other common problems include spelling mistakes in the source text and
low-quality text images (e.g., images captured at low resolution or
original physical documents that are stained).