December 1998
The State of Optical Character Recognition
Arthur Gingrande
Whether bundled with your scanner, embedded in your forms processing or customized for your app, OCR is more accurate and versatile than ever before. Here's a rundown on the technology and a recap of leading products.
The old proverb inspires a riddle for imaging geeks: when is a picture not worth a thousand words? OCR buffs know the answer: when it's a picture of a thousand words. The logic behind the answer, of course, is that pictures of words are not computer-usable data by themselves. To become real information, their bitmap images must be converted into ASCII data by using optical character recognition. Besides being a conversion tool, OCR is also a form of data compression, because ASCII characters take up less than 10% of the disk space of their corresponding bitmap representation.
OCR refers to optical recognition of machine-printed characters. It has been in use for more than 40 years. Originally a "dumb" application based upon template matching and feature extraction, OCR has evolved into a "smart" application that employs neural networks, topological analysis and other sophisticated technologies to accurately classify characters. OCR should not be confused with ICR (intelligent character recognition), which relies on similar innovations and has come to mean recognition of hand-printed characters.
OCR use breaks out over four major application areas: automated forms processing, text recognition, content recognition and raster-to-vector conversion. While all utilize the same OCR engines, they differ significantly in configuration and support requirements.
Forms Processing
Because forms are primarily a means for formally communicating structured data, forms processing OCR deals with what is, for the most part, expected. Form recognition zones are tightly defined in terms of location and attributes. Typically, data fields are located by creating a form template that maps out the coordinates of the recognition zones. These zones are then defined in terms of the number of expected characters, character type (alpha, numeric or alphanumeric), field type (date, social security number, etc.) and field syntax (predefined alpha and numeric order of characters). Sophisticated forms removal software, provided independently of the OCR software, is used to drop out the "passive" form data from the "active" machine printed data.
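The zone definitions described above can be pictured as a small data structure. The Python sketch below is illustrative only: the class name Zone, its fields, the pixel coordinates and the sample fields are all assumptions made for this example, not the template format of any vendor's product.

```python
import re
from dataclasses import dataclass


@dataclass
class Zone:
    """One recognition zone on a form template (hypothetical layout)."""
    name: str
    box: tuple        # (left, top, right, bottom) in pixels on the scanned form
    max_chars: int    # number of expected characters
    char_type: str    # "alpha", "numeric" or "alphanumeric"
    syntax: str = ""  # optional regular expression describing the field syntax

    def accepts(self, text: str) -> bool:
        """Return True if the recognized text fits this zone's definition."""
        if len(text) > self.max_chars:
            return False
        if self.char_type == "alpha" and not text.isalpha():
            return False
        if self.char_type == "numeric" and not text.isdigit():
            return False
        if self.char_type == "alphanumeric" and not text.isalnum():
            return False
        if self.syntax and not re.fullmatch(self.syntax, text):
            return False
        return True


# A five-digit ZIP code field and a product code field whose syntax is
# "two capital letters followed by four digits" (both made up for illustration).
zip_code = Zone("zip", (120, 400, 260, 440), 5, "numeric")
part_no = Zone("part_no", (120, 460, 320, 500), 6, "alphanumeric", r"[A-Z]{2}\d{4}")

print(zip_code.accepts("02174"))   # digits only, fits the zone
print(zip_code.accepts("O2174"))   # letter O misread for zero is rejected
print(part_no.accepts("A81234"))   # digit where a letter is expected is rejected
```

Because the engine knows in advance that a zone holds only digits, a result such as "O2174" can be flagged before it ever reaches an operator.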
Forms processing accuracy tends to be greater than text recognition accuracy for a number of reasons. There are fewer fonts to classify: most machine print on forms is Times Roman, Arial (Helvetica) or Courier. All data is predefined on form templates, which gives the OCR engine advance and precise notice of what to expect.
Validation routines can be set up that compare interrelated data fields with each other. For instance, recognized columns of numbers can be added up and checked against the "sum" data field. Moreover, application-specific dictionaries (e.g., product names and codes) and lookup tables (e.g., ZIP codes) can be used to verify certain words that are unique to a particular application.
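As a rough illustration of the two kinds of checks just mentioned, here is a minimal Python sketch. The function names, the amounts (held in cents to avoid rounding surprises) and the short ZIP code list are hypothetical; a production system would draw its lookup tables from the application's own data.

```python
def column_total_matches(amounts, recognized_total):
    """Cross-field check: do the recognized line items add up to the total field?"""
    return sum(amounts) == recognized_total


def in_lookup_table(value, table):
    """Lookup check: is the recognized value one the application knows about?"""
    return value in table


# Hypothetical recognition results, in cents.
line_items = [1250, 399, 4500]

# Placeholder lookup table; a real one would hold every valid ZIP code.
KNOWN_ZIPS = {"02174", "92121", "95030"}

print(column_total_matches(line_items, 6149))  # items agree with the total field
print(column_total_matches(line_items, 6749))  # a misread digit in the total is caught
print(in_lookup_table("O2174", KNOWN_ZIPS))    # letter O for zero fails the lookup
```

A failed check does not say which field is wrong, only that the group of fields needs a second look.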
Vendors that supply OCR engines to forms processing software developers for resale include Mitek (San Diego, CA 619-635-5900 www.miteksys.com), AEG from Siemens/CGK (Vienna, VA 703-848-2117 www.cgk.de), Caere (Los Gatos, CA 408-395-7000 www.caere.com), ScanSoft (formerly Xerox Imaging Systems, Peabody, MA 978-977-2000 www.scansoft.com) and NewSoft America (Fremont, CA 510-445-8600), which acquired Maxsoft Ocron's OCR product line.
Major applications in forms processing are defined by industry and by form type, and include processing of medical claims, internal revenue forms, surveys, warranty cards, catalogue orders, credit card applications, remittances, mutual funds forms, proxy statements and all manner of state and local government forms.
Text Recognition
Text recognition deals with more unknowns than forms processing. The biggest difference is the increase in the number of fonts that must be classified. The dictionary used in text recognition is generic (e.g., English or Spanish), not specific (e.g., a specific manufacturer's product codes). Validation routines are non-existent, and format detection is extremely complex. OCR software must be able to accurately segment text from photographs, illustrations, and other graphic objects. If the physical layout of the page is an important factor, then the OCR software must be able to retain and reproduce the original format of the page (format retention capability) so that a printed version of the OCR translation would closely reproduce the bitmap image of the page.
With such broad requirements and so many variables to accommodate, it's not surprising that OCR accuracy in text recognition is lower than it is in forms processing. For example, the norm for per-character recognition in forms processing is equivalent to human accuracy, which is defined as 99.5%. In a pure text environment, this figure could drop to as low as 98%, depending upon font size and complexity of the page format.
Because of its very nature, automated forms processing is high-volume, batch-oriented and accuracy-intensive. The same holds true for high-production business text recognition applications, such as resume processing and recognition of legal documents (litigation support).
OCR accuracy, of course, is never unimportant. But in text recognition applications in a desktop environment (OCR of individual fax transmissions and email attachments, for example), much more emphasis is placed upon features like format retention and user-friendliness.
To do text recognition, OCR vendors must supply more than a recognition engine. They must provide the user interface and supportive software. Vendors such as Caere, ScanSoft and Expervision (Fremont, CA 510-623-7071 www.expervision.com) integrate all the features required for the job -- including page segmentation, zone definition, image cleanup and enhancement, and format retention -- into their software packages.
Content Recognition
The goal of content recognition is to replicate a page in a way that is as faithful as possible to the appearance of the original while demanding very little in the way of storage requirements. The PDF file format produced by Adobe Capture has set a new model for content recognition of free text pages. To this end, OCR functions as much as a data compression vehicle as it does a conversion device. Because the goal is to faithfully reproduce the image of the subject page, the choice of font, say, is as important as recognizing the characters. In PDF, images retain their graphic character. Font and graphic locations on the original should correspond to those on the new page converted to PDF format.
Strictly speaking, the translated text output of PDF is not pure text. In the case of low confidence characters, a bitmap image of those characters is substituted for the usual OCR symbol for a questionable character (such as a tilde). That is because PDF conversion is intended for publishing (in print or on the Web) rather than data entry. Consequently, each PDF page is content-independent of every other PDF page in the same publication.
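The difference between the two ways of handling a questionable character can be sketched in a few lines of Python. This is not how any particular product works internally; the confidence scale of 0 to 1, the 0.80 threshold and both function names are assumptions chosen purely for illustration.

```python
def mark_questionable(chars, threshold=0.80, marker="~"):
    """Data entry style: replace each low-confidence character with a marker."""
    return "".join(c if conf >= threshold else marker for c, conf in chars)


def to_segments(chars, threshold=0.80):
    """Publishing style: keep confident text as runs and note where a bitmap
    clip of the original character would be substituted."""
    segments = []
    for i, (c, conf) in enumerate(chars):
        if conf < threshold:
            segments.append(("image", i))   # index of the character to clip
        elif segments and segments[-1][0] == "text":
            segments[-1] = ("text", segments[-1][1] + c)
        else:
            segments.append(("text", c))
    return segments


# Hypothetical engine output: (character, confidence) pairs.
recognized = [("O", 0.99), ("C", 0.97), ("R", 0.45), ("!", 0.90)]

print(mark_questionable(recognized))
print(to_segments(recognized))
```

The first function leaves a visible flag for an operator to correct; the second preserves the look of the page at the cost of leaving that character unsearchable.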
Because of the popularity of the Adobe PDF format, big vendors like Caere and Xerox now offer PDF as an output format. Prime Recognition (San Carlos, CA 650-637-8382) has taken the process one step further by developing a special engine that can output PDF faster and more accurately than Adobe itself.
Raster-to-Vector Conversion
Forms raster-to-vector conversion is an application originally developed as an aid for forms creation and design. It is similar to content recognition in that both applications involve converting the bitmap image of a document into a file format that faithfully reproduces the appearance of the original document while reducing the storage requirement significantly.
Unlike content recognition, raster-to-vector conversion is a form-design application rather than a publishing application. The process allows images such as company logos to retain their character as bitmap images, while images of lines, boxes and other geometric shapes are transformed into vector representations of design objects. The latter can be redrawn and reformatted during the process of updating and redesigning a given form. The text that composes the names of data fields and the fine print on a form is optically recognized and translated into ASCII using OCR. Font selection is left up to the user when they correct the converted form.
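To give a feel for what turning a line image into a vector object means, the Python sketch below scans a tiny black-and-white bitmap for horizontal runs of black pixels and reports each long run as a line segment. It is a deliberately minimal illustration, assuming a bitmap held as lists of 0s and 1s; the function name and the minimum length are made up, and commercial packages use far more robust techniques.

```python
def horizontal_lines(bitmap, min_length=4):
    """Return (row, start_col, end_col) for each horizontal run of black
    pixels (1s) that is at least min_length pixels long."""
    segments = []
    for y, row in enumerate(bitmap):
        start = None
        # A trailing 0 is appended so a run touching the right edge is closed.
        for x, pixel in enumerate(row + [0]):
            if pixel and start is None:
                start = x
            elif not pixel and start is not None:
                if x - start >= min_length:
                    segments.append((y, start, x - 1))
                start = None
    return segments


# A made-up 8-pixel-wide image: one rule on the top row, a short speck and
# a second rule on the bottom row.
bitmap = [
    [0, 1, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 1, 1],
]

print(horizontal_lines(bitmap))
```

Each segment is just a row and two endpoints, which is why a vector form file takes so much less room than the bitmap and why its lines can be moved or resized during redesign.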
Many electronic forms vendors have attempted to develop a forms conversion package, but GDI (Novato, CA, 415-382-6600) was the first to achieve technical success in the early 1990s with its Windform OCR package. Caere later developed a Windform OCR clone called OmniForm, which has become one of Caere's flagship products.
The initial objective of each product was primarily graphical -- to produce a forms conversion and design utility. Now each has become the front end of systems that are standalone data entry tools.
Caere has developed an array of data validation, data management and Web form tools that come embedded in OmniForm. With the help of a wizard, users can easily scan in a form, OCR all of the data field labels and edit them, create fillable fields, then database-enable the form. Features like spell-checkers, dictionaries and auto-calculation enable data validation. Built-in intelligence allows automatic filling in of repetitive fields.
Forms can be saved to Microsoft Word, giving others free access to them. Forms designed in MS Word can be automatically converted into the OmniForm format. Using Microsoft Exchange routing slips, forms can be easily routed around a network for multi-workstation use, or faxed or printed. Exporting the form to OmniForm Web Publisher allows forms to be converted to HTML or PDF format for publication at Websites.
GDI's electronic forms product family -- Windform 3.0, Windform for ICR, Scan and Type, DataGold ICR and Iforms, all compatible with Windform 3.0 OCR -- replicates and in some ways goes beyond OmniForm.
For example, Windform OCR's ActiveX controls embed forms in a Web page with the native intelligence of original form objects. This puts Windform 3.0 on an equal footing with Caere's OmniForm Web Publisher product. Forms exported to or designed using Windform for ICR become "ICR friendly." Iforms enables Web and PDF output. Finally, DataGold ICR enables forms processing on filled-out forms employing NestorReader ICR and TextBridge OCR.
What About Speed And Accuracy?
What can you expect to get out of a retail shrink-wrapped, off-the-shelf OCR package in terms of speed and accuracy in a text recognition application? It all depends upon what you are recognizing and what you're using to recognize it.
For example, on a 300 MHz Pentium Pro, the speed of recognizing a clean, laser-printed, three-column article can exceed 200 characters per second at a word recognition accuracy rate of 99.5%. This means that if the article is 1,000 words long, then you'll only have to correct five words instead of typing all 1,000.
On the other hand, a fax copy of the same document would go a lot slower -- less than 90 CPS -- and the accuracy can get as bad as 89%. That's because the engine has to work harder. The cleaner and sharper you can make your originals, the better time your OCR engine will have interpreting them.
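The figures above translate into a quick back-of-the-envelope comparison. The Python sketch below assumes an average of six characters per word (including the trailing space) and treats the 89% figure as a word accuracy rate; both are assumptions for illustration, as is the function name. Because the fax speed is quoted as "less than 90 CPS," the fax time it computes is a best case.

```python
def ocr_estimate(words, chars_per_word, cps, word_accuracy):
    """Return (seconds to recognize, number of words needing correction)."""
    seconds = words * chars_per_word / cps
    corrections = round(words * (1 - word_accuracy))
    return seconds, corrections


# The 1,000-word article from the text, clean laser print versus fax copy.
clean = ocr_estimate(1000, 6, 200, 0.995)
fax = ocr_estimate(1000, 6, 90, 0.89)

print("clean original: %.0f seconds, %d corrections" % clean)
print("fax copy:       %.0f seconds, %d corrections" % fax)
```

Under these assumptions the clean original needs about half a minute and five corrections, while the fax copy takes more than twice as long and leaves about 110 words to fix.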
Arthur Gingrande is a partner of Imerge Consulting. He is based in Arlington, MA, and can be reached at arthur@imergeconsult.com (781-646-1893).