Intelligent Enterprise featuring Transform

December 1998

Context and Validation Boost ICR Accuracy

All contemporary ICR engines augment their recognition algorithms with robust context analysis and post-edit routines -- further supplemented by table look-ups plus inter-field and intra-field data validation procedures -- to obtain the highest possible ICR accuracy. Once a user learns how to optimize these tools, they can control and calibrate ICR accuracy.

When ICR was first introduced, there was great emphasis on so-called "raw" recognition rates. An ICR engine that substituted only a "U" for a hand-printed "V" in the word "invoice" was therefore judged to be twice as accurate as an engine that substituted the characters "0" (zero) and "1" (one) for the hand-printed letters "O" and "I" in the same word.

Introducing context by applying an English dictionary against recognition results could correct the first error. The two errors in the second example could be corrected simply by designating the "invoice" field as alphabetic-only. Embedding context analysis and dictionaries within the ICR software ensures 100% accuracy in both cases.
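The two corrections described above can be sketched in a few lines of Python. This is only an illustration, not how any particular ICR engine works: the look-alike table, the function names, and the sample word list are all assumptions, and the dictionary match uses the standard library's difflib as a stand-in for a real context engine.

```python
import difflib

# Hypothetical look-alike table: digits an engine might emit for hand-printed letters.
DIGIT_TO_LETTER = {"0": "O", "1": "I", "5": "S", "8": "B"}


def apply_alpha_only(text):
    """Replace look-alike digits with letters in a field declared alphabetic-only."""
    return "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in text)


def correct_with_dictionary(word, dictionary, cutoff=0.8):
    """Return the closest dictionary word, or the input unchanged if none is close."""
    if word in dictionary:
        return word
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Illustrative word list; a real application would load its own dictionary.
WORDS = ["INVOICE", "INVOKE", "NOTICE"]

print(apply_alpha_only("INV01CE"))                # digits mapped back to letters
print(correct_with_dictionary("INUOICE", WORDS))  # "U" for "V" repaired by the dictionary
```

Both calls recover "INVOICE". The cutoff value controls how aggressively the dictionary overrides the raw result; set it too low and valid but unusual entries may be "corrected" into the wrong word.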

Context analysis can be as simple as informing the ICR engine that the character field in question is a date field (and will therefore not contain alphabetic text). Or it can be a complex task that involves telling the ICR software to look for product codes that follow a particular alpha-numeric syntax, with predefined ranges of values for specific characters. Even knowing the type of form being used can help an ICR system immensely. For example, there are words that appear on a medical claim form that would never be found on a payroll report.
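As a rough sketch of what field-level context can look like in practice, the Python below checks a date field and a product-code field. The date format and the product-code syntax (two letters, a hyphen, then four digits with the first digit restricted to 1 through 5) are invented for illustration and do not come from any vendor's product.

```python
import re
from datetime import datetime

# Hypothetical product-code syntax: two capital letters, a hyphen, and four
# digits, where the first digit must fall in the range 1-5.
PRODUCT_CODE = re.compile(r"[A-Z]{2}-[1-5][0-9]{3}")


def is_valid_date(text, fmt="%m/%d/%Y"):
    """A date field cannot contain letters and must name a real calendar day."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


def is_valid_product_code(text):
    """Accept the field only if the whole string matches the declared syntax."""
    return PRODUCT_CODE.fullmatch(text) is not None


print(is_valid_date("12/01/1998"))       # True
print(is_valid_date("12/O1/1998"))       # False: letter "O" in a date field
print(is_valid_product_code("AB-1234"))  # True
print(is_valid_product_code("A8-1234"))  # False: digit where a letter is required
```

A field that fails either check can be routed to an operator for review or re-read with a narrower character set.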

The following analytical and validation tools are used by most ICR software developers to improve recognition performance:

  • Edit masks - customized programming templates for specific form applications (like medical claims) that predefine data field attributes such as number of characters, alphanumeric syntax, punctuation, and acceptable values, as well as data validation routines.
  • Customized dictionaries - application-specific dictionaries for vertical markets such as order entry, medical claims, financial services, etc. All of the major forms processing vendors include tools for creating user-defined, application-specific dictionaries.
  • Lookup tables - lists of acceptable data field entries: ZIP codes, employee names, serial numbers, etc. Forms processing vendors such as Cardiff, Datacap, MTI and Captiva independently developed database validation based upon lookup tables; now, through the proliferation of OLE and ODBC utilities, validation has become a universal practice.
  • Range checking - defining a specific set of acceptable values for a given field using operators such as "greater than," "less than," "except," etc.
  • Relationship validation - use of equations or arithmetical algorithms to define and verify constant logical or mathematical relationships between various data elements on a form. For example, "gross salary amount" × 0.075 = "monthly pension contribution."
  • Check sum digits - a means of validating digit fields such as VISA card numbers, serial numbers, etc. by using a formula in which the last digit is always derived from performing specific mathematical operations on the preceding digits in the numeric character string.
  • Spell checking - use of spell checkers that can be user-modified to fit the vocabulary of a given form application.
  • Rules of grammar - use of grammatical rules, first used by Nestor, to check spelling and word arrangements in open fields (e.g., "I" before "E" except after "C", "U" follows "Q", etc.)
  • Trigram analysis - method of word analysis pioneered by AEG that examines characters in groups of three; for example, "earthquake" is the only English word that contains the "thq" three-letter sequence.
  • ICR voting - using two or more ICR engines simultaneously and then comparing the results to gain consensus. For example, some ICR engines are better at telling the difference between the letter "m" and the letter combination "rn."
  • Specialized neural nets - using a specialized neural net within an ICR engine to recognize a set of characters often confused with each other. This practice is similar to ICR voting and can be just as accurate. Both Mitek and NestorReader employ this approach.
  • Database validation - since the basic nature of a form is such that its data elements are highly structured and heavily interrelated, data validation routines are a powerful tool for boosting accuracy. By using the right validation routine, an illegible data field (say a patient name) that would ordinarily create low-confidence character choices can be automatically compared against another related field in the same form (say a patient ID number) that is more legible.
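Three of the tools listed above (range checking, relationship validation, and check sum digits) reduce to short tests that can be run on every recognized field. The Python below is a minimal sketch under stated assumptions: the function names, the one-cent tolerance, and the sample values are hypothetical. The check-digit test uses the Luhn formula, which payment card numbers follow; other numbering schemes use other formulas.

```python
from decimal import Decimal


def in_range(value, low, high, excluded=()):
    """Range check: low <= value <= high, except any explicitly excluded values."""
    return low <= value <= high and value not in excluded


def pension_matches(gross, pension, rate=Decimal("0.075"), tolerance=Decimal("0.01")):
    """Relationship check: gross salary x rate should equal the pension contribution."""
    return abs(gross * rate - pension) <= tolerance


def luhn_valid(number):
    """Check-digit test: the last digit must agree with the digits that precede it."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    for index, ch in enumerate(reversed(number)):
        digit = int(ch)
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(in_range(40, 0, 80))                                      # True
print(pension_matches(Decimal("4000.00"), Decimal("300.00")))   # True
print(pension_matches(Decimal("4000.00"), Decimal("800.00")))   # False: "3" misread as "8"
print(luhn_valid("4111111111111111"))                           # True: a common test number
print(luhn_valid("4111111111111112"))                           # False: one digit misread
```

A failed check does not say which character is wrong, only that the field cannot be right as read, which is usually enough to flag it for a second look.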

A mere mortal cannot be expected to compare all of the ZIP codes in the United States against a specific numeric field. Nor could anyone mentally validate a product code against a dictionary of item numbers from the entire L.L. Bean catalog. But computers can do these tasks in the blink of an eye.
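To make the point concrete, here is a small Python sketch of lookup-table validation and of using a legible field to rescue an illegible one. Everything in it is assumed for illustration: the tables hold made-up sample entries (a real system would query a database), and the similarity threshold is an arbitrary starting value.

```python
import difflib

# Hypothetical lookup tables with sample entries only.
VALID_ZIPS = {"04032", "10001", "94105"}
PATIENTS = {"100234": "MARGARET OLSEN", "100235": "DAVID CHEN"}


def zip_is_known(zip_code):
    """Lookup-table check: set membership is a constant-time test."""
    return zip_code in VALID_ZIPS


def resolve_name(patient_id, recognized_name, min_similarity=0.6):
    """Use a legible patient ID to confirm or repair a poorly recognized name.

    Returns the name on file when it is reasonably close to what the engine
    read, and None when the ID is unknown or the two fields disagree.
    """
    expected = PATIENTS.get(patient_id)
    if expected is None:
        return None
    similarity = difflib.SequenceMatcher(None, recognized_name, expected).ratio()
    return expected if similarity >= min_similarity else None


print(zip_is_known("04032"))                     # True
print(resolve_name("100234", "MARCARET 0LSEN"))  # MARGARET OLSEN
print(resolve_name("100235", "MARCARET 0LSEN"))  # None: name and ID disagree
```

Returning None when the fields disagree is a deliberate choice here: a mismatch may mean the ID was misread rather than the name, so the form should go to an operator instead of being silently "corrected."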



A Publication of the Network Computing Enterprise Architecture Group
Brought to you by CMP Media LLC, Copyright © 2005