Data Capture MVPs
What's new in data capture? Web-based forms, better forms ID and color support are pointing the way to the future.
On the Web front, vendors including Cardiff, ComCom Systems, Microsystems Technology, Mitek and National Computer Systems have added support for electronic forms to their production forms processing systems. These systems apply the same validation and processing rules to forms filled out on the Web that they do to paper forms. Others have provided support for scanning forms and verifying data over the Internet from a remote office, a home or an offshore operation.
Color document imaging is still in the minor leagues, but it holds a lot of promise for forms processing. Several scanners have been introduced this year that offer affordable color document scanning in high volumes, and software vendors are starting to support them.
New identification and sorting technologies have improved form ID. The latest products can distinguish between very similar forms. They're also recognizing data from non-forms, such as letters and reports.
Another trend has been the convergence of data and image capture. Forms processing companies like Datacap and, more recently, Cardiff and Captiva have moved into the image capture market. Meanwhile, image capture vendors such as Kofax have added forms processing (a.k.a., data extraction) features.
This article focuses on the data side of capture, what many still call forms processing. In each of the sections that follow, we examine the specific features you need to look for in the latest forms products. Each section corresponds to a category on our exhaustive table on pages 22 and 23. In the text, we offer our opinion on the state of the industry (what most people are doing) versus the state of the art (the best-of-breed technology).
Cost. Different vendors use very different pricing structures. GTESS, for example, sells its HCFA processing system as a service where the user pays by the page. Most of the labor involved in reading the forms is handled by GTESS at its forms factory in Texas. Other vendors, such as Dakota Imaging, sell systems on a lease or per-click basis, permitting users to use their operating savings to finance the purchase of the system. Still others, such as Microsystems Technology, sell only through resellers, so end-user prices, as well as service levels, vary depending on which reseller you're dealing with. For the chart in this article we settled on using the cost or a cost range expressing the typical upper and lower limits of system pricing.
Customizability/scalability. This is a measure of the functional range of the product. To what degree can the workflow, recognition processes, data entry and export functions be tailored for individual applications? This measure can be applied to the vendor as well as the product. Accra from NCS, for example, is an OEM version of OCR for Forms from Microsystems Technology. While it's the same base product, NCS sells it directly, targets high-volume applications and typically includes integration and customization fees as part of a higher price. Accordingly, our table lists Accra as a more customizable solution than OCR for Forms.
Scalability gives you an indication of the range of document and/or data capture volumes you can handle in a cost-effective way. Less scalable products are either priced too high to be competitive for low-volume applications or are not able to handle high volumes. You should be sure that your product has the potential to scale to all of the applications you plan to implement. For example, it can be very expensive to implement one product for order entry and then find you need a different product when you later add invoice processing for accounts payable.
Form creation. Some types of forms you can create and control yourself, such as surveys, warranties and order entry forms. Others, such as medical claims, invoices and tax returns, you can't design yourself. Most of these products don't provide form creation, relying instead on third-party form definition products to do the job for them. This makes sense if the design of forms is not under the control of the company entering data from them. If you are able to design your own forms, Cardiff Software, MTI and NCS all have highly effective and useful form definition modules. Cardiff's recently released PDF+Forms module permits creation of forms in PDF, paper or HTML format. It extends the PDF format to include not only the definition of the appearance of the form but also the instructions regarding how to read data that has been entered into it electronically on Web sites. This expands on Acrobat forms that feature simple JavaScript validations.
MTI also permits definition of both electronic and paper forms simultaneously and in the same formats, using Java to perform the online validations. Teleform Standard, the fax-oriented low-end product in the Cardiff lineup, requires Teleform-designed forms in order to operate.
Form ID and automated sorting. The best products are able to sort documents into both overall classes, such as form type, and into subclasses, such as the various variants of HCFA 1500 health claim forms. Captiva Formware specializes in recognition of subtle differences: its neural network form ID feature can be trained to understand small differences between forms. Look for identification routines that don't require OCR of a certain field in order to sort; the OCR process introduces a margin for error and can slow down the sorting. Instead, look for form identification that uses the overall form appearance and geometry. These routines look for identifying characteristics such as line junctions, logo placement or the length of line segments. They are generally very fast and are becoming so good that OCR-based sorting is rarely needed.
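To make the idea of geometry-based identification concrete, here is a minimal sketch in Python. It is not any vendor's method (and certainly not a neural network); it simply assumes each form has already been reduced to four hypothetical features scaled to the range 0 to 1 (line-junction count, logo x position, logo y position and total line-segment length) and picks the closest known template. The template names, feature values and distance cutoff are all invented for illustration.

```python
import math

# Hypothetical templates: (junction count, logo x, logo y, total line length),
# each value assumed to be pre-scaled to the range 0..1.
TEMPLATES = {
    "claim_v1": (0.40, 0.10, 0.05, 0.62),
    "claim_v2": (0.40, 0.10, 0.05, 0.70),
    "invoice": (0.15, 0.80, 0.05, 0.30),
}


def identify_form(features, templates=TEMPLATES, max_distance=0.15):
    """Return the name of the nearest template, or None if nothing is close."""
    best_name, best_distance = None, float("inf")
    for name, reference in templates.items():
        distance = math.dist(features, reference)  # Euclidean distance
        if distance < best_distance:
            best_name, best_distance = name, distance
    # Reject pages that do not resemble any known form closely enough.
    return best_name if best_distance <= max_distance else None


print(identify_form((0.41, 0.11, 0.05, 0.69)))  # close to claim_v2
print(identify_form((0.90, 0.90, 0.90, 0.90)))  # unlike any template
```

Note that the two claim variants differ only in the last feature, which is the kind of subtle difference a subclass sort has to pick up. No OCR is involved at any point, which is why routines of this type can be fast.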
Products from Mitek and Ceresoft are able to accurately identify a form type such as an invoice by utilizing fuzzy logic to look at the overall structure and from there find the relevant fields regardless of their locations.
ReadSoft and Mitek both have specialized invoice processing technology that automatically groups invoices and learns new invoice types as they arrive. IBM's IFP and Mitek's Doctus, along with other products, offer users the ability to select form geometry in general, the geometry of a selected portion of the image or OCR of a selected portion of an image for form identification.
Kofax's Ascent Capture 3.0 is efficient at identifying and sorting forms. A number of companies use Kofax's form ID technology, which is part of their Image Controls toolkit.
Key from image. If your application requires hand entry of more than 100 characters per form, if you process thousands of forms per day and if you handle complex forms that are not easy to read automatically, the key from image module is probably more important to you than any other portion of the system. A fast key from image module lets operators work at a sustained throughput of 12,000 to 15,000 characters per hour. Leaders in raw key entry speed include Recognition Research, Captiva, Scan-Optics and Viking Data. Most of these products are built on key from paper software technology. This is an advantage for them because they are able to leverage years of experience in data collection validation routines that reduce the number of keystrokes required to fill out a form.
Viking has an installation processing 2.5 million tax forms a year for the State of West Virginia. These tax forms include a taxpayer ID number followed by a check digit field for that same number. By building a validation edit to compare the two, West Virginia was able to avoid double-keying both fields to ensure accuracy, saving 12 keystrokes per form. At $1.72 per 1,000 characters (the U.S. national average, according to the Data Entry Management Association) and 2.5 million forms, the dollar savings of this edit feature are about $43,000 per year.
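A validation edit of this kind, and the savings arithmetic behind it, can be sketched in a few lines of Python. The source does not say which check-digit scheme West Virginia uses, so the sketch assumes the common Luhn (mod 10) scheme; the function names are illustrative and the sample ID is a standard Luhn test number, not a real taxpayer ID.

```python
def luhn_check_digit(payload: str) -> int:
    """Compute a Luhn (mod 10) check digit for a string of digits."""
    total = 0
    # Walk right to left, doubling every second digit starting with the rightmost.
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def id_passes_edit(taxpayer_id: str, keyed_check_digit: str) -> bool:
    """Validation edit: accept the ID only if it agrees with its check digit."""
    return taxpayer_id.isdigit() and keyed_check_digit == str(luhn_check_digit(taxpayer_id))


def annual_keying_savings(forms_per_year, keystrokes_saved, cost_per_thousand):
    """Dollar value of the keystrokes the edit makes unnecessary each year."""
    return forms_per_year * keystrokes_saved / 1000 * cost_per_thousand


print(id_passes_edit("7992739871", "3"))  # a correctly keyed ID
print(id_passes_edit("7992739817", "3"))  # last two digits transposed
print(f"${annual_keying_savings(2_500_000, 12, 1.72):,.0f}")
print(f"${annual_keying_savings(2_500_000, 10, 1.72):,.0f}")
```

With the figures exactly as stated (12 keystrokes, 2.5 million forms, $1.72 per 1,000 characters), the formula gives roughly $51,600 a year; the quoted figure of about $43,000 corresponds to 10 keystrokes saved per form. Either way, a single edit pays for itself many times over at this volume.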
Most of the forms processing software products discussed in this article have some data validation routines. If you are already using an installed key-from-image system and are considering conversion to image-based forms processing technology, you should check with your data entry software supplier to see if they have an upgrade for you before you make the investment required to switch vendors.
The newest option for presenting data to an operator to be verified is one that sorts characters by type and then presents a full screen of character images of a single type to the operator. For example, an entire screen might be filled with images of the letter "d." With a single glance, the operator can verify that all of the images are indeed "d"s, and with a single keystroke they can go to the next screen. Characters that cannot easily be discerned, such as zeroes and "O"s, are normally sorted separately by numeric and alpha fields. This mode can offer real improvement in the possible rate of manual verification and a real cost savings for high-volume applications. Mitek and IBM have this feature.
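The grouping logic behind this mode is simple enough to sketch in Python. This is an illustration only, not Mitek's or IBM's implementation: it assumes the recognition step has already produced a list of (snippet ID, field type, recognized character) records, and the names and screen size are hypothetical.

```python
from collections import defaultdict


def build_verification_screens(recognized, per_screen=4):
    """Group character snippets so each screen shows a single character type.

    recognized: iterable of (snippet_id, field_type, char) tuples.
    Returns a list of ((field_type, char), [snippet_ids]) screens.
    """
    groups = defaultdict(list)
    for snippet_id, field_type, char in recognized:
        # Keying on the field type keeps numeric zeroes apart from alpha "O"s.
        groups[(field_type, char)].append(snippet_id)

    screens = []
    for key in sorted(groups):
        ids = groups[key]
        # Split each group into screen-sized pages.
        for start in range(0, len(ids), per_screen):
            screens.append((key, ids[start:start + per_screen]))
    return screens


sample = [
    ("s1", "alpha", "d"),
    ("s2", "numeric", "0"),
    ("s3", "alpha", "O"),
    ("s4", "alpha", "d"),
    ("s5", "alpha", "d"),
]
for key, ids in build_verification_screens(sample, per_screen=2):
    print(key, ids)
```

The operator then confirms a whole screen with one keystroke and flags only the odd snippet out, instead of reading every field in form order.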
Remote scanning, data entry and validation. Remote forms processing offers a few advantages. Remote scanning can reduce operating costs by reducing the amount of handling and shipping required before documents are captured. Sending the images to a central processing site and getting a fast response can allow for quick decision-making.
An example is car dealerships that fax completed loan applications to finance companies, which process the application automatically and either accept or reject the loan within minutes. Cardiff specializes in this type of application. Other examples include sales offices sending in orders, insurance agents sending new customer applications or policies, and independent stock brokerages submitting new account paperwork to their clearing houses.
Kofax has implemented remote scanning, image QA and validation through use of remote systems that can be synchronized periodically with the home site using the Internet as the communication platform. The synchronization moves all captured images and data to the home system and also permits updating of any system administration items such as form or workflow definitions. It also moves all operator statistics to the home system and permits close central control of the remote location.
Remote key entry and validation save operating costs by permitting lower-cost labor to perform the most time-intensive function of the system. Large users can set up data entry departments in areas of the world with low labor costs, such as Jamaica or India, and then supply them with images for remote validation and data entry. A less grand, but still useful, application is to permit key from image operators to telecommute from their homes rather than coming to the office every day.
OCR for Forms is perhaps the most sophisticated at this time in supporting telecommuting and other forms of remote verification and key entry. Using the Internet as a communications platform, the software downloads and pre-loads images that are queued for processing at home, maximizing the bandwidth available. This online setup means that validation and other processing functions are working on the data as the user is entering it. The software permits the user to disconnect after a batch has been downloaded, which can free up the (perhaps only) telephone line in the household, cut telephone costs and free up data lines at the host. When the user reconnects to the central system, OCR for Forms automatically updates any new data collected without re-transmitting the images that were already downloaded.
System administration and statistics. Administrative statistics are critical to effective management of your system. If the manager doesn't know exactly what is happening, how can he/she know how to improve it? Statistics should include measurements of queue loading. This helps the manager spot performance bottlenecks caused by inefficient operation or, perhaps, inadequate hardware power in the system. They also should detail the performance of all modules of the system, including form identification, image QA, automated data recognition steps, data QA and data export. This information reveals what needs tuning and what the potential gain from tuning would be.
Finally, the system should collect complete statistics on each system user and their performance. This is particularly important in large systems where it can indicate operators who need training or error patterns that could be fixed through customization.
To date, none of the products studied offer a browser-based system administration module, though several vendors mentioned that one is under development. Cardiff Software has packaged their administration function as a native snap-in module for the Microsoft Management Console, an emerging standard many IT groups are already familiar with and using. IFPS, Cartouche and Formworks provide the greatest depth of information on the performance of the recognition modules in each field, permitting more precise recognition tuning than other products. Formware, Formworks and VDE+ Images represent the state of the art in KFI statistics. The system administration module that will prove most useful depends on the nature and focus of your application.
Multiple output streams. Multiple output streams permit the user to export the data and images to multiple destinations. This is valuable if the information being captured is being used to drive work processes. For example, invoice data entry is a driver for accounts payable. In this application, the data is normally exported to an accounting application, such as SAP, while the images and perhaps another copy of the data may be exported to a document management application, such as FileNET's Panagon, Optika eMedia or IBM VisualInfo. OCR for Forms has an excellent export module that enables all of this functionality with an extremely intuitive, easy-to-use interface.
Color capture support. Color scanners are rapidly being introduced to the market. The software vendors are scrambling to support color and have succeeded to varying extents. The only vendor with full end-to-end color support, including color OCR, is ComCom Systems. Their recently released ELT Color provides color character recognition. ReadSoft announced at AIIM that their next version of Eyes and Hands would have full color support. MTI and Kofax have color support that permits users to capture color documents and use the color for document sorting, identification and thumbnail user interfaces.
If you are buying a color-capable system, be aware that right now most products support specific color scanners only. You must purchase scanners, scanner interfaces and application software that are all compatible.
OCR of color images is a challenge because the major OCR vendors haven't made their recognition engines color-capable yet. You have to threshold the color images to monochrome and then submit them to the OCR engines. Most color scanning is done at a lower resolution than the 200 dpi minimum expected by most OCR products. Therefore, the OCR has trouble yielding acceptable results on these images. One solution is to interpolate lower-resolution color images to higher resolution monochrome equivalents and then OCR them. SpectrumFix from TMS/Sequoia (www.tmssequoia.com) is the first released software product to eliminate color backgrounds. Their ScanFix allows for interpolation of 150 to 300 dpi. Look for interpolation to become common by the end of 1999, with a transition to native color OCR by the end of the year 2000.
Recognition accuracy. In general, accuracy is one of those items where you get what you pay for. The recognition accuracy of simpler, less expensive products such as Kofax Ascent 3.0, Cardiff Teleform and Captiva Genesis can't be compared to that offered by Recognition Research, RAF or TIS. The latter offer extensive tailoring capabilities that enable them to achieve greater recognition accuracy than competitors, but you have to pay for customization. While these high-end vendors may be highly competitive for high-volume applications, they are much less likely to be serious contenders for smaller systems where the productivity gains from small improvements in OCR accuracy cannot be turned into big hard-dollar savings.
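The interpolate-then-threshold workaround can be illustrated with a short, dependency-free Python sketch. It does not represent how SpectrumFix, ScanFix or any other product works; it assumes an image is a list of rows of (R, G, B) tuples, uses plain bilinear interpolation to double the resolution (150 to 300 dpi, say) and applies a fixed cutoff of 128, all of which are simplifying assumptions.

```python
def to_gray(image):
    """Convert rows of (R, G, B) tuples to gray levels using BT.601 luma weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in image]


def upscale_bilinear(gray, factor=2):
    """Enlarge a grayscale grid by an integer factor with bilinear interpolation."""
    height, width = len(gray), len(gray[0])
    result = []
    for y in range(height * factor):
        sy = min(y / factor, height - 1)  # source row, clamped at the edge
        y0 = int(sy)
        y1 = min(y0 + 1, height - 1)
        fy = sy - y0
        row = []
        for x in range(width * factor):
            sx = min(x / factor, width - 1)  # source column, clamped at the edge
            x0 = int(sx)
            x1 = min(x0 + 1, width - 1)
            fx = sx - x0
            top = gray[y0][x0] * (1 - fx) + gray[y0][x1] * fx
            bottom = gray[y1][x0] * (1 - fx) + gray[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        result.append(row)
    return result


def threshold(gray, cutoff=128):
    """Return a monochrome grid: 1 for ink (dark pixels), 0 for background."""
    return [[1 if value < cutoff else 0 for value in row] for row in gray]


# Tiny sample: pale yellow background on the left, dark ink on the right.
color_image = [
    [(255, 255, 200), (20, 20, 20)],
    [(255, 255, 200), (20, 20, 20)],
]
mono = threshold(upscale_bilinear(to_gray(color_image), factor=2))
for row in mono:
    print(row)
```

Interpolating before thresholding matters: the gray levels at the edges of strokes carry information about where the stroke really ends, and that information is lost once the image has been reduced to black and white.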
Your challenge is to determine the best combination of price and recognition technology for your application. Let's say your system enters 500 invoices per day and you have to enter an average of 50 characters per form (25,000 characters per day). Your labor cost for a key entry or key from image system would be about $43 per day ($1.72 per thousand x 25). If you spend $30,000 on automatic recognition and succeed in reducing entry labor by 75% (a good implementation), your savings per year of $7,095 (($43 x 220 days) x 75%) are not high enough to justify the expense. You would be better off relying on key from image or a simple recognition system that reduces the data-entry load by 40% but costs only a few thousand dollars. If your volumes and labor cost are higher, then it will be easier to cost-justify a high-end system.
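You can run the same back-of-the-envelope calculation for your own volumes with a few lines of Python. The sketch below reproduces the figures from the example; the function names are illustrative, and the $3,000 price for the simple system is an assumed stand-in for "a few thousand dollars."

```python
def daily_keying_cost(forms_per_day, chars_per_form, cost_per_thousand):
    """Labor cost per day to key every character by hand."""
    return forms_per_day * chars_per_form / 1000 * cost_per_thousand


def annual_savings(daily_cost, working_days, labor_reduction):
    """Yearly labor savings for a given fractional reduction in keying."""
    return daily_cost * working_days * labor_reduction


def payback_years(system_price, yearly_savings):
    """Years of savings needed to recover the purchase price."""
    return system_price / yearly_savings


daily = daily_keying_cost(500, 50, 1.72)        # about $43 per day
high_end = annual_savings(daily, 220, 0.75)     # about $7,095 per year
simple = annual_savings(daily, 220, 0.40)       # about $3,784 per year

print(f"High-end system: {payback_years(30_000, high_end):.1f} years to pay back")
# Assumed price of $3,000 for the simple system.
print(f"Simple system:   {payback_years(3_000, simple):.1f} years to pay back")
```

At these volumes the $30,000 system needs more than four years of savings to pay for itself, while the assumed $3,000 system recovers its cost in under a year. Multiply the daily volume by ten and the high-end picture changes completely.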
All vendors are working to reduce their customization time by adding more power to the software and improving the user interfaces. Top Image Systems (TiS), in particular, has shown great progress in this direction. Its fully graphical workflow design capability greatly reduces setup time. Its SuperOCR feature begins to automate the tuning process by measuring the accuracy of the many recognition algorithms in the product, and then weighting each algorithm on a field-by-field basis, customizing the voting system for each field automatically and in real time.
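The general idea of per-field weighted voting can be sketched in Python as follows. This is not SuperOCR's actual algorithm, which the vendor has not published here; it is a minimal illustration that assumes each recognition engine returns one candidate value per field and that operator-verified results are fed back to update the weights. The class, engine and field names are hypothetical.

```python
from collections import defaultdict


class FieldVoter:
    """Weights each recognition engine by its measured accuracy on each field."""

    def __init__(self):
        self.hits = defaultdict(int)    # (field, engine) -> correct results
        self.total = defaultdict(int)   # (field, engine) -> results seen

    def weight(self, field, engine):
        key = (field, engine)
        # Smoothed accuracy, so an engine with no history starts at 0.5.
        return (self.hits[key] + 1) / (self.total[key] + 2)

    def record(self, field, engine_outputs, verified_value):
        """Update the statistics once an operator has verified the field."""
        for engine, value in engine_outputs.items():
            key = (field, engine)
            self.total[key] += 1
            if value == verified_value:
                self.hits[key] += 1

    def vote(self, field, engine_outputs):
        """Return the candidate value with the highest total weight."""
        scores = defaultdict(float)
        for engine, value in engine_outputs.items():
            scores[value] += self.weight(field, engine)
        return max(scores, key=scores.get)


voter = FieldVoter()
# Three verified samples in which only engine A read the amount field correctly.
for _ in range(3):
    voter.record("amount", {"A": "120.00", "B": "128.00", "C": "126.00"}, "120.00")

# A is now trusted on this field even when B and C agree with each other.
print(voter.vote("amount", {"A": "75.50", "B": "76.50", "C": "76.50"}))
# On a field with no history, a simple majority still wins.
print(voter.vote("date", {"A": "01/02/99", "B": "07/02/99", "C": "07/02/99"}))
```

Because the weights are kept per field, an engine that is strong on numeric amounts but weak on handprinted names is trusted only where it has earned it.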
All-star Players
Microsystems Technology's OCR for Forms gets our Editors' Choice for mid-volume applications. It has excellent recognition accuracy supported by good validation rules and image pre-processing. It also has leading-edge support for remote data entry and electronic forms.
At the high end, two products get our nod. We recommend TIS's AFPS Pro for applications where you need advanced workflow and mostly automated character recognition. This Israeli company has truly advanced technology, though it has only recently established a North American office.
Captiva's FormWare excels at applications where you need strong key-from-image capability. The company has a well-established support infrastructure and offers VBA libraries for rapid integration and customization.
The Editors of Imaging & Document Solutions