Intelligent Enterprise featuring Transform
November 2002

Document & Data Capture Software Guide

by Doug Henschen

Developments such as Web-based self-service, enterprise application integration and inter-enterprise supply chain automation may be on the rise, but organizations have only begun to eliminate paper-based information from their day-to-day operations. Against this backdrop, better, faster and more affordable document imaging technologies are making it ever easier to transform documents into easily accessible electronic content.

Document imaging is the technology of choice for tackling mission-critical transactional challenges such as expediting account or loan applications, fulfilling orders, processing claims, reconciling accounts payable and supporting customer service. In many cases, regulations or best practices all but demand imaging to ensure that records remain accessible and preserved, whether they relate to pharmaceutical trials, medical histories, financial transactions, land deals or, as underscored in recent headlines, corporate accounting.

Even casual ad hoc document imaging applications deliver tremendous value. Product design teams, architects, engineers, lawyers, government agencies and law enforcement authorities are among those who now instantly share electronic images rather than wait for paper originals or copies (that is, if the originals can be found).

Capturing Documents

Those scanning large volumes of documents generally want either to capture the document or to capture the data off the document. Sometimes they want to do both. Document capture is about scanning, indexing and storing document images in a back-end document/content management system from which they can be retrieved. Increasingly, management systems are integrated enterprisewide, so you can pull up images (and other content) through portals, self-service Web sites, ERP systems, accounting systems and CRM systems as well as from the document/content management system itself.

The goal of document capture is to be able to access important records in an instant, and the means of retrieving is the index information (or image metadata). The depth of indexing varies from just a few fields (name, date and account number, for example) to a full-text index of everything on the page. Indexing can be done at the batch or document level, and the index values can be entered manually (with "key-from-image" entry) or with the assistance of recognition technologies such as barcodes, optical character recognition (for machine print) and intelligent character recognition (for hand print). The higher the volumes and the more consistent the documents, the more it makes sense to employ recognition technologies. Depending upon the value and importance of finding specific images, you can add human oversight steps to verify the accuracy of index values extracted via OCR or ICR.
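To make the human oversight step concrete, here is a minimal sketch of how index values might be routed for verification. It assumes a recognition engine that reports a confidence score between 0 and 1 for each field; the IndexField class, the 0.90 threshold and the sample values are all hypothetical and not drawn from any particular product.

```python
from dataclasses import dataclass


@dataclass
class IndexField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical recognition engine


REVIEW_THRESHOLD = 0.90  # assumed cutoff; tune it to the value of finding the image


def fields_needing_review(fields, threshold=REVIEW_THRESHOLD):
    """Return the names of index fields whose confidence falls below the threshold."""
    return [f.name for f in fields if f.confidence < threshold]


# Illustrative index values for one scanned document.
fields = [
    IndexField("name", "J. SMITH", 0.98),
    IndexField("date", "2002-11-04", 0.95),
    IndexField("account_number", "10O4-7731", 0.62),
]
print(fields_needing_review(fields))
```

Raising the threshold sends more fields to an operator; lowering it trades accuracy for throughput.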

Capturing Data

Data capture, as it applies in document imaging, is all about extracting crucial data from document images. In many cases the images can be deleted once the relevant data is validated. A survey or a club membership application, for example, carries crucial data, but there may be little value in storing an image of the original document. An insurance claim, on the other hand, would likely be captured both as a document image and as the discrete data from that image.

Also known as forms processing, data capture typically expedites transactional processes, speeding the handling of thousands, tens of thousands or even hundreds of thousands of claims, applications, tax returns or other forms per day. Data capture tends to be much more demanding than document indexing, with 10, 20 or even scores of individual data points that must be gathered and verified from each form.

Automated data capture processes employ recognition engines and validation techniques to read and verify the data fields. As in document capture, recognition technologies offer higher returns as volumes increase. Data capture systems incorporate features such as voting between multiple OCR engines for greater accuracy and ICR engines capable of reading unconstrained hand print. Top-end systems are replete with data validation methods, business rules and database lookup options.
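Voting between engines can be illustrated with a simplified sketch. It assumes each engine returns a string of the same length for a field and takes a strict majority at each character position; commercial systems use their own, more elaborate schemes, and the vote function here is purely illustrative.

```python
from collections import Counter


def vote(readings):
    """Majority vote per character position.

    Returns the voted text and the positions where no character won a
    strict majority (those positions are marked with '?').
    """
    if len({len(r) for r in readings}) != 1:
        raise ValueError("this simplified sketch requires equal-length readings")
    voted = []
    uncertain = []
    for position, chars in enumerate(zip(*readings)):
        char, count = Counter(chars).most_common(1)[0]
        if count * 2 > len(readings):
            voted.append(char)
        else:
            voted.append("?")
            uncertain.append(position)
    return "".join(voted), uncertain


# Three hypothetical engines read the same account-number field.
text, uncertain = vote(["10047731", "1OO47731", "10047781"])
print(text, uncertain)
```

Positions without a majority are natural candidates for the operator correction described in the next section.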

Depending on volume, form complexity and form consistency, sometimes it is faster and more cost-effective to enter data manually. To support high key-entry rates, most data capture systems offer aids to key-from-image entry. Page and field views, for example, present the full image or enlarged image zones, respectively, next to the entry field. In double key entry, a second operator re-enters data that has already been entered by another operator; the two values are compared to ensure accuracy. These techniques can also be used to edit and verify OCR and ICR results, and some systems will display all the low-confidence results at the word or character level for correction.
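The comparison at the heart of double key entry is simple to sketch. The function below assumes each operator's pass is stored as a dictionary of field names to entered values; the field names and values are hypothetical.

```python
def compare_double_key(first_pass, second_pass):
    """Return the field names where two operators' entries disagree."""
    mismatches = []
    for field in sorted(first_pass.keys() | second_pass.keys()):
        first = first_pass.get(field, "").strip()
        second = second_pass.get(field, "").strip()
        if first != second:
            mismatches.append(field)
    return mismatches


operator_a = {"name": "Smith", "zip": "10001", "amount": "249.50"}
operator_b = {"name": "Smith", "zip": "10007", "amount": "249.50"}
print(compare_double_key(operator_a, operator_b))
```

Only the mismatched fields need to go back to an operator, which keeps rework small when the two passes largely agree.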

Today's most sophisticated data capture systems can deliver data to and interact with a range of enterprise applications, ensuring accuracy and initiating business processes. For example, a system might capture a vendor number from an invoice using OCR, validate it against a database and post the complete transaction data to ERP so a check can be issued. At the same time the image could be exported to a content management system through which the supplier could look up the record online. If the supplier reports that the payment wasn't right, employees in accounting could access the image to help resolve the conflict.
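The invoice example can be sketched as a database lookup followed by a routing decision. This is an illustration only: the in-memory SQLite table stands in for a vendor master file, and the table name, column names, status labels and sample vendor are all assumptions. Posting to an actual ERP system would go through that system's own interface, which is not shown.

```python
import sqlite3


def build_vendor_db():
    """Create an in-memory stand-in for a vendor master table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vendors (vendor_number TEXT PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO vendors VALUES ('V-1001', 'Acme Supply')")
    conn.commit()
    return conn


def validate_invoice(conn, captured):
    """Check the captured vendor number against the database.

    Known vendors are marked ready to post; unknown ones are routed to an
    exception queue for a person to resolve.
    """
    row = conn.execute(
        "SELECT name FROM vendors WHERE vendor_number = ?",
        (captured["vendor_number"],),
    ).fetchone()
    if row is None:
        return {"status": "exception", "reason": "unknown vendor", "data": captured}
    return {"status": "ready_to_post", "vendor_name": row[0], "data": captured}


conn = build_vendor_db()
good = validate_invoice(conn, {"vendor_number": "V-1001", "total": "249.50"})
bad = validate_invoice(conn, {"vendor_number": "V-l00l", "total": "249.50"})
print(good["status"], bad["status"])
```

The second call mimics a typical OCR confusion (lowercase "l" read in place of "1"), which the lookup catches before any payment is issued.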

As in document capture, automation is easier if your forms are consistent, but several data capture systems are now able to handle variable forms. Invoices, for example, all typically contain the same data points — vendor names, invoice/P.O. numbers, SKU numbers, quantities, subtotals and totals — but form layout varies from vendor to vendor. Unstructured forms processing technologies can be trained to recognize such documents and extract the key data points. The technology isn't perfect, but if you can automate 80 percent of the data entry on 100,000 documents, you could save money and speed processes that were previously painfully slow.
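The idea of finding the same data points in different layouts can be illustrated with a far simpler stand-in for trained recognition: searching the OCR text for field labels and taking the value that follows. The patterns, field names and sample invoices below are hypothetical, and this label-anchored approach is not the method any particular vendor uses.

```python
import re

# Hypothetical label patterns; real invoices would need many more variants.
LABEL_PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*(\S+)", re.IGNORECASE
    ),
    "po_number": re.compile(
        r"p\.?o\.?\s*(?:no\.?|number|#)\s*[:\-]?\s*(\S+)", re.IGNORECASE
    ),
    # The lookbehind keeps "Subtotal" from being mistaken for the total.
    "total": re.compile(
        r"(?<!sub)total\s*(?:due)?\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
}


def extract(text):
    """Return the first value found for each labeled field, or None if absent."""
    found = {}
    for field, pattern in LABEL_PATTERNS.items():
        match = pattern.search(text)
        found[field] = match.group(1) if match else None
    return found


layout_a = "ACME SUPPLY\nInvoice No: 4471\nP.O. Number: 8820\nSubtotal: $230.00\nTotal: $249.50"
layout_b = "Globex Corp   INVOICE # 99-031   PO# 5512\nqty 3   TOTAL DUE 1,204.00"
print(extract(layout_a))
print(extract(layout_b))
```

Fields that come back as None are the ones that fall outside the automated share and go to manual entry.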

While efficiencies have greatly improved, data capture from images can still be a costly, time-consuming process. It's no surprise, then, that many organizations are turning to electronic forms that can be published and accessed online. Recognizing the advantages and appeal of e-forms, many forms processing vendors have added options or built-in tools for gathering data electronically. One advantage of this integrated approach is that data from any source can be validated and exported through a single workflow. Stand-alone e-forms systems are also available, with Web- and PDF-based electronic forms being two popular means of gathering data online.

How to Use This Guide

Nearly every content management vendor that addresses document imaging offers a built-in or optional document capture front end. Depending upon your application, this tool might be more than sufficient for your needs. This product guide addresses only third-party products that are sold apart from management systems. With few exceptions, these third-party tools tend to be more robust or specialized in some way. For example, third-party products might offer more sophisticated recognition technologies, higher-volume processing, specialized PDF conversion features or more options for exporting and routing data and/or images. In addition, while you can use simple document capture tools as a starting point for manual data entry, few content management vendors would claim to offer robust data capture (forms processing) capabilities.

Capture starts with scanning. Thankfully, the latest scanners generally offer great image quality. Many document capture products and all data capture products still incorporate image processing features such as despeckle, deskew, background removal, text enhancement, and so forth, but such features have been largely commoditized. As a result, this table doesn't address image processing features.

Most of the categories in the table on pages 36 and 37 are self-explanatory. Some products are exclusively aimed at document capture, some at data capture, but it is not uncommon for forms processing products to handle both tasks. If a product claims to handle both challenges, make sure it has all the features you'll need. For example, many document capture systems offer full-text OCR and most have batch export capabilities that will place images and indexes in your back-end management system; some data-capture systems don't include these features. On the other hand, many document capture systems have built-in recognition engines, yet they don't have robust validation and data export features required for forms processing.

Look beyond what you need to do today when considering features such as input and output formats. Will you want to bring electronic documents and e-mails into the same capture stream alongside bitonal images and faxes? Will you want metatags or captured data to be transformed into XML? Are you likely to upgrade databases, and is there a plan to roll out an ERP system? Are prebuilt integrations or export modules available for the technology you already have in place?
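As a rough picture of what transforming captured data into XML involves, the sketch below serializes a set of index values with Python's standard library. The element names, attribute names and document identifier are invented for illustration; an actual export would follow whatever schema the receiving system expects.

```python
import xml.etree.ElementTree as ET


def index_to_xml(doc_id, fields):
    """Serialize captured index values as a small XML document."""
    root = ET.Element("document", attrib={"id": doc_id})
    for name, value in fields.items():
        field = ET.SubElement(root, "field", attrib={"name": name})
        field.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = index_to_xml("inv-0001", {"vendor_number": "V-1001", "total": "249.50"})
print(xml_text)
```

Because the output is plain XML, the same record can be handed to a content management system, a database loader or an ERP import routine without re-keying.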

Also consider the possibility of bringing forms online. Some data capture systems support electronic forms and documents as an input source, while others also have modules for publishing electronic forms. The table on page 39 provides detailed coverage of stand-alone e-forms processing systems and tools.

Consider Distributed Capture

One contentious issue among capture vendors is the question of support for distributed scanning and validation. Nobody questions whether distributed approaches make sense. By capturing documents at their source and sending images electronically, you can save time, improve service and expedite financial transactions involving millions, if not billions, in revenue. Once information is captured electronically, you can save again by sending high-volume data entry and validation work to low-cost labor markets.

The real question is, how do you support a distributed architecture? Virtually every product listed here can support distributed use if you provide a WAN or virtual private network (VPN) with all the required connections. On the other hand, some vendors have developed combinations of Internet-server-based systems and thin clients or browser plug-ins designed to simplify matters. In most cases, these features are designed to let untrained business users choose a document type from a drop-down menu, and the system will then automatically initiate the proper scanner settings, workflow steps and indexing/data capture processes. Similarly, some data capture vendors have created modules that harness the Internet to support remote validation. Because everything is administered and deployed from the central system, you have complete control over the workflow and you don't have to worry as much about remote software installs and support issues.

While thin-client/browser-based systems can offer advantages in terms of ease of use, ease of deployment and flexibility, they are generally slower and less capable of handling production volumes and speeds than conventional thick-client software. If supplying network connections and supporting software at all of your distributed locations isn't a big deal, thick clients may be a better choice. Another option some vendors choose is to supply Citrix or other terminal servers that let workers access a central system from a remote location.

Look for the Vendor's Specialties and Strengths

If your needs are routine, you might find more than enough capabilities in a bundled or low-cost capture system provided by your content management vendor. For many of the third-party products included in this chart, specialization is the name of the game. And who wouldn't want to consult a specialist? If your company processes medical claims, by all means choose a vendor that has claims-specialized software and lots of experience handling claims. Look for relevant industry case studies online and ask for contact information for reference customers. It's helpful to try before you buy. Some companies offer trial software, or, if you're contemplating an enterprisewide deployment, you can start with a pilot project or departmental rollout before you sign a big contract.

Our product guide is a starting point rather than the final litmus test for your selection, but it's designed to help narrow the search to the candidates that best fit your needs.




A Publication of the Network Computing Enterprise Architecture Group
Brought to you by CMP Media LLC, Copyright © 2005