November 2002
Document & Data Capture Software Guide
by Doug Henschen
Developments such as Web-based self-service, enterprise application integration and
inter-enterprise supply chain automation may be on the rise, but organizations have only begun to
eliminate paper-based information from their day-to-day operations. Against this backdrop, better,
faster and more affordable document imaging technologies are making it ever easier to transform
documents into easily accessible electronic content.
Document imaging is the technology of choice for tackling mission-critical transactional
challenges such as expediting account or loan applications, fulfilling orders, processing claims,
reconciling accounts payable and supporting customer service. In many cases, regulations and/or
best practices all but demand imaging to ensure accessibility and/or preservation of records related
to pharmaceutical trials, medical histories, financial transactions, land deals and, as underscored
in recent headlines, corporate accounting.
Even casual ad hoc document imaging applications deliver tremendous value. Product design teams,
architects, engineers, lawyers, government agencies and law enforcement authorities are among those
who now instantly share electronic images rather than wait for paper originals or copies (that is,
if the originals can be found).
Capturing Documents
Those scanning large volumes of documents generally want to either capture the document or
capture the data off the document. Sometimes they want to do both. Document capture is about
scanning, indexing and storing document images to a back-end document/content management system
from which they can be retrieved. Increasingly, management systems are integrated enterprisewide, so
you can pull up images (and other content) through portals, self-service Web sites, ERP systems,
accounting systems and CRM systems as well as from the document/content management system
itself.
The goal of document capture is to be able to access important records in an instant, and the
means of retrieving is the index information (or image metadata). The depth of indexing varies from
just a few fields (name, date and account number, for example) to a full-text index of everything on
the page. Indexing can be done at the batch or document level, and the index values can be entered
manually (with "key-from-image" entry) or with the assistance of recognition technologies such as
barcodes, optical character recognition (for machine print) and intelligent character recognition
(for hand print). The higher the volumes and the more consistent the documents, the more it makes
sense to employ recognition technologies. Depending upon the value and importance of finding
specific images, you can add human oversight steps to verify the accuracy of index values extracted
via OCR or ICR.
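The human oversight step described above can be sketched as a simple confidence check. This is a minimal illustration, not any vendor's method: the IndexField structure, the 0.90 threshold and the sample values are all hypothetical, and it assumes the recognition engine reports a confidence score between 0 and 1 for each field.

```python
from dataclasses import dataclass


@dataclass
class IndexField:
    name: str
    value: str
    confidence: float  # hypothetical 0.0-1.0 score reported by an OCR/ICR engine


def fields_needing_review(fields, threshold=0.90):
    """Return the names of index fields whose confidence is below the threshold."""
    return [f.name for f in fields if f.confidence < threshold]


# Made-up index values for one scanned document.
document_index = [
    IndexField("name", "J. Smith", 0.98),
    IndexField("date", "2002-11-04", 0.95),
    IndexField("account_number", "40017B3", 0.62),
]

# Only the low-confidence field is routed to a human verifier.
print(fields_needing_review(document_index))
```

Raising the threshold sends more fields to human review, which is the trade-off to tune against the value of finding a specific image.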
Capturing Data
Data capture, as it applies in document imaging, is all about extracting crucial data from
document images. In many cases the images can be deleted once the relevant data is validated. A
survey or a club membership application, for example, carries crucial data, but there may be little
value in storing an image of the original document. An insurance claim, on the other hand, would
likely be captured both as a document image and as the discrete data from that image.
Also known as forms processing, data capture typically expedites transactional processes,
speeding thousands, tens of thousands or even hundreds of thousands of claims, applications, tax
returns or other forms per day. Data capture tends to be much more demanding than document indexing,
with 10, 20 or even scores of individual data points that must be gathered and verified.
Automated data capture processes employ recognition engines and validation techniques to read and
verify the data fields. As in document capture, recognition technologies offer higher returns as
volumes increase. Data capture systems incorporate features such as voting between multiple OCR
engines for greater accuracy and ICR engines capable of reading unconstrained hand print. Top-end
systems are replete with data validation methods, business rules and database lookup options.
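Voting between multiple OCR engines can be pictured as a majority vote on each field. The sketch below is one plausible approach under stated assumptions, not a description of how any listed product works: it assumes each engine returns a plain string for the field, and the vote function and sample readings are hypothetical.

```python
from collections import Counter


def vote(readings):
    """Return (value, agreed); agreed is True only if a strict majority of engines match."""
    if not readings:
        raise ValueError("at least one engine reading is required")
    value, count = Counter(readings).most_common(1)[0]
    return value, count > len(readings) / 2


# Two of three hypothetical engines agree, so the value is accepted.
print(vote(["10452", "10452", "1O452"]))

# No majority: the agreed flag is False, so the field should go to an operator.
print(vote(["10452", "1O452", "l0452"])[1])
```

A production system would more likely vote at the character level and weigh each engine's confidence, but the principle is the same: agreement raises trust, and disagreement triggers review.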
Depending on volume, form complexity and form consistency, sometimes it is faster and more
cost-effective to enter data manually. To support high key-entry rates, most data capture systems
offer aids to key-from-image entry. Page and field views, for example, present the full image or
enlarged image zones, respectively, next to the entry field. In double key entry, a second operator
re-enters data that has already been entered by another operator; the two values are compared to
ensure accuracy. These techniques can also be used to edit and verify OCR and ICR results, and some
systems will display all the low-confidence results at the word or character level for
correction.
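Double key entry comes down to comparing two independent passes and flagging the fields that differ. The following is a minimal sketch, assuming each operator's entries arrive as a dictionary of field names to strings. The function name and sample data are illustrative only.

```python
def compare_entries(first_pass, second_pass):
    """Return the sorted field names where two operators' keyed values disagree."""
    mismatches = []
    for field in sorted(set(first_pass) | set(second_pass)):
        # A field missing from one pass counts as a disagreement with any non-empty value.
        a = first_pass.get(field, "").strip()
        b = second_pass.get(field, "").strip()
        if a != b:
            mismatches.append(field)
    return mismatches


operator_one = {"ssn": "123-45-6789", "zip": "10001", "amount": "250.00"}
operator_two = {"ssn": "123-45-6789", "zip": "10007", "amount": "250.00"}

# Only the disputed field needs a third look.
print(compare_entries(operator_one, operator_two))
```

The same comparison could be used with an OCR or ICR result standing in for one of the operators.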
Today's most sophisticated data capture systems can deliver data to and interact with a range of
enterprise applications, ensuring accuracy and initiating business processes. For example, a system
might capture a vendor number from an invoice using OCR, validate it against a database and post the
complete transaction data to ERP so a check can be issued. At the same time the image could be
exported to a content management system through which the supplier could look up the record online.
If the supplier reports that the payment wasn't right, employees in accounting could access the
image to help resolve the conflict.
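The invoice example can be sketched as a database lookup followed by a routing decision. Everything here is an assumption for illustration: the vendors table, the vendor numbers and the two in-memory lists standing in for an ERP posting queue and an exception queue. A real deployment would use the capture product's own export modules rather than code like this.

```python
import sqlite3


def build_vendor_db():
    """Create an in-memory vendor table with one hypothetical vendor."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vendors (vendor_number TEXT PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO vendors VALUES ('V-1001', 'Acme Supply')")
    conn.commit()
    return conn


def route_invoice(conn, invoice, erp_queue, exception_queue):
    """Validate the captured vendor number; post to ERP if known, else flag for review."""
    row = conn.execute(
        "SELECT name FROM vendors WHERE vendor_number = ?",
        (invoice["vendor_number"],),
    ).fetchone()
    if row is None:
        exception_queue.append(invoice)
        return False
    erp_queue.append({**invoice, "vendor_name": row[0]})
    return True


conn = build_vendor_db()
erp_queue, exception_queue = [], []
route_invoice(conn, {"vendor_number": "V-1001", "total": "120.50"}, erp_queue, exception_queue)
route_invoice(conn, {"vendor_number": "V-9999", "total": "75.00"}, erp_queue, exception_queue)
print(len(erp_queue), len(exception_queue))
```

Invoices with an unrecognized vendor number, often the result of a misread character, land in the exception queue for an operator instead of triggering a payment.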
As in document capture, automation is easier if your forms are consistent, but several data
capture systems are now able to handle variable forms. Invoices, for example, all typically contain
the same data points (vendor names, invoice/P.O. numbers, SKU numbers, quantities, subtotals and
totals), but form layout varies from vendor to vendor. Unstructured forms processing technologies
can be trained to recognize such documents and extract the key data points. The technology isn't
perfect, but if you can automate 80 percent of the data entry on 100,000 documents, you could save
money and speed processes that were previously painfully slow.
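The arithmetic behind that example is simple to work through. The short sketch below uses only the figures from the text (80 percent and 100,000 documents); the split_workload helper is hypothetical.

```python
def split_workload(total_documents, automation_percent):
    """Split a document volume into automated and manually keyed counts."""
    automated = total_documents * automation_percent // 100
    manual = total_documents - automated
    return automated, manual


automated, manual = split_workload(100_000, 80)
print(automated, manual)  # prints: 80000 20000
```

In other words, operators would still key 20,000 documents, so the case for automation rests on how the cost of handling those exceptions compares with the cost of keying everything.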
While efficiencies have greatly improved, data capture from images can still be a costly,
time-consuming process. It's no surprise, then, that many organizations are turning to electronic
forms that can be published and accessed online. Recognizing the advantages and appeal of e-forms,
many forms processing vendors have added options or built-in tools for gathering data
electronically. One advantage of this integrated approach is that data from any source can be
validated and exported through a single workflow. Stand-alone e-forms systems are also available,
with Web- and PDF-based electronic forms being two popular means of gathering data online.
How to Use This Guide
Nearly every content management vendor that addresses document imaging offers a built-in or
optional document capture front end. Depending upon your application, this tool might be more than
sufficient for your needs. This product guide addresses only third-party products that are sold
apart from management systems. With few exceptions, these third-party tools tend to be more robust
or specialized in some way. For example, third-party products might offer more sophisticated
recognition technologies, higher-volume processing, specialized PDF conversion features or more
options for exporting and routing data and/or images. In addition, while you can use simple document
capture tools as a starting point for manual data entry, few content management vendors would claim
to offer robust data capture (forms processing) capabilities.
Capture starts with scanning. Thankfully, the latest scanners generally offer great image
quality. Many document capture products and all data capture products still incorporate image
processing features such as despeckle, deskew, background removal, text enhancement, and so forth,
but such features have been largely commoditized. As a result, this table doesn't address image
processing features.
Most of the categories in the table on pages 36 and 37 are self-explanatory. Some products are
exclusively aimed at document capture, some at data capture, but it is not uncommon for forms
processing products to handle both tasks. If a product claims to handle both challenges, make sure
it has all the features you'll need. For example, many document capture systems offer full-text OCR
and most have batch export capabilities that will place images and indexes in your back-end
management system; some data-capture systems don't include these features. On the other hand, many
document capture systems have built-in recognition engines, yet they don't have robust validation
and data export features required for forms processing.
Look beyond what you need to do today when considering features such as input and output formats.
Will you want to bring electronic documents and e-mails into the same capture stream alongside
bitonal images and faxes? Will you want metatags or captured data to be transformed into XML? Are
you likely to upgrade databases, and is there a plan to roll out an ERP system? Are prebuilt
integrations or export modules available for the technology you already have in place?
Also consider the possibility of bringing forms online. Some data capture systems support
electronic forms and documents as an input source, while others also have modules for publishing
electronic forms. The table on page 39 provides detailed coverage of stand-alone e-forms processing
systems and tools.
Consider Distributed Capture
One contentious issue among capture vendors is the question of support for distributed scanning
and validation. Nobody questions whether distributed approaches make sense. By capturing documents
at their source and sending images electronically, you can save time, improve service and expedite
financial transactions involving millions, if not billions, in revenue. Once information is captured
electronically, you can save again by sending high-volume data entry and validation work to low-cost
labor markets.
The real question is, how do you support a distributed architecture? Virtually every product
listed here can support distributed use if you provide a WAN or virtual private network (VPN) with
all the required connections. On the other hand, some vendors have developed combinations of
Internet-server-based systems and thin clients or browser plug-ins designed to simplify matters. In
most cases, these features are designed to let untrained business users choose a document type from
a drop-down menu, and the system will then automatically initiate the proper scanner settings,
workflow steps and indexing/data capture processes. Similarly, some data capture vendors have
created modules that harness the Internet to support remote validation. Because everything is
administered and deployed from the central system, you have complete control over the workflow and
you don't have to worry as much about remote software installs and support issues.
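One way to picture the drop-down approach is as a centrally administered table that maps each document type to its scanner settings and workflow steps. The sketch below is purely illustrative: the profile names, settings and step labels are hypothetical and are not drawn from any product in this guide.

```python
# Hypothetical capture profiles, maintained on the central system.
CAPTURE_PROFILES = {
    "invoice": {
        "dpi": 300,
        "color_mode": "bitonal",
        "workflow": ["scan", "ocr", "validate", "export"],
    },
    "photo_id": {
        "dpi": 200,
        "color_mode": "color",
        "workflow": ["scan", "index", "export"],
    },
}


def profile_for(document_type):
    """Look up the settings to apply when a user picks a document type."""
    try:
        return CAPTURE_PROFILES[document_type]
    except KeyError:
        raise ValueError(f"No capture profile defined for {document_type!r}") from None


# The remote user only chooses "invoice"; the rest is applied automatically.
profile = profile_for("invoice")
print(profile["dpi"], profile["workflow"])
```

Because the table lives in one place, changing a setting for every remote site is a single edit rather than a round of software updates.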
While thin-client/browser-based systems can offer advantages in terms of ease of use, ease of
deployment and flexibility, they are generally slower and less capable of handling production
volumes and speeds than conventional thick-client software. If supplying network connections and
supporting software at all of your distributed locations isn't a big deal, thick clients may be a
better choice. Another option some vendors choose is to supply Citrix or other terminal servers that
let workers access a central system from a remote location.
Look for the Vendor's Specialties and Strengths
If your needs are routine, you might find more than enough capabilities in a bundled or low-cost
capture system provided by your content management vendor. For many of the third-party products
included in this chart, specialization is the name of the game. And who wouldn't want to consult a
specialist? If your company processes medical claims, by all means choose a vendor that has
claims-specialized software and lots of experience handling claims. Look for relevant industry case
studies online and ask for contact information for reference customers. It's helpful to try before
you buy. Some companies offer trial software, or, if you're contemplating an enterprisewide
deployment, you can start with a pilot project or departmental rollout before you sign a big
contract.
Our product guide is a starting point rather than the final litmus test for your selection, but
it's designed to help narrow the search to the candidates that best fit your needs.