March 2000
New Directions in Data Capture
By Penny Lunt
Do you want to put those high-volume forms on the Internet to ease the burden of paper
processing? Do you have handfuls or scores of locations all shipping paper off to a central site? Are
you tired of sorting and batching paper-based documents that are similar but not identical (e.g.,
purchase orders)?
The latest document and data capture solutions offer answers for all these challenges. They also offer
more built-in functionality and customization features that will give your specific application that
much more power and efficiency.
The basic things you look for in a data capture or forms processing application are accuracy, speed,
scalability and reliability. Most forms processing software products now provide image enhancement,
forms ID, OCR and ICR (machine print and handwriting recognition), database lookups, rules and
validations, and key from image. Reviewing the newest features in this year's crop of upgrades,
we found three common themes: e-savvy capture, remote capture and freeform recognition. To go beyond
the marketing spin, we've tracked down users who have actually tried these new features. Read on
to learn from their experience and advice.
Processing Paper, Fax and Internet Input Streams
When drug companies want research on what drugs doctors prescribe, they often turn to Health Products
Research (HPR) in Whitehouse, NJ. Merck, for example, might want a report that shows its popularity
versus competitors and the sales performance of each of its drugs. HPR's surveys are as long as
12 pages, with questions on the front and back. They contain check boxes, bubbles and handwriting
(where the doctors explain why they prescribe particular drugs). HPR receives 600 to 1,000 of these
surveys a day.
HPR wanted to start accepting electronic as well as paper forms. "When doctors fill forms out
electronically, it's easier on us," explains Bret Piano, executive director, software
development. "There's less to verify and it's easier than reading handwriting.
Doctors' handwriting isn't the easiest thing to read." On the other hand, not all
doctors have access to the Internet or email. Piano estimates that only 35% of doctors have computers.
So it was equally important to continue mailing and, in the future, faxing the surveys.
Last September, HPR purchased Teleform Enterprise (which starts at $8,995) from Cardiff Software
(www.cardiffsw.com), and they have since implemented paper-, HTML- and PDF-based forms processing.
Paper forms don't have to be batched. They are identified and processed automatically.
"We just throw all the mail we get into the scanner (an 8080D from Bell & Howell) and
Teleform recognizes what type of form each one is," says Piano. The software can handle forms
designed within the software as well as pre-existing forms. It accepts address changes and status
changes (such as a doctor's retirement or death) and updates the HPR database accordingly.
Piano says he liked the way he can write "helper applications" to modify Teleform. These are
written in Visual Basic and access Teleform's Pervasive SQL database. HPR added a helper
application that lets two operators verify different fields on the same form at the same time. One
person might be looking only at company names while the other might be looking only at email
addresses. HPR is also customizing the dictionaries used for validation and word fill-in. If the
operator wants the company "Pfizer," they can simply type in "Pf" and the rest
is automatically filled in, saving keystrokes. "We wrote a little application that goes into
the back end and updates those tables," he explains. "There might be a new company called
'Pfizer-Smith.' We can update that in the dictionary."
HPR has used Cardiff's
HTML- and PDF-based modules for placing forms online. Piano reports no glitches in using the HTML
forms module. They use ASP scripts to get Teleform to read emails delivered from the Web server.
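For readers curious how the dictionary word fill-in described above might work, here is a minimal Python sketch. It is not Teleform's implementation or HPR's helper application (those were written in Visual Basic against a Pervasive SQL database); the class and method names are hypothetical. It assumes a field is filled in only when the typed prefix matches exactly one dictionary entry.

```python
from bisect import bisect_left, insort


class FillInDictionary:
    """Sorted word list with prefix completion (hypothetical helper)."""

    def __init__(self, words=()):
        self._keys = []     # lowercase keys, kept sorted for binary search
        self._display = {}  # lowercase key -> original spelling
        for word in words:
            self.add(word)

    def add(self, word):
        """Add a new entry, e.g. when a new company name appears."""
        key = word.lower()
        if key not in self._display:
            insort(self._keys, key)
        self._display[key] = word

    def matches(self, prefix):
        """Return every entry that starts with the typed prefix."""
        p = prefix.lower()
        i = bisect_left(self._keys, p)
        found = []
        while i < len(self._keys) and self._keys[i].startswith(p):
            found.append(self._display[self._keys[i]])
            i += 1
        return found

    def complete(self, prefix):
        """Return the single match, or None if the prefix is ambiguous or unknown."""
        found = self.matches(prefix)
        return found[0] if len(found) == 1 else None


companies = FillInDictionary(["Merck", "Pfizer"])
print(companies.complete("Pf"))       # Pfizer
companies.add("Pfizer-Smith")         # the dictionary update Piano describes
print(companies.matches("Pf"))        # ['Pfizer', 'Pfizer-Smith']
print(companies.complete("Pfizer-"))  # Pfizer-Smith
```

Once "Pfizer-Smith" is in the dictionary, "Pf" alone no longer identifies a single company, so in this sketch the operator would have to type further or pick from the list of matches.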
Thus far, Teleform PDF+Forms has been used for one project with mixed results. "We liked
[PDF+Forms] because we could email the file to the doctor and the doctor could fill it out on a PC or
laptop or print it out and fill it out on a plane, then send it back at their leisure by email or
fax," Piano says. "We could design the form once and put it on the Web and paper."
There were, however, some technical hiccups. HPR had difficulty populating Adobe's proprietary
FDF (file data format) with their own data to create the forms. And some doctors with older versions
of Adobe Acrobat Reader were unable to fill out the forms. Piano says HPR will figure out the
workarounds and try PDF+Forms again.
Piano has been satisfied with Teleform's accuracy and speed. On one project, 9,000 forms were
returned from the post office with notices that the doctors had moved. It took a mere three hours to
scan all those forms and collect the doctors' ID numbers. Only 200 contained questionable
characters that needed to be verified. Soon HPR will try Teleform's fax feature, which pulls
faxed-in information from a fax server and processes it like any other form.
E-Savvy Capture
Description: In this scenario you can use a single platform to capture and
process paper, Web, fax, email and/or EDI input streams. Everything is processed with the same rules
and can be exported to the same or, in some cases, multiple destinations.
Benefits: Consistent data through uniform application of rules across all input streams. Single
investment and training program saves time and money. Forms have a consistent look and feel for
customers, partners and employees no matter how they interact with you. Complete reports spanning all
input streams support better decision making.
Vendors: Cardiff Software (www.cardiff.com) supports HTML- and PDF-based online forms as well as fax
input (see case study above). OCR For Forms from Microsystems Technology (www.microsystemsonline.com)
lets you create online entry screens and Internet forms. Data collection is handled through a Java
applet or dynamic HTML. Three more e-savvy offerings are planned for the second quarter. Captiva
Software (www.captivasw.com) will add Internet forms support to FormWare 3.0. Input Software
(www.inputsoftware.com) will support both capture and distribution of data through EDI, BizTalk, XML
and e-mail in its InputAccel product. FormWorks 3.5 Internet Edition from Recognition Research
(www.rrinc.com) will provide Web-based image retrieval and Web-based forms with online field
validations and lookups. It will incorporate EDI features as well as all the FormWorks paper forms
processing capabilities.
Distributed Capture
Description: Remote capture lets you scan forms and other documents at satellite
offices and pass the images over a LAN, a WAN or the Internet. Paper is retained locally while
indexing, validation and processing can be handled centrally.
Benefits: Rapid return on investment through savings on shipping costs and faster
processing and approval time. This isn't entirely new, but it's just catching on at banks,
brokerages, trucking companies and other organizations that need quick turnaround times.
Vendors: Kofax supports distributed capture through Ascent Capture 3.0 (see the J.C. Bradford case
study). Input Software not only supports distributed capture with InputAccel, it also lets you export data
to multiple systems in multiple locations, whether databases, ERP systems or document archives.
Microsystems Technology's OCR for Forms supports remote scanning over the Internet and local
networks. Captiva's FormWare 3.0, to be released in April, will include remote scanning clients
that will communicate with the central server via File Transfer Protocol (FTP). Cardiff's
Teleform system supports fax-based capture.
Freeform Recognition
Description: The ability to capture and process similar forms that don't follow a
designated template, such as invoices from different suppliers or purchase orders from different
customers.
Benefits: Lets you automate processing of unstructured forms that until recently required manual
keying. Purchase orders, delivery logs and certain legal documents can have their data automatically
stripped out and entered in a workflow.
Vendors: Mitek (www.miteksys.com) and Ceresoft (www.ceresoft.com) both offer technology they call
"document understanding." Input Software is adding technology licensed and enhanced from
Mitek's Doctus and Cogniform products to InputAccel 3.0, which will be released in April. Captiva
will be releasing its own freeform recognition system with FormWare 3.0, also set for an April debut.
Capturing Data & Docs From Remote Locations
New account documents at brokerage firm J.C. Bradford used to travel by interoffice envelope from 80
branches to the headquarters mailroom in Nashville. From there, the documents entered a workflow that
passed through several departments. At various points in between, many of these documents were lost
and had to be recreated by branch staff or, worse, prospective customers. Lost documents also meant
regulatory trouble when a stock exchange or the SEC made one of their frequent, random compliance
checks.
To rectify this problem, the firm sought a remote capture and forms processing system that would let it
scan new account documents at the branches and index and process the images quickly at headquarters.
"These documents have to be processed in a timely manner," says Thurman Bush, imaging
administrator. About 350 to 400 new accounts are opened a day, generating around 10,000 documents.
As of late January, 22 of the branches had begun scanning these documents and sending them to
headquarters using Ascent Capture 3.0 from Kofax (www.kofax.com). Eventually all branches will migrate
to the new system. Ascent Capture 3.0 software offers image processing features, but Bradford has
kept things simpler for remote operators by deploying Fujitsu scanners featuring Kofax's Virtual
ReScan technology. Virtual ReScan is a combination hardware and software system that makes automatic
improvements in image quality by referring to grayscale versions of the images. "That makes the
images trouble free," Bush says.
At each branch, the staff who handle scanning sort the documents into three batches: new account
forms, W9 forms and all others. The images are sent via T1 lines to the Nashville data processing
center after midnight, when network traffic is minimal. At the central server, the Kofax software
identifies and processes the new account forms and W9s as forms. Other documents are sent to indexing
workstations and then into an Eastman Software workflow system. If there's a problem with an
improperly scanned document, operators email that image back to the remote site with instructions to
rescan the documents. The branches keep the paperwork onsite for five days just in case. Once
processed, the imaged documents are stored offsite for seven years to meet SEC requirements.
Using the new system, J.C. Bradford has cut FedEx and other shipping costs and improved
efficiency. The next projects will be automating purchase orders, expense reports and anything else that
is now sent through interoffice mail.
Document Understanding
Most forms processing products have automatic form identification that looks for registration marks,
certain words or numbers or even the topology of the page. This type of form ID works great if your
forms have a consistent look to them. If your forms are things like invoices or purchase orders from
different companies with different layouts and terms, such automatic ID doesn't work. Freeform
recognition lets you process unstructured forms. This is new technology and no vendor was able to
provide a customer reference for it at press time. However, we think this feature could be valuable to
readers, saving the time of separating paper into batches.
Ceresoft's DocAgent software classifies documents into three groups. The first is forms that have
a similar format but have an indefinite number of items in a particular column, such as phone bills of
the same design. The software allows for an undefined column length. The second is forms that have a
similar but not identical layout, such as varying but similar health care claim forms. Here the
software uses a flexible version of template matching. The third type is forms that have similar
content but very different layouts, such as invoices from different companies. Here DocAgent OCRs the
entire page and applies an "Intelligent Script" to determine what data elements need to be
extracted and how to find them.
Mitek's version, Cogniforms, uses machine learning. You scan in samples of a form and highlight
the data elements that need to be captured. Cogniforms automatically learns the characteristics of the
document and picks up clues about where to find the data. For example, if every sample has the words
"Go to:" to the left of a billing address, then it will look for "Go to:" on
future forms. You can highlight fields and rows and tell the system that their lengths will vary,
which accommodates things like phone bills. Depending on the forms and how alike they are, Cogniforms
could use rules to tell very similar forms apart. For forms that are all completely
different, such as invoices from 300 different companies, you might be able to create a few templates
that cover groups of the forms.
Captiva will be introducing freeform recognition with the next
release of FormWare, 3.0. Their technology will OCR/ICR a region of the document, several regions or
the entire document while using fuzzy searching to find a key word or data pattern such as
"invoice date." This works for completely unstructured documents such as invoices and
purchase orders. It also works for forms that have some fixed data elements and some floating ones.
Dialog boxes let you tell the software where to look for each element. FormWare could already
handle forms with sections of different lengths, such as phone bills, with "table zones."
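To make the fuzzy keyword search idea concrete, here is a minimal Python sketch. It is not Captiva's, Mitek's or Ceresoft's algorithm. It assumes the page has already been OCRed into lines of text, slides a window the length of the label across each line, and scores each window with the standard library's difflib.SequenceMatcher. The function names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def find_label(lines, label, threshold=0.8):
    """Return (line_index, start, score) for the best fuzzy match of label, or None."""
    target = label.lower()
    n = len(target)
    best = None
    for idx, line in enumerate(lines):
        text = line.lower()
        # Slide a label-sized window across the line and score each position.
        for start in range(0, max(1, len(text) - n + 1)):
            window = text[start:start + n]
            score = SequenceMatcher(None, target, window).ratio()
            if score >= threshold and (best is None or score > best[2]):
                best = (idx, start, score)
    return best


def value_after_label(lines, label, threshold=0.8):
    """Return the text that follows the label on its line, or None if not found."""
    hit = find_label(lines, label, threshold)
    if hit is None:
        return None
    idx, start, _ = hit
    rest = lines[idx][start + len(label):]
    return rest.strip(" :\t") or None


# OCR output with a typical misread: the "i" in "Invoice" came back as "1".
page = ["ACME Supply Co.", "Invo1ce Date: 03/14/2000", "Total due 118.40"]
print(value_after_label(page, "invoice date"))  # 03/14/2000
```

The threshold is the trade-off to tune: set it too low and unrelated text is mistaken for the label, set it too high and ordinary OCR errors cause the field to be missed.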
More Innovations
This article focused on three types of upgrades in forms processing products. But manufacturers have
made other improvements in their software, and several plan to unveil these upgrades at the upcoming
AIIM show in New York, April 10-12.
Key-free indexing: Input Software's InputAccel 3.0, to be announced at AIIM, will include a
new indexing module that will let you simply highlight the required fields rather than type them in
(they'll be automatically OCRed and processed). Input will be showing a beta version of
DynamicInput at the AIIM show, too. This will provide a Web form that doesn't look like a form;
instead, it will provide more intuitive screens.
Faster form ID: Microsystems Technology is soon to release a new form ID feature that is said to
improve throughput. It looks at the topology of a form and puts secondary and tertiary maps behind it
that provide a claimed 99.9% accuracy.
Color support and QC: Recognition Research's FormWorks 3.5 Internet Edition (scheduled for
AIIM) will support BancTec scanners and scan applications and the Siemens ScanStar scanner, which
provides a scan module for dropping out mixed colored forms. RRI will also unveil a new SQC Tool for
quality control and reporting.
Recognition improvements: Top Image Systems (www.topimagesystems.com) has added two new
elements to its AFPSPro software. SuperICR provides trainable ICR voting. Multiple ICR engines vote on
their recognition of characters. Voting is continuously compared to truth data tables until the ICR
engines learn to vote better. That reduces false positives. Fixing a false positive costs ten
times as much as simply having someone type in the character, according to TIS America
president Joseph Busque. The second addition, the Just ICR engine, is a trainable software
recognition engine. It was created for an application in Japan that had a stamp on it. By scanning the
stamp several times, they taught the system to recognize the stamp as a character. You could take a
bizarre font, scan it in and train the software to read that.
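The voting idea can be sketched in a few lines of Python. This is not how SuperICR works internally, which the vendor does not describe. It assumes each engine's vote is weighted by its accuracy on a truth data set, and that a character is accepted only when the winning reading holds a minimum share of the weighted vote. The engine names, function names and 0.6 share are hypothetical.

```python
from collections import defaultdict


def train_weights(engine_outputs, truth):
    """Weight each engine by its accuracy on characters whose true value is known."""
    weights = {}
    for name, guesses in engine_outputs.items():
        correct = sum(g == t for g, t in zip(guesses, truth))
        weights[name] = correct / len(truth)
    return weights


def vote(readings, weights, min_share=0.6):
    """readings maps engine name -> character read.

    Returns the winning character, or None when agreement is too weak
    and the character should go to a key operator instead.
    """
    tally = defaultdict(float)
    for name, char in readings.items():
        tally[char] += weights.get(name, 0.0)
    total = sum(tally.values())
    if total == 0:
        return None
    char, score = max(tally.items(), key=lambda kv: kv[1])
    return char if score / total >= min_share else None


# Three hypothetical engines scored against four characters of truth data.
truth = "4455"
weights = train_weights({"a": "4455", "b": "4459", "c": "9955"}, truth)
print(weights)                                         # {'a': 1.0, 'b': 0.75, 'c': 0.5}
print(vote({"a": "4", "b": "4", "c": "9"}, weights))   # 4
print(vote({"a": "5", "b": "9", "c": "9"}, weights))   # None
```

In the last call the two weaker engines outvote the strongest one, but not by enough, so the character is sent to an operator rather than accepted. That is the behavior Busque's cost argument favors: keying a doubtful character is cheaper than correcting a false positive later.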
Clearly, forms processing products are heading in an electronic direction. Next month we will take a
closer look at e-forms and how they can be processed with or without traditional paper forms.
Upgrading and Customizing a System
Medical Diagnostics had a forms processing system capable of handling 30,000 double-sided forms
a day, but found out that it was impossible to upgrade. Having purchased the system only two years
earlier, the Miami office of this California-based company learned an expensive lesson.
"We didn't want to have any problems with coding or upgrades," says Art Kozel, systems
engineer. "We wanted to make sure the accuracy was good, with less than .05% substitution rate.
[We didn't want] bogus data in our database." Medical Diagnostic Systems manufactures
medical quality control products that are used by clinical laboratories worldwide to test and register
medical diagnostic tools. The Miami division provides quality assurance programs for customer labs.
They collect instrument readings from labs all over the world, tabulate the results and deliver the
data back to the laboratories.
The company needed a system that could recognize hand-written numbers and query a database to find an
acceptable range for each type of test. (This is called a "one-to-many" lookup versus the
"one-to-one" lookup that is common in forms applications.) They chose OCR for Forms from
Microsystems Technology, which let them easily create extra validations while leaving an easy path for
future upgrades. The system, which was installed last summer, reads a barcode at the top of each
form. The barcode provides references to the range lookups for each row on the form. OCR for Forms
recognizes the numbers, links up to Medical Diagnostics' Oracle database via ODBC and requests
the appropriate ranges. Numbers that fall out of range go to a verifier, as do any numbers under the
95% confidence level.
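The lookup-and-verify rule described above can be sketched as follows. This is not Medical Diagnostics' Visual Basic script or its Oracle schema. It uses an in-memory SQLite table as a stand-in for the ODBC connection, and the table layout, form code and function names are hypothetical. Only the 95% confidence floor and the out-of-range rule come from the article.

```python
import sqlite3

CONFIDENCE_FLOOR = 0.95  # readings under 95% confidence go to a verifier


def build_demo_db():
    """Stand-in for the production database: one form code maps to many row ranges."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE ranges (form_code TEXT, row_no INTEGER, low REAL, high REAL)"
    )
    db.executemany(
        "INSERT INTO ranges VALUES (?, ?, ?, ?)",
        [("QC-100", 1, 3.5, 5.5), ("QC-100", 2, 70.0, 110.0)],
    )
    return db


def needs_verification(db, form_code, row_no, value, confidence):
    """True when a recognized number should be routed to a human verifier."""
    if confidence < CONFIDENCE_FLOOR:
        return True
    row = db.execute(
        "SELECT low, high FROM ranges WHERE form_code = ? AND row_no = ?",
        (form_code, row_no),
    ).fetchone()
    if row is None:
        return True  # assumption: no range on file means a person decides
    low, high = row
    return not (low <= value <= high)


def check_form(db, form_code, readings):
    """readings is a list of (row_no, value, confidence); return rows to verify."""
    return [r for r, v, c in readings if needs_verification(db, form_code, r, v, c)]


db = build_demo_db()
readings = [(1, 4.2, 0.99), (2, 140.0, 0.99), (1, 4.2, 0.80), (3, 1.0, 0.99)]
print(check_form(db, "QC-100", readings))  # [2, 1, 3]
```

Here the barcode value ("QC-100" in the sketch) selects a different acceptable range for each row of the form, which is the one-to-many lookup the article refers to.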
Medical Diagnostics' in-house software development team used simple
Visual Basic scripting to create the database queries between the Unix database and OCR For Forms on
NT. The company had other requirements. For example, if someone deletes a form, they want to know who
deleted it. That required more VB scripting.
About 40% of Medical Diagnostics' forms arrive by
fax; the other 60% come in through the mail. The system includes two identification workstations, two
Kodak Imagelink 500 Scanners, four extract workstations and ten verify workstations.
Kozel tested the system with a run of 12,000 customer forms. "We checked to see whether the
system could handle the volume, then went character-by-character to check how the system was
doing," Kozel says. "First we had problems with 4s and 5s. Then [Microsystems] switched us
to the Nestor [recognition engine], which did well with the numbers.
"The overall process has dramatically improved since we made the change," Kozel reports.
"The old system had problems with higher volumes. With this system the work is taking us a
fraction of the time."