June 2001
SCAN & CAPTURE
Make Use of PDF Image Content
by Adam Throne
PDF is the de facto standard for publishing images online, but many users are unaware that there are three types of PDF files: Image Only, Image Plus Text and PDF Normal. Image Only files do not include optical character recognition results, so they are usually the smallest of the three file types.
While some businesses and government agencies choose to post Image Only PDFs online - to speed viewing and spare network bandwidth - this leaves users frustrated when they attempt to search or copy text. Enter iCopy, an Acrobat Reader plug-in from Image Solutions Inc. (ISI) of Morristown, NJ.
ICopy performs optical character recognition on image files within Acrobat Reader, and it can copy and paste the resulting text (as well as images) into any word processor. ICopy lets you extract as much as an entire document, and it also works with the other two PDF file types by copying the available text.
Quick Scan
Supplier: Image Solutions Inc., Morristown, NJ, 973-292-6444, www.imagesolutions.com
Product: iCopy
Description: Acrobat Reader plug-in that lets you recognize (OCR) and copy text from Image Only or text-based PDF files.
Strengths: Fast, reliable, inexpensive recognition of text within PDF Image Only files and easy copying of images. Support for nine languages.
Weakness: Doesn't recognize multicolumn formatting.
Price: Starts at $95 per seat; volume discounts available.
ICopy's plug-in design makes it easy to use. Once it is installed (on Acrobat Reader versions 3.02 and higher), three buttons are added to the Acrobat Reader tool bar. One button lets you select the portion of text you wish to copy; the image snippet is immediately recognized and the results are saved to the clipboard for pasting into a word processor. If you wish to copy multiple paragraphs, you can maintain the paragraph structure by selecting an "additional line breaks between paragraphs" setting.
Another button handles multipage documents. You can choose to use optical character recognition (OCR) for some or all of the pages in the document, and you can save the results to the clipboard or to a text file. We were able to use OCR on a text-heavy 100-page document in about four minutes on a 650 MHz PC.
You can adjust for varying resolution levels up to 400 dpi, but as in any OCR operation, accuracy varies depending on type size, font and image quality. ICopy recognized 10- and 8-point type with high accuracy, but was less accurate when recognizing 6- and 4-point type.
A properties menu offers choices including an OCR confidence setting. When confidence falls below the threshold you select, iCopy can insert a character of your choice (such as an asterisk) into the text. Because words marked this way fail a spell check, you can use a spell checker to find and review the doubtful results. There is also an anti-alias text setting for grayscale, and iCopy works in nine major European languages.
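The confidence threshold can be pictured with a short Python sketch. This is not iCopy's code or API: the function name, the (character, confidence) input format and the 0-to-1 confidence scale are assumptions made for illustration, and the sketch assumes the marker takes the place of the doubtful character, which the product may or may not do.

```python
from typing import Iterable, Tuple


def mark_low_confidence(
    chars: Iterable[Tuple[str, float]],
    threshold: float,
    marker: str = "*",
) -> str:
    """Return the recognized text with doubtful characters flagged.

    Hypothetical helper: each input item is (character, confidence),
    with confidence assumed to lie between 0.0 and 1.0.
    """
    out = []
    for ch, confidence in chars:
        # Keep characters at or above the threshold; flag the rest.
        out.append(ch if confidence >= threshold else marker)
    return "".join(out)


# Made-up OCR output for the word "scan" with one doubtful letter.
recognized = [("s", 0.98), ("c", 0.95), ("a", 0.41), ("n", 0.97)]
print(mark_low_confidence(recognized, threshold=0.80))
```

A word flagged this way, such as "sc*n", is not in any dictionary, so a spell checker will stop on it.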
A third iCopy button allows you to copy and paste photographs, graphics, nonmachine-readable text or tables. You can adjust the resolution and alias settings of these images.
The one thing that frustrated us about iCopy is that it won't recognize multicolumn formatting - though this is a common complaint about many PDF optical character recognition systems. The OCR engine scans all the way across the page - stringing together disjointed lines of text - rather than finishing one column and then returning to the top of the page to read another. The only way around this is to manually copy and paste one column at a time.
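The difference between the two reading orders can be shown with a small Python sketch. It does not reflect how iCopy's OCR engine is built: the fragment coordinates, the function names and the known gutter position (`gutter_x`) are all assumptions chosen to keep the example short.

```python
from typing import List, Tuple

# (x, y, text): x grows to the right, y grows down the page.
Fragment = Tuple[float, float, str]


def across_page(fragments: List[Fragment]) -> List[str]:
    """Read every line straight across the page, ignoring columns."""
    ordered = sorted(fragments, key=lambda f: (f[1], f[0]))
    return [text for _, _, text in ordered]


def column_by_column(fragments: List[Fragment], gutter_x: float) -> List[str]:
    """Read the left column top to bottom, then the right column.

    gutter_x is the assumed horizontal position of the gap between
    two columns; a real system would have to detect it.
    """
    left = sorted((f for f in fragments if f[0] < gutter_x), key=lambda f: f[1])
    right = sorted((f for f in fragments if f[0] >= gutter_x), key=lambda f: f[1])
    return [text for _, _, text in left + right]


# A made-up two-column page with two lines per column.
page = [
    (50, 100, "The quick brown"),
    (320, 100, "Prices start at"),
    (50, 120, "fox jumps."),
    (320, 120, "$95 per seat."),
]
print(" ".join(across_page(page)))
print(" ".join(column_by_column(page, gutter_x=300)))
```

The first ordering strings together lines from both columns, as described above; the second mirrors the manual workaround of copying one column at a time.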
ICopy starts at $95 per seat, with volume discounts starting at five or more users. While it has been most popular in governmental and legal research applications, iCopy is an answer for anyone who frequently encounters PDF files that can't be searched or copied.