November 2000
AT YOUR SERVICE:
Compact File Choices for Web Delivery
By Julie Gable
Getting high-quality paper documents onto the Web involves
trade-offs. Cost, quality and file size are interwoven issues that can
tangle decision making. New products with better compression algorithms
for text and photographs can help, but not without a price. Here's what
to consider.
Is the document collection's value greater than the cost to
convert it to a Web-ready format? The Portable Document Format (PDF)
developed by Adobe, San Jose, CA, is an easy choice for putting paper
documents on the Web because it retains the original's look and feel.
With PDF, however, conversion costs rise with the need for smaller
files. Here's why.
Acrobat can produce three file types: PDF Image Only, PDF Normal
and PDF Image Plus Text. [The last two have been renamed PDF Formatted
Text and Graphics files and PDF Searchable Image files, respectively, in
Acrobat Capture 3.0.] According to Tony McKinley, author of "Paper to
the Web," a typical page of text scanned at 300 dpi and compressed with
Group 4 is 50 K. When converted to an Image Only file, the PDF "wrapper"
adds another 5 K for a file size of about 55 K per page. Any user with
Acrobat Reader can view the image.
Image files converted to PDF Normal undergo optical character
recognition (OCR) so that the text becomes searchable, and the image
file gets discarded. The resulting file size of about 10 K per page is
easily downloadable. One difficulty with OCR on a PDF file is that the
recognized text is correctable but not editable; that is, changes to the OCR text will not
automatically flow from line to line as they would with a word
processing file. As with any OCR job, the file requires cleanup, a
manual process that adds costs to conversion.
PDF Image Plus Text format retains the image file and places the
converted text file behind it. The user searches the text file but
actually sees the image. The trade-off here is that the text file may
not require as much cleanup, but the file size is about 80 K to 90 K per
page. A 100-page document scanned to this format, at roughly 8 MB to
9 MB, will take ages to arrive across a LAN, will require several
minutes to open and will exceed the capacity of a floppy disk.
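
The per-page figures quoted above can be turned into whole-document estimates with a few lines of arithmetic. The Python sketch below is illustrative only: the per-page sizes come from the text, while the 1,440 K floppy capacity and the function names are assumptions made for the example.

```python
# Per-page sizes in K as (low, high), taken from the figures quoted in the text.
PAGE_SIZE_K = {
    "PDF Image Only": (55, 55),
    "PDF Normal": (10, 10),
    "PDF Image Plus Text": (80, 90),
}

# Assumption: a standard 3.5-inch high-density floppy holding 1,440 K.
FLOPPY_K = 1440


def document_size_k(file_type, pages):
    """Return the (low, high) size estimate in K for a whole document."""
    low, high = PAGE_SIZE_K[file_type]
    return low * pages, high * pages


def fits_on_floppy(file_type, pages):
    """Return True if even the high estimate fits on one floppy disk."""
    _, high = document_size_k(file_type, pages)
    return high <= FLOPPY_K


if __name__ == "__main__":
    for name in PAGE_SIZE_K:
        low, high = document_size_k(name, 100)
        print(f"{name}: {low} K to {high} K, fits on floppy: "
              f"{fits_on_floppy(name, 100)}")
```

Under these assumptions, only the PDF Normal version of a 100-page document fits on a single floppy; the other two formats overshoot it several times over.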
PDF may not be the right choice if the documents to be scanned
contain many color or halftone photographs. The scanning resolution of
300 dpi needed for high-contrast areas such as text is actually higher
than the 72 dpi to 100 dpi resolution needed for a photograph's
continuous tone. The photos don't compress well using Group 4, so the
scanned page file can be a megabyte or more even after compression.
[Adobe has improved on color and graphics compression with Acrobat
Capture 3.0, but not enough to defeat bandwidth challenges.] Conversely,
JPEG compression, which does pixel averaging for photos, erodes text
quality. PDF Normal will reduce file size in this instance, but the
trade-off is the labor involved in OCR cleanup.
Alternatively, a service bureau can scan each page once to
capture text, then again on a color scanner to capture the photo. The
photo's color image is clipped with a tool like Photoshop and reinserted
into the text page. The manual effort can drive up conversion costs to
$5 or $6 per page. Tom Johnson, president of Root Technologies,
Princeton, NJ, faced this situation in converting The Journal of the
Acoustical Society for the Web.
"The back issues of the journal, from 1926 to 1996, had 200,000
pages, with many color and halftone photographs," says Johnson. "We
needed a way to produce 75 K files suitable for Web viewing."
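
To get a sense of scale, the quoted per-page cost of manual photo reinsertion can be multiplied by the journal's page count. The sketch below is a rough upper bound, not a figure from Root Technologies: it assumes every one of the 200,000 pages would need the manual treatment, which the account above does not state, and the function name is made up for the example.

```python
# Figures quoted in the text.
PAGES = 200_000
COST_PER_PAGE_USD = (5, 6)  # manual scan, clip and reinsert, low and high


def conversion_cost_usd(pages, cost_range):
    """Return the (low, high) total cost if every page needs manual work."""
    low, high = cost_range
    return pages * low, pages * high


if __name__ == "__main__":
    low, high = conversion_cost_usd(PAGES, COST_PER_PAGE_USD)
    print(f"Manual reinsertion for {PAGES:,} pages: ${low:,} to ${high:,}")
```

Even if only a fraction of the pages carried photographs, the total scales linearly with that fraction, which is why a format that avoids the manual step matters at this volume.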
Root chose DjVu from LizardTech, Seattle. DjVu was developed by
AT&T Labs expressly for scanning pages with text characters and
pictures. Once scanned, the two object types are placed in separate
layers and compressed with different methods. Both methods are lossy
but do not affect document readability. DjVu also eliminates redundant
character information. For example, if the character "e" appears 100
times, DjVu can store the compressed image of the character once, with
99 pointers to its other locations. The resulting files are 30 percent
smaller than TIFF images or PDF Image Only files, without the need for
the labor-intensive work associated with producing PDF Normal or color
image reinsertion. For color pages at 300 dpi that contain text and
pictures, DjVu files are generally five to eight times smaller than GIF
or JPEG.
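
The idea of storing a repeated character shape once and pointing to it elsewhere can be illustrated with a toy example. The Python sketch below is not DjVu's actual encoder, which is considerably more sophisticated; it assumes glyphs arrive as exact, already-segmented bitmaps, and every name in it is hypothetical. Each distinct shape is kept once, and every occurrence on the page becomes a small record holding a shape index and a position.

```python
def build_glyph_dictionary(glyphs):
    """Deduplicate glyph bitmaps.

    glyphs: list of (bitmap, x, y), where bitmap is a hashable value
    such as a tuple of strings. Returns (shapes, placements): shapes holds
    each distinct bitmap once, and placements holds (shape_index, x, y)
    for every occurrence on the page.
    """
    shapes = []
    index_of = {}
    placements = []
    for bitmap, x, y in glyphs:
        if bitmap not in index_of:
            # First time this shape is seen: store the bitmap itself.
            index_of[bitmap] = len(shapes)
            shapes.append(bitmap)
        # Every occurrence, including the first, is stored as a reference.
        placements.append((index_of[bitmap], x, y))
    return shapes, placements


def reconstruct(shapes, placements):
    """Rebuild the original glyph list from the dictionary and references."""
    return [(shapes[i], x, y) for i, x, y in placements]


if __name__ == "__main__":
    # A toy 3x3 bitmap standing in for the letter "e".
    E = ("###", "#..", "###")
    page = [(E, 10 * i, 0) for i in range(100)]
    shapes, placements = build_glyph_dictionary(page)
    print(len(shapes), "stored shape,", len(placements), "placements")
    print("round trip ok:", reconstruct(shapes, placements) == page)
```

A real scanner never produces two pixel-identical copies of a letter, so a practical encoder has to match shapes that are merely similar; the exact-match dictionary here only shows why the savings grow with the amount of repeated text.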
"DjVu's new searchable text feature makes it the format of choice
for scanning document images to the Web," says James Rile, president of
independent consulting firm Rile Associates, Phoenixville, PA. Rile has
done extensive work with PDF, but he found that a 32-page, full-color
magazine averaged 40.6 K per page with DjVu.
So what's the trade-off? Displaying DjVu files requires the DjVu
viewer (free at www.djvu.com or www.lizardtech.com as a one-click
download). But the viewer works only in a Web browser, not on the
desktop like Adobe Acrobat Reader, and it lacks Acrobat Reader's
navigation features. For creating DjVu files, LizardTech sells an
enterprise version of its software for $7,000 per CPU and a personal
version for $250. A workgroup version is planned.
DjVu offers an alternative to costlier conversion methods,
potentially tipping the balance for placing high-quality content on the
Web. Its smaller file sizes will appeal to users with little patience
for slow downloads, and they'll be rewarded with excellent text and
photo quality. Whether users will judge a site's content worth the
effort of loading a special viewer remains to be seen.
Julie Gable (Juliegable@aol.com), CDIA, LIT, is an independent
consultant. Product mention should not be construed as an
endorsement.