Intelligent Enterprise featuring Transform
March 2002

Information Lifecycle

Less Means More Accuracy

by Julie Gable

Recent tests suggest that the same may be true for autoclassification products. "The real issue is pulling features out of documents," says R. Kirk Lubbes, president of Records Engineering, Reston, VA, a firm that works with the intelligence community on autoclassification projects. To paraphrase Tyler, it seems that any document can be classified if you leave enough of it out.

Autoclassification software works by extracting features. Filters distinguish paragraphs and sentences, then strip out low-value words and phrases. The software generates numeric values for the remaining text based on metrics that may include keyword frequency, pattern matching, concept counting or artificial intelligence. The result is a string of numbers that makes up the document's signature.

For example, two documents score high for mention of "wizards." One also scores high for "wands" while the other mentions "templates." The wizards and wands signature is probably about Harry Potter, while the wizards and templates signature is likely about desktop application software.
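As a rough illustration of the idea, the Python sketch below strips a few low-value words and scores what remains by keyword frequency. The stop-word list, the three-term vocabulary and the two sample sentences are invented for this example; commercial products use far richer filters and metrics.

```python
import re
from collections import Counter

# Hypothetical stop-word list and vocabulary, chosen only for illustration.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "with", "for", "is", "on"}
VOCABULARY = ["wizards", "wands", "templates"]

def signature(text, vocabulary=VOCABULARY):
    """Return one relative keyword frequency per vocabulary term."""
    # Split into lowercase words and drop the low-value ones.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[term] / total for term in vocabulary]

# Two made-up documents that both mention "wizards".
novel = "The wizards raised wands and the wizards waved wands"
manual = "Setup wizards and templates help users; wizards load templates"

novel_sig = signature(novel)    # high on "wizards" and "wands", zero on "templates"
manual_sig = signature(manual)  # high on "wizards" and "templates", zero on "wands"
```

Both signatures score on "wizards," but they differ on the second and third positions, which is what lets a classifier tell the two subjects apart.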

Autoclassification software plots document signatures as vectors. Where the endpoints of vectors are close together, a cluster forms. Clusters describe related documents. Most software allows the user to define a circle around the center of a cluster (the center is called the centroid), in effect saying, "Any document whose numeric signature falls within this circumference belongs in category X."
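The sketch below shows one way the clustering rule might look in Python. It assumes plain Euclidean distance and a single user-chosen radius for every category; the category names, sample signatures and radius are hypothetical, and real products may measure closeness differently.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length signatures."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(sig, categories, radius):
    """Return the nearest category whose centroid is within radius, else None."""
    best_name, best_dist = None, radius
    for name, center in categories.items():
        d = math.dist(sig, center)  # Euclidean distance (Python 3.8+)
        if d <= best_dist:
            best_name, best_dist = name, d
    return best_name

# Hypothetical categories built from invented signatures
# in the order (wizards, wands, templates).
categories = {
    "fiction": centroid([[0.3, 0.3, 0.0], [0.4, 0.2, 0.0]]),
    "software": centroid([[0.25, 0.0, 0.25], [0.35, 0.0, 0.15]]),
}

inside = classify([0.33, 0.33, 0.0], categories, radius=0.2)  # lands in "fiction"
outside = classify([0.0, 0.0, 0.0], categories, radius=0.2)   # too far from both
```

A signature that falls outside every circle is left unclassified here rather than forced into the nearest category, which is one possible design choice among several.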

"What characteristics do you use to build the document's feature vector, recognizing that in doing so you throw data away?" asks Lubbes. "This is the real differentiator among products. The front end biases the outcome, regardless of what process is used to classify the documents."

Autoclassification is a boon in situations where large collections of diverse documents must be sorted. Acquisitions and mergers frequently involve inherited servers containing heterogeneous mixes of materials, a situation where any classification is better than none at all. Running an autoclassification engine on such a collection identifies its range of content, aiding in separating meaningful documents from useless ones. Even if some documents land in the wrong categories, the error is consistent, unlike human sorting efforts.

In situations involving high-value documents, in which the impact of classification error is significant, it may be preferable to devise specific categories and then assemble training sets of documents that establish the centroid for each category. The costs of identifying such documents can be considerable: $25 to $100 per document, by some estimates. Such high-stakes circumstances also demand that you correct misfiles. Statistical classification software recalculates a category's centroid whenever documents are added to it, so retaining incorrectly classified items causes the category to creep or drift over time.
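To see why uncorrected misfiles matter, consider a minimal Python sketch of a running centroid update. It assumes the centroid is the simple mean of all signatures filed in the category; the starting centroid, document count and misfiled signature are invented numbers.

```python
def add_document(centroid, count, sig):
    """Return the updated (centroid, count) after filing one more signature."""
    new_count = count + 1
    # Running mean: weight the old centroid by how many documents it represents.
    new_centroid = [(c * count + x) / new_count for c, x in zip(centroid, sig)]
    return new_centroid, new_count

# Hypothetical category trained on 10 documents, signatures ordered
# (wizards, wands, templates).
center, count = [0.3, 0.3, 0.0], 10
original = list(center)

# Five software documents are misfiled here and never corrected.
misfiled = [0.3, 0.0, 0.3]
for _ in range(5):
    center, count = add_document(center, count, misfiled)

# How far each coordinate of the centroid has moved from its trained position.
drift = [new - old for new, old in zip(center, original)]
```

After the five misfiles, the centroid has moved away from "wands" and toward "templates," so later software documents are more likely to fall inside the circle as well.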

Tests indicate that accuracy rises with fewer categories, with ten cited as the optimal number. An 80 percent accuracy rate is considered good, but nested subcategories dilute the rate considerably. A taxonomy that contains the main heading "streets," the subcategory "signs" and the sub-subcategory "traffic management" will actually garner an accuracy rate of 0.80 × 0.80 × 0.80, or about 51 percent.
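The arithmetic generalizes to any depth. The short Python sketch below assumes, as the example above implicitly does, that each level of the taxonomy is classified independently at the same accuracy rate; the function name is made up for illustration.

```python
def nested_accuracy(per_level, depth):
    """Overall accuracy when every level must be right, assuming independent levels."""
    return per_level ** depth

# Accuracy at one, two, three and four levels of nesting, at 80 percent per level.
rates = [round(nested_accuracy(0.80, depth), 4) for depth in range(1, 5)]
```

Under that assumption, a fourth level of nesting would push the overall rate to roughly 41 percent.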

A better gauge for acceptable accuracy is the business objective. The bottom line is that distilling the document's essence, training the software and sorting into fewer buckets all increase the chance of success.

Julie Gable (juliegable@aol.com), CDIA, LIT, is an independent consultant based in Philadelphia.




A Publication of the Network Computing Enterprise Architecture Group
Brought to you by CMP Media LLC, Copyright © 2005