March 2002
Information Lifecycle
Less Means More Accuracy
by Julie Gable
Recent tests suggest that the same may be true for autoclassification products. "The real issue is
pulling features out of documents," says R. Kirk Lubbes, president of Records Engineering, Reston,
VA, a firm that works with the intelligence community on autoclassification projects. To paraphrase
Tyler, it seems that any document can be classified if you leave enough of it out.
Autoclassification software works by extracting features. Filters distinguish paragraphs and
sentences, then strip out low-value words and phrases. The software generates numeric values for the
remaining text based on metrics that may include key word frequency, pattern matching, concept
counting or artificial intelligence. The result is a string of numbers that forms the document's
signature.
For example, two documents score high for mention of "wizards." One also scores high for "wands"
while the other mentions "templates." The wizards and wands signature is probably about Harry
Potter, while the wizards and templates signature is likely about desktop application software.
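The extraction step and the wizards example can be sketched in a few lines of Python. This is a minimal illustration using key word frequency only, not any vendor's method; the stop-word list, the three-term vocabulary and the function name signature are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list and feature vocabulary, chosen for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "with", "for"}
VOCABULARY = ["wizards", "wands", "templates"]


def signature(text, vocabulary=VOCABULARY):
    """Return key word frequencies per 100 retained words, in vocabulary order."""
    words = re.findall(r"[a-z']+", text.lower())
    # Strip low-value words before scoring what remains.
    kept = [w for w in words if w not in STOP_WORDS]
    if not kept:
        return [0.0] * len(vocabulary)
    counts = Counter(kept)
    return [round(100.0 * counts[term] / len(kept), 1) for term in vocabulary]


potter_doc = "The wizards raised wands and the wizards cheered"
software_doc = "The wizards in the templates menu"

print(signature(potter_doc))    # [40.0, 20.0, 0.0]
print(signature(software_doc))  # [33.3, 0.0, 33.3]
```

Both documents score high on "wizards," but the second and third numbers pull their signatures apart. Everything outside the vocabulary is discarded, which is the data loss Lubbes describes.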
Autoclassification software plots document signatures as vectors. Where the endpoints of vectors
are close together, a cluster forms. Clusters describe related documents. Most software allows the
user to define a circle around the center of a cluster (called the centroid), in effect saying, "Any
document whose numeric signature falls within this circumference belongs in category X."
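As a rough sketch of that rule, assume the center of a cluster is the component-by-component mean of its signatures and the user-defined circle is a radius around that center, measured as straight-line distance. The sample signatures, the radius of 15 and the function names are hypothetical; real products may measure closeness differently.

```python
import math


def centroid(vectors):
    """Mean of the vectors, component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def distance(a, b):
    """Straight-line (Euclidean) distance between two signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def in_category(sig, center, radius):
    """True when the signature falls inside the circle drawn around the center."""
    return distance(sig, center) <= radius


# Hypothetical signatures ordered as [wizards, wands, templates].
fantasy_cluster = [[40.0, 20.0, 0.0], [30.0, 30.0, 0.0], [50.0, 10.0, 0.0]]
center = centroid(fantasy_cluster)

print(center)                                                # [40.0, 20.0, 0.0]
print(in_category([38.0, 22.0, 1.0], center, radius=15.0))   # True
print(in_category([35.0, 0.0, 30.0], center, radius=15.0))   # False
```

A wider radius admits more documents into category X at the cost of more misfiles; a narrower one leaves more documents unclassified.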
"What characteristics do you use to build the document's feature vector, recognizing that in
doing so you throw data away?" asks Lubbes. "This is the real differentiator among products. The
front end biases the outcome, regardless of what process is used to classify the documents."
Autoclassification is a boon in situations where large collections of diverse documents must be
sorted. Acquisitions and mergers frequently involve inherited servers containing heterogeneous mixes
of materials, a situation where any classification is better than none at all. Running an
autoclassification engine on such a collection identifies its range of content, aiding in
separating meaningful documents from useless ones. Even if some documents land in the wrong
categories, the error is consistent, unlike human sorting efforts.
In situations involving high-value documents, in which the impact of classification error is
significant, it may be preferable to devise specific categories and then assemble training sets of
documents that establish the centroid for each category. The costs of identifying such documents can
be considerable: $25 to $100 per document, by some estimates. Such high-stakes circumstances also
demand that you correct misfiles. Statistical classification software recalculates a category's
centroid whenever documents are added to it, so retaining incorrectly classified items causes the
category to creep or drift over time.
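The drift is easy to see with a toy calculation. The sketch below assumes the centroid is a simple mean of the signatures filed in the category; the numbers are invented for illustration.

```python
def centroid(vectors):
    """Mean of the vectors, component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Hypothetical training set, with signatures ordered as [wizards, wands, templates].
category = [[40.0, 20.0, 0.0], [30.0, 30.0, 0.0], [50.0, 10.0, 0.0]]
before = centroid(category)

# A software document is misfiled here and left in place.
category.append([40.0, 0.0, 40.0])
after = centroid(category)

# Correcting the misfile restores the original center.
category.pop()
corrected = centroid(category)

print(before)     # [40.0, 20.0, 0.0]
print(after)      # [40.0, 15.0, 10.0]
print(corrected)  # [40.0, 20.0, 0.0]
```

One misfile among three good documents moves the center a quarter of the way toward the wrong subject, which in turn makes the next similar misfile more likely.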
Tests indicate that accuracy rises with fewer categories, with ten cited as the optimal number.
An 80 percent accuracy rate is considered good, but nested subcategories dilute the rate
considerably. A taxonomy that contains the main heading "streets," the subcategory "signs" and the
sub-subcategory "traffic management" will actually garner an accuracy rate of 0.80 x 0.80 x 0.80, or
about 51 percent.
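The arithmetic generalizes to any depth. The short sketch below assumes the same accuracy at every level and that errors at each level are independent, which is the simplification implied by multiplying the rates; the function name is hypothetical.

```python
def nested_accuracy(per_level_accuracy, depth):
    """Chance of landing in the right bucket at every level of a nested taxonomy."""
    return per_level_accuracy ** depth


for depth in (1, 2, 3):
    print(depth, round(nested_accuracy(0.80, depth), 3))
# 1 0.8
# 2 0.64
# 3 0.512
```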
A better gauge for acceptable accuracy is the business objective. The bottom line is that
distilling the document's essence, training the software and sorting into fewer buckets increases
the chance of success.
Julie Gable (juliegable@aol.com), CDIA, LIT, is an independent
consultant based in Philadelphia.