December 2001
Taxonomies Put Content in Context
by Russell Letson
If this were the year 2001 as predicted by Sir Arthur C. Clarke and Stanley Kubrick in the movie
"2001: A Space Odyssey," you wouldn't be reading an article about the issues surrounding
computer-powered taxonomies. Instead, you would ask an all-knowing computer about taxonomies in
plain English, and it would understand (or ask a few questions to clarify your context and
intentions). Then the computer would sort through vast data libraries at blistering speed and
provide the answer in a few seconds.
But the problem of designing systems that have a humanlike understanding of words, concepts and
categories is far from solved. It's a complex and often dauntingly technical field that computer
scientists, artificial intelligence researchers and linguists have been working on for decades. Now,
businesses facing the need to quickly and reliably find useful information in the mountains of
unstructured content held in corporate repositories (and made accessible through portals) are
looking to taxonomy technology for help.
Hadley Reynolds, research director at the Delphi Group, Boston, calls it the "infoglut problem,"
and he sees simple search, particularly keyword search, as "much too crude a weapon" against it.
"What the users of portals want when they go to a search box is answers," he says, "but they don't
get answers. They get lists of documents that allow them to spend more time trying to refine what
they get from those documents and figuring out whether it is actually relevant or not."
Sue Feldman of IDC, Framingham, MA, puts it this way: "People have trouble finding information in
invisible information spaces. ... You have a search box and no way of knowing what's behind it."
One way of structuring heaps of unorganized information is to implement a taxonomy, which serves
as a navigational tool as well as an organizational tool. With a taxonomy, "people can automatically
see what's there," explains Feldman, director of IDC's document and content technology program.
"[They can] run their eyes over the top level of the taxonomy and see whether it's a place they want
to search in or is unlikely to have anything that interests them."
As significant a role as a taxonomy plays, chances are you don't see the application itself when
you look at a portal page or search engine screen; taxonomies tend to be part of the information
infrastructure rather than the front end. But when you see a subject area divided into a set of
topics, or when a search returns not a jumble of hits but an array of possibilities sorted into
categories, it was a taxonomy engine of some sort (or human librarians) that generated the
categories.
Algorithms and Approaches
Plenty of vendors offer categorization solutions, and they engage in the inevitable arguments
about which technology is better. The taxonomy market, says Reynolds of Delphi, is "an
algorithm-based industry, and most of the companies have taken a very technology-centric approach,
which is to try to commercialize a particular algorithm or set of algorithms."
The algorithmic and computational approaches break into the following families:
Rule-based: Rule sets contain keywords and logical relationships according to which target
documents are automatically analyzed and sorted. Rules are generally designed by human experts but
may be generated by analyzing a sample body of documents. Verity offers technology using this
approach.
Statistical analysis and pattern matching: Algorithms analyze texts for the presence, frequency
and relative location of terms. The program may identify patterns that emerge from the target
documents themselves or be taught what to look for from training sets of documents. Autonomy,
Quiver, Semio and Smartlogik are among the vendors following this path.
Linguistic analysis: Algorithms identify the important linguistic and semantic elements
(understanding phrases or the parts of speech) in the target documents. This identification can be
used in conjunction with statistical analysis. Inxight and Semio have algorithms taking this
approach.
Ontologies: Human-developed data structures that combine elements of taxonomies and thesauri and
include information about relationships among terms. Target documents are analyzed against these
structures. Applied Semantics helps companies develop ontologies.
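To make the first of these families concrete, here is a minimal sketch of a rule-based categorizer in Python. It is not any vendor's product: the topic names, the keywords and the "any"/"none" rule format are all assumptions invented for illustration. Each rule says a document belongs to a topic if it contains at least one required keyword and none of the excluded ones.

```python
import re

# Hypothetical rule set: a topic matches when the text contains at least one
# "any" term and none of the "none" terms. Topics and keywords are illustrative.
RULES = {
    "sports": {"any": {"league", "tournament", "coach"}, "none": set()},
    "business": {"any": {"merger", "earnings", "shares"}, "none": set()},
    "entertainment": {"any": {"film", "album", "premiere"}, "none": {"earnings"}},
}


def tokenize(text):
    """Lowercase the text and return the set of alphabetic words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))


def categorize(text, rules=RULES):
    """Return the sorted list of topics whose rules the text satisfies."""
    words = tokenize(text)
    return sorted(
        topic
        for topic, rule in rules.items()
        if words & rule["any"] and not (words & rule["none"])
    )
```

Calling `categorize("Studio earnings rose after the film premiere")` assigns the story to business only, because the hypothetical entertainment rule excludes anything mentioning earnings. Real rule sets are far larger, which is why they are usually designed or tuned by human experts.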
Some players are moving away from single techniques and toward combinations of approaches. For
example, Quiver, San Francisco, uses two different statistical methods to refine its results, and
Inxight, Palo Alto, CA, runs documents through linguistic analysis (capable of handling a dozen
languages) and then statistical algorithms. The hybrid champ may be Stratify, Mountain View, CA,
which employs five computational techniques to analyze content ("Bayesian probability" and "support
vector machine" statistical analysis, keywords, and source- and rules-based decisions) and feeds
the results into a proprietary combiner element to make the category assignment.
"Each of these technologies has different strengths and weaknesses," says Stratify chief
technology officer Ramana Venkata. "It is logical to expect that a classification technology that
leverages several techniques but knows how to arbitrate among these technologies is going to perform
better than any technique individually."
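Stratify's combiner is proprietary, so the following is only a sketch of one simple way to arbitrate among techniques: a weighted vote. The technique names, the weights and the `combine` function are assumptions, not a description of any product. Each technique proposes a category with a confidence score, and the category with the highest weighted total wins.

```python
from collections import defaultdict


def combine(predictions, weights):
    """Pick the category with the highest weighted score.

    predictions: {technique_name: (category, confidence between 0 and 1)}
    weights: {technique_name: relative trust placed in that technique}
    Assumes at least one prediction is supplied.
    """
    scores = defaultdict(float)
    for technique, (category, confidence) in predictions.items():
        # Techniques without an explicit weight count with weight 1.0.
        scores[category] += weights.get(technique, 1.0) * confidence
    # Sorting first means ties are broken alphabetically, so results are stable.
    return max(sorted(scores), key=lambda category: scores[category])
```

With predictions of `("finance", 0.6)` from a Bayesian classifier, `("finance", 0.7)` from a support vector machine and `("sports", 0.9)` from keyword rules, equal weights give finance; trusting the keyword rules three times as much tips the result to sports. Choosing the weights is where knowledge of each technique's strengths and weaknesses comes in.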
Leaving the technical-issues debate to the specialists, it's possible to list some general
questions you should ask about a categorization solution:
- How accurate and reliable are the category assignments? If the system constructs the taxonomy,
how well does it represent the topics you need?
- Is the combination of automation and human effort appropriate for your setting? Where is
automation invested: in constructing categories, writing rules or selecting documents for training
sets? Can you adjust the degree of automation?
- How flexible is the system? Can the categories be tweaked for better accuracy? Can the system
spot new kinds of items and adjust the taxonomy or alert a human to do so?
- Does the system integrate with, or map to, standard taxonomies? If your organization already
has some knowledge management technology in place, will it integrate with that?
- How does the system scale? Can it grow to accommodate a large volume of content and a large and
complex taxonomy? Does it need to?
- Does the system offer personalized responses for individual users? Does it integrate with other
applications? Is its output in a standard format?
The proper context for these questions is a clear understanding of the setting (business and
technological) for the taxonomy. What do you hope to accomplish? Is the main function to expose and
unify content for internal applications such as human resources or customer relationship management?
Or will the taxonomy application enable sales and marketing people to trawl the Web for business
intelligence? Is there a vertical-market angle, such as pharmaceutical or financial service, that
requires expertise on the vendor's part? Will the taxonomy serve a company portal with general users
and open Web connections?
Manual vs. Auto Classification
Intertwined with the debate about computational techniques is a more general question: How much
of the job of analyzing content and creating categories can be automated and where is human
intelligence required? Manual systems can reflect real human understanding of content and of the way
people might actually associate topics. The Internet portal Yahoo!, for example, employs an army of
categorizers to improve the accuracy of the company's search engine. This approach is
labor-intensive and costly, and doesn't scale well. In addition, human categorizers aren't
necessarily consistent in assigning categories.
Manual and automated approaches to building
taxonomies both present pros and cons. Companies have to strike a balance between cost and
accuracy.
Purely computational approaches, on the other hand, may not deliver the combination of accuracy
and precision that people need in their work. John Lehman, president and CEO of Sageware, Mountain
View, CA, says that automation advocates have "always hoped that there was a magic algorithm that
would stick this virtual electrode on your head and no matter what you said, it would know what you
meant as far as information retrieval was concerned."
But software by itself can't read minds, and it's not much better at reading text. Ian Hersey of
Inxight, Palo Alto, CA, points out that the best algorithms available (under optimal conditions,
with hundreds of training documents and narrow data sets) can manage 75 to 80 percent accuracy in
categorization. On a typical intranet, with its broad range of content, the best tools, with
training sets of 10 or 20 documents, are getting 50 to 80 percent accuracy, "which means half the
content is missing or miscategorized," Hersey says. "That's just not acceptable."
Automation alone wasn't enough for one Fortune 500 corporate portal administrator who wished to
remain anonymous. He needed to classify 500,000 documents from a number of different organizations
and wanted at least 90 percent accuracy. "You can't tell me that any technology, machine-trained or
not, can be accurate enough," he says. "You have to have technology that's smart enough to point out
what needs to be moved and what doesn't."
What this user wanted was "the best of both worlds" of manual and automated approaches. "People
are so focused on trying to make it easy for the general user or on making the product trainable
that they forget about the human interaction," he says. "The challenge is to find an automated
solution with human intervention built into it."
The portal administrator's answer was to add Quiver's QKS Classifier to his Inktomi search
engine. QKS is an "editorial-assisted algorithm," according to Quiver vice president of sales and
marketing Andrew Feit. The categorization engine screens documents and suggests topics, but then,
Feit explains, "That output flows into a workflow environment that allows humans to see the results
of the categorization engine, agree with them, override them and put things where they belong."
QKS Classifier can be tuned using topic-by-topic business rules that allow some topics to be
automatically published with no human intervention. The human information management can be
centralized or distributed, so it's possible for departments or individuals to have ownership and
control of topics in their areas of operation and expertise.
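The idea of topic-by-topic rules that decide when a suggestion can skip human review can be sketched in a few lines of Python. This is not Quiver's implementation; the topic names, the threshold values and the `route` function are hypothetical. A suggestion whose confidence meets its topic's threshold is published automatically, and everything else goes to an editor's queue.

```python
# Hypothetical per-topic thresholds. A threshold above 1.0 means the topic is
# never published without a human looking at it first.
AUTO_PUBLISH_THRESHOLD = {"press releases": 0.80, "legal": 1.01}

# Assumed fallback for topics that have no rule of their own.
DEFAULT_THRESHOLD = 0.95


def route(topic, confidence, thresholds=AUTO_PUBLISH_THRESHOLD):
    """Return "publish" or "review" for one suggested category assignment."""
    threshold = thresholds.get(topic, DEFAULT_THRESHOLD)
    return "publish" if confidence >= threshold else "review"
```

Here `route("press releases", 0.85)` returns `"publish"`, while `route("legal", 1.0)` returns `"review"` no matter how confident the engine is. Handing the threshold table to the department that owns each topic is one way to distribute control.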
How much human intervention is needed, and where should it be applied? Delphi's Reynolds says, "I
think that more companies are going to be looking to that whole-product approach. Rather than buying
an algorithm, we want an environment in which to work with the stuff. Algorithms are going to be
important underneath, but what we really need is a vehicle for the human taxonomy managers or portal
editors or subject matter experts to have input into the process."
Some systems combine human interaction with automation tools in order to
"train" the technology to categorize content appropriately.
The Payoff in Taxonomy
For technology that's generally buried in infrastructure, a categorization tool can do some
impressive tricks, even beyond making that shiny new company portal more tidy and responsive without
hiring a staff of digital librarians to organize its virtual shelves.
The benefit is obvious for commercial search engines and news organizations. Inxight's Hersey
says his company licenses its technology to a number of search engines, but its best direct market
has been content publishers such as news services. "They tend to have very large, dynamic taxonomies
since news organizations change topicality a lot," he explains.
For example, Factiva, the Dow Jones-Reuters content provider, maintains 1,500 topics. It chose
Inxight's software for the ability to handle multiple languages and automate classification. The
system's core job is applying metadata to incoming stories so that they are assigned to the right
topic categories and can then be delivered to the right customers. Even the training stage of
implementation yielded useful results, according to Chris Porter, Factiva's director of coding
systems.
"The system pointed out errors of application and helped us to see where we weren't as clear in
our coding concepts as we should be," Porter states.
The automated end of the system now works as intended: freeing the humans to make human
judgments. "We agreed on a target with Inxight that at least 45 percent of stories passing through
would need no manual attention," he added. "Tests are currently showing that it's running at between
60 and 80 percent."
The BBC turned to Smartlogik, London and San Francisco, for similar services. "We manage [the
BBC's] corporate taxonomy to provide [it] with news," says Smartlogik's vice president of marketing
Roger Frugia. "We manage the categories and built the rules, and as new information comes out from
the news feeds, it automatically gets categorized and sent to the desktops of the people doing
research. It's taxonomy and categorization."
Smartlogik's taxonomy and discovery products provide the automation behind NEON (news information
online), which replaced a manual, paper-based story-clipping and archiving service that previously
served the BBC's newsrooms. Several thousand researchers and journalists use NEON for fast access to
new and archived stories. The company estimates direct savings and productivity improvements
amounting to more than $40 million a year. And, stated BBC Information & Archive projects manager
Adam Lee, "NEON allows the BBC to get accurate stories on the air before [its] competitors."
For those not in the news business, the usual productivity benefits apply. According to Lehman of
Sageware, taxonomy technology can help companies generate profit or cut costs by sorting through
vast stores of information quickly without legions of researchers. He says Sageware did exactly that
for KPMG. The consulting firm wanted a system that would encompass its existing internal corporate
knowledge base as well as the thousands of information sources it licensed from all over the world.
KPMG's "real education process" emerged when it mapped the way it described its products and
solutions internally against external descriptions.
"It was pretty easy to categorize KPMG's marketing literature and engagement letters and
proposals and reports, because it was all in the lingo of KPMG," Lehman says. "But the outside world
didn't write anything that was already centered around KPMG's products and services. It took us some
time to convince [KPMG] that [it] needed to look at the business functions, problems and events that
would cause companies to look for the services that KPMG provides."
Perhaps the most mysterious and exciting promise of these solutions (especially for a
corporation's research branch) is the possibility of discovering something you didn't know was in
your information hoard. The statistical-linguistic algorithms are well suited to noticing new things
in their document collections.
For example, Pamela Crowley, director of public relations at Semio, explains that a given search
with the San Mateo, CA-based company's technology returns not only the hits with a reliability
rating but also lists of related categories into which a given document might also fit. "You actually
see the applicability of the data within other categories and what other categories are related to
your search," Crowley says.
Sweden's largest medical university, the Karolinska Institutet, reports that its 2,000
researchers have been able to pursue entirely new lines of research using a Semio-based Biomedical
Text Mining portal. The system not only returns searched-for documents but can also reveal
linkages among documents that hadn't been seen before.
"Researchers don't always know in advance what they are looking for or how to find it," stated
Karolinska Institutet librarian Catharina Rehn, M.D. "Once they are on a path, they need to know
that they are viewing all available information relevant to their subject. They can't afford to make
decisions based on only 80 percent of the data."
Some vendors of taxonomy technology are benefiting by using their own technology internally.
Autonomy's director of business development, Gary Bryan, has his company's Active Technology
application read over his metaphorical shoulder as he works. "When we respond to RFPs, we actually
have the technology read along with us," he says. "As I'm working, the system might inform me that
there's an RFP [in our files] that is already 85 percent similar to the one I'm working on. I didn't
know it was there; it told me it's there. I can grab it and look at it and start comparing sentence
by sentence and running it against my known base of documents and use it to get the job done faster,
instead of writing from scratch every time."
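How a system arrives at a figure like "85 percent similar" depends on the vendor, and Autonomy's method is not described here. One common textbook measure, offered only as an assumption-laden sketch, is the cosine similarity between the word-count vectors of two documents. The function names below are illustrative.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count how often each lowercase alphabetic word appears in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Return a score from 0.0 (no shared words) to 1.0 (same word mix)."""
    a, b = term_counts(text_a), term_counts(text_b)
    # Dot product over the words the two documents have in common.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(n * n for n in a.values())) * math.sqrt(
        sum(n * n for n in b.values())
    )
    # An empty document has no direction, so treat it as unrelated.
    return dot / norm if norm else 0.0
```

Multiplying the result by 100 gives a percentage. A score near 1.0 means the two documents use much the same words in much the same proportions, which is the kind of signal that could flag an earlier RFP response worth reusing.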
This technology may not be quite like asking HAL, the computer featured in "2001: A Space
Odyssey," to help with your homework, but it at least gives you a hint of futuristic technology.
Russell Letson (rletson@cloudnet.com) is a
freelance writer based in St. Cloud, MN.
The ABC's of Taxonomy & Classification
"Taxonomy" comes from a Greek word meaning "arrangement" or "order," and that's what it is: a way
of ordering or arranging a body of unstructured information so that we can make sense of it and find
individual items in it.
At its simplest, a taxonomy is a set of buckets or bins (topics, headings, categories) into
which items can be sorted. The buckets are chosen according to an organizing principle that
determines what goes where, which makes that content meaningful and usable for humans. The
organizing principle might be a logical structure akin to the biological taxonomies applied to
plants and animals or a set of conventional or arbitrary categories useful in a particular setting,
such as the division of news stories into business, sports and entertainment sections.
(Exercise: Imagine a news story that can fit into all three of these categories.)
Organizing a body of unstructured content requires construction of the taxonomy, analysis of the
content to determine the meaningful elements (words, phrases, clusters, concepts and so on) and a
way of connecting the content elements to the topic elements of the taxonomy.
Of course, people do this all the time. The comprehensive Library of Congress cataloging system
shows what can be accomplished by human efforts. Applying computer technology to the problem means
getting software to approximate what humans do easily when they read: to see connections, recognize
context, understand implications and in general derive "meaning."
"Approximate" is the crucial word here because none of these analytical engines really read the
target text. Instead, they draw patterns or statistical measurements from it and make inferences
about topicality based on the patterns or measurements. The degree of accuracy that can be expected
from an entirely automated categorizing application is one of the big questions users must consider
when implementing this technology.
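A small example shows what "inferring topicality from measurements" can look like. The sketch below is a bare-bones naive Bayes classifier, one of the statistical techniques named in the glossary, written in Python for illustration only. The function names and the sample training sets are assumptions, and it expects every topic to have at least one training document. It counts words in the training documents for each topic and then scores a new text by how likely its words are under each topic.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split the text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(training_sets):
    """Build a model from {topic: [document text, ...]}."""
    word_counts = {topic: Counter() for topic in training_sets}
    doc_counts = {topic: len(docs) for topic, docs in training_sets.items()}
    for topic, docs in training_sets.items():
        for doc in docs:
            word_counts[topic].update(tokenize(doc))
    vocabulary = set().union(*word_counts.values())
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the topic with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for topic, counts in word_counts.items():
        # Log prior: the share of training documents that belong to this topic.
        score = math.log(doc_counts[topic] / total_docs)
        total_words = sum(counts.values())
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                # Add-one smoothing avoids log(0) for words new to this topic.
                score += math.log(
                    (counts[word] + 1) / (total_words + len(vocabulary))
                )
        scores[topic] = score
    return max(scores, key=scores.get)
```

Trained on two short sports sentences and two short business sentences, the classifier files "the coach and the team" under sports without understanding a word of it. It has only noticed that "coach" and "team" turned up in the sports training set, which is exactly why accuracy depends so heavily on the size and quality of the training sets.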
Taxonomy & Classification Glossary
Bayesian analysis: widely used statistical technique for analyzing text; infers topicality from
patterns of words and phrases present in documents; a "probabilistic" method because it returns a
likelihood of a document belonging to a topic
Disambiguation: process of determining which of several possible meanings of a word or phrase is
correct
Document vector: numerical representation of a document's content, derived from statistical
analysis; comparing vectors measures how closely documents are related to each other
Granularity: degree of detail or discrimination in a taxonomy; number of categories or topics
available for documents to be assigned to
Manual tagging: humans reading and assigning documents to categories or attaching metadata
Ontology: a data construct that reflects the structure of a body of knowledge by including
categories, vocabulary and information about relationships; generally assembled by humans and then
used to analyze documents
Rule-based categorization: system of assigning documents to categories according to rules,
generally devised or optimized by humans
Statistical text analysis: techniques that make inferences about content by measuring frequency
and placement of words and phrases; particular techniques include naive Bayes (see Bayesian
analysis), support vector machines and K-nearest neighbor
Taxonomy: a structure of categories or topics to which documents can be assigned
Training set: a body of documents used to exemplify the documents belonging to a given
category