Intelligent Enterprise featuring Transform
December 2001

Taxonomies Put Content in Context

by Russell Letson

If this were the year 2001 as predicted by Sir Arthur C. Clarke and Stanley Kubrick in the movie "2001: A Space Odyssey," you wouldn't be reading an article about the issues surrounding computer-powered taxonomies. Instead, you would ask an all-knowing computer about taxonomies in plain English, and it would understand (or ask a few questions to clarify your context and intentions). Then the computer would sort through vast data libraries at blistering speed and provide the answer in a few seconds.

But the problem of designing systems that have a humanlike understanding of words, concepts and categories is far from solved. It's a complex and often dauntingly technical field that computer scientists, artificial intelligence researchers and linguists have been working on for decades. Now, businesses that need to quickly and reliably find useful information in the mountains of unstructured content held in corporate repositories (and made accessible through portals) are looking to taxonomy technology for help.

Hadley Reynolds, research director at the Delphi Group, Boston, calls it the "infoglut problem," and he sees simple search, particularly keyword search, as "much too crude a weapon" against it. "What the users of portals want when they go to a search box is answers," he says, "but they don't get answers. They get lists of documents that allow them to spend more time trying to refine what they get from those documents and figuring out whether it is actually relevant or not."

Resources

Applied Semantics
www.appliedsemantics.com

Autonomy
www.autonomy.com

Delphi Group
www.delphigroup.com

IDC
www.idc.com

Inxight Software
www.inxight.com

Quiver
www.quiver.com

Sageware
www.sageware.com

Semio
www.semio.com

Smartlogik
www.smartlogik.com

Stratify
www.stratify.com

Sue Feldman of IDC, Framingham, MA, puts it this way: "People have trouble finding information in invisible information spaces. ... You have a search box and no way of knowing what's behind it."

One way of structuring heaps of unorganized information is to implement a taxonomy, which serves as a navigational tool as well as an organizational tool. With a taxonomy, "people can automatically see what's there," explains Feldman, director of IDC's document and content technology program. "[They can] run their eyes over the top level of the taxonomy and see whether it's a place they want to search in or is unlikely to have anything that interests them."

As significant a role as a taxonomy plays, chances are you don't see the application itself when you look at a portal page or search engine screen; taxonomies tend to be part of the information infrastructure rather than the front end. But when you see a subject area divided into a set of topics, or when a search returns not a jumble of hits but an array of possibilities sorted into categories, it was a taxonomy engine of some sort (or human librarians) that generated the categories.

Algorithms and Approaches

Plenty of vendors offer categorization solutions, and they engage in the inevitable arguments about which technology is better. The taxonomy market, says Reynolds of Delphi, is "an algorithm-based industry, and most of the companies have taken a very technology-centric approach, which is to try to commercialize a particular algorithm or set of algorithms."

The algorithmic and computational approaches break into the following families:

Rule-based: Rule sets contain keywords and logical relationships according to which target documents are automatically analyzed and sorted. Rules are generally designed by human experts but may be generated by analyzing a sample body of documents. Verity offers technology using this approach.

Statistical analysis and pattern matching: Algorithms analyze texts for the presence, frequency and relative location of terms. The program may identify patterns that emerge from the target documents themselves or be taught what to look for from training sets of documents. Autonomy, Quiver, Semio and Smartlogik are among the vendors following this path.

Linguistic analysis: Algorithms identify the important linguistic and semantic elements (understanding phrases or the parts of speech) in the target documents. This identification can be used in conjunction with statistical analysis. Inxight and Semio have algorithms taking this approach.

Ontologies: Human-developed data structures that combine elements of taxonomies and thesauri and include information about relationships among terms. Target documents are analyzed against these structures. Applied Semantics helps companies develop ontologies.
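To make the rule-based family concrete, here is a minimal sketch in Python. It is not any vendor's rule language: the rule format (terms that must all appear, terms of which any one suffices, and terms that veto a topic), the topic names and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical rule set. Each topic lists terms that must all appear ("all"),
# terms of which at least one must appear ("any"), and terms that veto the
# topic if present ("none").
RULES = {
    "mergers": {"all": ["acquisition"], "any": ["shares", "merger", "bid"], "none": []},
    "sports": {"all": [], "any": ["match", "league", "tournament"], "none": ["merger"]},
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def matches(rule, words):
    """Apply the AND / OR / NOT conditions of one rule to a set of words."""
    has_all = all(term in words for term in rule["all"])
    has_any = not rule["any"] or any(term in words for term in rule["any"])
    has_none = not any(term in words for term in rule["none"])
    return has_all and has_any and has_none


def categorize(text, rules=RULES):
    """Return the sorted list of topics whose rules the text satisfies."""
    words = tokenize(text)
    return sorted(topic for topic, rule in rules.items() if matches(rule, words))


print(categorize("The acquisition bid valued the shares at a premium."))
print(categorize("The league postponed the match."))
```

A document can satisfy several rules at once or none at all, which is why the function returns a list rather than a single topic.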

Some players are moving away from single techniques and toward combinations of approaches. For example, Quiver, San Francisco, uses two different statistical methods to refine its results, and Inxight, Palo Alto, CA, runs documents through linguistic analysis (capable of handling a dozen languages) and then statistical algorithms. The hybrid champ may be Stratify, Mountain View, CA, which employs five computational techniques to analyze content—"Bayesian probability" and "support vector machine" statistical analysis, keywords, and source- and rules-based decisions—and feeds the results into a proprietary combiner element to make the category assignment.

"Each of these technologies has different strengths and weaknesses," says Stratify chief technology officer Ramana Venkata. "It is logical to expect that a classification technology that leverages several techniques but knows how to arbitrate among these technologies is going to perform better than any technique individually."
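Stratify's combiner is proprietary, so the following is only a sketch of one simple way to arbitrate among several techniques: a weighted sum of the scores each technique gives to each category. The technique labels, scores and weights are invented for illustration and do not describe Stratify's product.

```python
from collections import defaultdict


def combine(predictions, weights):
    """Arbitrate among techniques with a weighted sum of per-category scores.

    predictions: {technique_name: {category: score between 0 and 1}}
    weights:     {technique_name: trust placed in that technique}
    Returns (best_category, combined_scores), or (None, {}) if nothing can be scored.
    """
    totals = defaultdict(float)
    for technique, scores in predictions.items():
        weight = weights.get(technique, 0.0)
        for category, score in scores.items():
            totals[category] += weight * score
    total_weight = sum(weights.get(technique, 0.0) for technique in predictions)
    if not totals or total_weight == 0:
        return None, {}
    # Normalize so the combined scores stay on the same 0-to-1 scale.
    combined = {category: total / total_weight for category, total in totals.items()}
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical outputs from four techniques for one document.
predictions = {
    "bayesian": {"finance": 0.7, "technology": 0.3},
    "svm": {"finance": 0.4, "technology": 0.6},
    "keywords": {"finance": 1.0},
    "rules": {"technology": 1.0},
}
weights = {"bayesian": 1.0, "svm": 1.0, "keywords": 0.5, "rules": 2.0}

best, combined = combine(predictions, weights)
print(best, combined)
```

In this sketch the weights are fixed by hand. A real combiner would more plausibly learn how much to trust each technique from documents whose correct categories are already known.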

Leaving the technical-issues debate to the specialists, it's possible to list some general questions you should ask about a categorization solution:

  • How accurate and reliable are the category assignments? If the system constructs the taxonomy, how well does it represent the topics you need?
  • Is the combination of automation and human effort appropriate for your setting? Where is automation invested: in constructing categories, writing rules, selecting documents for training sets? Can you adjust the degree of automation?
  • How flexible is the system? Can the categories be tweaked for better accuracy? Can the system spot new kinds of items and adjust the taxonomy or alert a human to do so?
  • Does the system integrate with, or map to, standard taxonomies? If your organization already has some knowledge management technology in place, will it integrate with that?
  • How does the system scale? Can it grow to accommodate a large volume of content and a large and complex taxonomy? Does it need to?
  • Does the system offer personalized responses for individual users? Does it integrate with other applications? Is its output in a standard format?

The proper context for these questions is a clear understanding of the setting, both business and technological, for the taxonomy. What do you hope to accomplish? Is the main function to expose and unify content for internal applications such as human resources or customer relationship management? Or will the taxonomy application enable sales and marketing people to trawl the Web for business intelligence? Is there a vertical-market angle, such as pharmaceuticals or financial services, that requires expertise on the vendor's part? Will the taxonomy serve a company portal with general users and open Web connections?

Manual vs. Auto Classification

Intertwined with the debate about computational techniques is a more general question: How much of the job of analyzing content and creating categories can be automated and where is human intelligence required? Manual systems can reflect real human understanding of content and of the way people might actually associate topics. The Internet portal Yahoo!, for example, employs an army of categorizers to improve the accuracy of the company's search engine. This approach is labor-intensive and costly, and doesn't scale well. In addition, human categorizers aren't necessarily consistent in assigning categories.

Figure 1
Manual and automated approaches to building taxonomies both present pros and cons. Companies have to strike a balance between cost and accuracy.

Purely computational approaches, on the other hand, may not deliver the combination of accuracy and precision that people need in their work. John Lehman, president and CEO of Sageware, Mountain View, CA, says that automation advocates have "always hoped that there was a magic algorithm that would stick this virtual electrode on your head and no matter what you said, it would know what you meant as far as information retrieval was concerned."

But software by itself can't read minds, and it's not much better at reading text. Ian Hersey of Inxight, Palo Alto, CA, points out that the best algorithms available, under optimal conditions (hundreds of training documents and narrow data sets), can manage 75 to 80 percent accuracy in categorization. On a typical intranet, with its broad range of content, the best tools, working from training sets of 10 or 20 documents, are getting 50 to 80 percent accuracy, "which means half the content is missing or miscategorized," Hersey says. "That's just not acceptable."
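Accuracy figures like these are typically obtained by comparing a tool's category assignments with those made by human experts on the same documents. The sketch below shows that arithmetic under the simplest assumption, one category per document. The labels are made up and the function name is hypothetical.

```python
def accuracy(predicted, expected):
    """Fraction of documents whose predicted category matches the human-assigned one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("need at least one document to measure accuracy")
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


# Invented example: eight documents labeled by a human and by a categorizer.
human_labels = ["hr", "sales", "sales", "legal", "hr", "legal", "sales", "hr"]
machine_labels = ["hr", "sales", "legal", "legal", "hr", "sales", "sales", "legal"]

score = accuracy(machine_labels, human_labels)
print(f"{score:.1%} of documents categorized correctly")
```

A single number hides which topics the tool gets wrong, so a fuller evaluation would also break the result down by category.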

Automation alone wasn't enough for one Fortune 500 corporate portal administrator who wished to remain anonymous. He needed to classify 500,000 documents from a number of different organizations and wanted at least 90 percent accuracy. "You can't tell me that any technology, machine-trained or not, can be accurate enough," he says. "You have to have technology that's smart enough to point out what needs to be moved and what doesn't."

What this user wanted was "the best of both worlds" of manual and automated approaches. "People are so focused on trying to make it easy for the general user or on making the product trainable that they forget about the human interaction," he says. "The challenge is to find an automated solution with human intervention built into it."

The portal administrator's answer was to add Quiver's QKS Classifier to his Inktomi search engine. QKS is an "editorial-assisted algorithm" according to Quiver vice president of sales and marketing Andrew Feit. The categorization engine screens documents and suggests topics, but then, Feit explains, "That output flows into a workflow environment that allows humans to see the results of the categorization engine, agree with them, override them and put things where they belong."

QKS Classifier can be tuned using topic-by-topic business rules that allow some topics to be automatically published with no human intervention. The human information management can be centralized or distributed, so it's possible for departments or individuals to have ownership and control of topics in their areas of operation and expertise.
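The general pattern of automatic suggestion, topic-by-topic publishing rules and human override can be sketched as follows. This is not Quiver's API or data model: the confidence scores, thresholds, topic names and function names are all assumptions chosen to illustrate the workflow.

```python
# Hypothetical per-topic thresholds: a suggestion at or above the threshold is
# published automatically; anything below goes to a human editor's queue.
AUTO_PUBLISH_THRESHOLD = {
    "press releases": 0.80,
    "legal": 1.01,  # above any possible confidence, so always reviewed by a human
}
DEFAULT_THRESHOLD = 0.95


def route(suggestions, thresholds=AUTO_PUBLISH_THRESHOLD, default=DEFAULT_THRESHOLD):
    """Split (doc_id, topic, confidence) suggestions into published and review queues."""
    published, review = [], []
    for doc_id, topic, confidence in suggestions:
        if confidence >= thresholds.get(topic, default):
            published.append((doc_id, topic))
        else:
            review.append((doc_id, topic, confidence))
    return published, review


def apply_review(review_item, editor_topic=None):
    """An editor either accepts the suggested topic or overrides it with another."""
    doc_id, suggested_topic, _confidence = review_item
    return (doc_id, editor_topic or suggested_topic)


# Invented output from a categorization engine.
suggestions = [
    ("doc-1", "press releases", 0.91),
    ("doc-2", "legal", 0.99),
    ("doc-3", "benefits", 0.60),
]

published, review = route(suggestions)
print("published automatically:", published)
print("awaiting an editor:", review)
print("after override:", apply_review(review[1], editor_topic="payroll"))
```

Raising or lowering a topic's threshold is one way to adjust the balance between automation and human effort for that topic alone.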

How much human intervention is needed, and where should it be applied? Delphi's Reynolds says, "I think that more companies are going to be looking to that whole-product approach. Rather than buying an algorithm, we want an environment in which to work with the stuff. Algorithms are going to be important underneath, but what we really need is a vehicle for the human taxonomy managers or portal editors or subject matter experts to have input into the process."

Figure 2
Some systems combine human interaction with automation tools in order to "train" the technology to categorize content appropriately.

The Payoff in Taxonomy

For technology that's generally buried in infrastructure, a categorization tool can do some impressive tricks, even beyond making that shiny new company portal more tidy and responsive without hiring a staff of digital librarians to organize its virtual shelves.

The benefit is obvious for commercial search engines and news organizations. Inxight's Hersey says his company licenses its technology to a number of search engines, but its best direct market has been content publishers such as news services. "They tend to have very large, dynamic taxonomies since news organizations change topicality a lot," he explains.

For example, Factiva, the Dow Jones-Reuters content provider, maintains 1,500 topics. It chose Inxight's software for the ability to handle multiple languages and automate classification. The system's core job is applying metadata to incoming stories so that they are assigned to the right topic categories and can then be delivered to the right customers. Even the training stage of implementation yielded useful results, according to Chris Porter, Factiva's director of coding systems.

"The system pointed out errors of application and helped us to see where we weren't as clear in our coding concepts as we should be," Porter states.

The automated end of the system now works as intended: freeing the humans to make human judgments. "We agreed on a target with Inxight that at least 45 percent of stories passing through would need no manual attention," he adds. "Tests are currently showing that it's running at between 60 and 80 percent."

The BBC turned to Smartlogik, London and San Francisco, for similar services. "We manage [the BBC's] corporate taxonomy to provide [it] with news," says Smartlogik's vice president of marketing Roger Frugia. "We manage the categories and built the rules, and as new information comes out from the news feeds, it automatically gets categorized and sent to the desktops of the people doing research. It's taxonomy and categorization."

Smartlogik's taxonomy and discovery products provide the automation behind NEON (news information online), which replaced a manual, paper-based story-clipping and archiving service that previously served the BBC's newsrooms. Several thousand researchers and journalists use NEON for fast access to new and archived stories. The company estimates direct savings and productivity improvements amounting to more than $40 million a year. And, stated BBC Information & Archive projects manager Adam Lee, "NEON allows the BBC to get accurate stories on the air before [its] competitors."

For those not in the news business, the usual productivity benefits apply. According to Lehman of Sageware, taxonomy technology can help companies generate profit or cut costs by sorting through vast stores of information quickly without legions of researchers. He says Sageware did exactly that for KPMG. The consulting firm wanted a system that would encompass its existing internal corporate knowledge base as well as the thousands of information sources it licensed from all over the world. KPMG's "real education process" emerged when it mapped the way it described its products and solutions internally against external descriptions.

"It was pretty easy to categorize KPMG's marketing literature and engagement letters and proposals and reports, because it was all in the lingo of KPMG," Lehman says. "But the outside world didn't write anything that was already centered around KPMG's products and services. It took us some time to convince [KPMG] that [it] needed to look at the business functions, problems and events that would cause companies to look for the services that KPMG provides."

Perhaps the most mysterious and exciting promise of these solutions—especially for a corporation's research branch—is the possibility of discovering something you didn't know was in your information hoard. The statistical-linguistic algorithms are well suited to noticing new things in their document collections.

For example, Pamela Crowley, director of public relations at Semio, explains that a search with the San Mateo, CA-based company's technology returns not only hits with a reliability rating but also lists of related categories into which a given document might also fit. "You actually see the applicability of the data within other categories and what other categories are related to your search," Crowley says.

Sweden's largest medical university, the Karolinska Institutet, reports that its 2,000 researchers have been able to pursue entirely new lines of research using a Semio-based Biomedical Text Mining portal. The system not only returns searched-for documents but it can also reveal linkages among documents that hadn't been seen before.

"Researchers don't always know in advance what they are looking for or how to find it," stated Karolinska Institutet librarian Catharina Rehn, M.D. "Once they are on a path, they need to know that they are viewing all available information relevant to their subject. They can't afford to make decisions based on only 80 percent of the data."

Some vendors of taxonomy technology are benefiting by using their own technology internally. Autonomy's director of business development, Gary Bryan, has his company's Active Technology application read over his metaphorical shoulder as he works. "When we respond to RFPs, we actually have the technology read along with us," he says. "As I'm working, the system might inform me that there's an RFP [in our files] that is already 85 percent similar to the one I'm working on. I didn't know it was there; it told me it's there. I can grab it and look at it and start comparing sentence by sentence and running it against my known base of documents and use it to get the job done faster, instead of writing from scratch every time."
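How a system arrives at a figure such as "85 percent similar" depends on the vendor, and Autonomy's method is not described here. One common, simple way to score the similarity of two texts is the cosine of the angle between their word-count vectors, sketched below with invented documents and hypothetical function names.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 = unrelated, 1 = same mix of words)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(draft, archive):
    """Return (name, similarity) for the archived document closest to the draft."""
    draft_vector = term_counts(draft)
    scored = [
        (name, cosine_similarity(draft_vector, term_counts(text)))
        for name, text in archive.items()
    ]
    return max(scored, key=lambda pair: pair[1]) if scored else None


# Invented archive of earlier RFP responses.
archive = {
    "rfp-a": "portal search and taxonomy software for a bank",
    "rfp-b": "catering services for the annual company picnic",
}
draft = "search and taxonomy software for an insurance portal"

name, similarity = most_similar(draft, archive)
print(f"{name} is {similarity:.0%} similar to the draft")
```

Raw word counts treat common words such as "for" and "and" the same as distinctive ones. Production systems usually weight terms so that rarer words count for more.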

This technology may not be quite like asking HAL, the computer featured in "2001: A Space Odyssey," to help with your homework, but it at least gives you a hint of futuristic technology.

Russell Letson (rletson@cloudnet.com) is a freelance writer based in St. Cloud, MN.


The ABC's of Taxonomy & Classification

"Taxonomy" comes from a Greek word meaning "arrangement" or "order," and that's what it is: a way of ordering or arranging a body of unstructured information so that we can make sense of it and find individual items in it.

At its simplest, a taxonomy is a set of buckets or bins (topics, headings, categories) into which items can be sorted. The buckets are chosen according to an organizing principle that determines what goes where, which makes that content meaningful and usable for humans. The organizing principle might be a logical structure akin to the biological taxonomies applied to plants and animals, or a set of conventional or arbitrary categories useful in a particular setting, such as the division of news stories into business, sports and entertainment sections. (Exercise: Imagine a news story that can fit into all three of these categories.)

Organizing a body of unstructured content requires construction of the taxonomy, analysis of the content to determine the meaningful elements (words, phrases, clusters, concepts and so on) and a way of connecting the content elements to the topic elements of the taxonomy.
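As a data structure, a taxonomy of this kind can be pictured as a tree of named buckets, with each document filed under one or more of them. The sketch below is only an illustration: the class, the function and the sample stories are invented, and it lets a single story sit in several categories at once.

```python
class Topic:
    """One bucket in the taxonomy; it may hold documents and narrower subtopics."""

    def __init__(self, name):
        self.name = name
        self.children = {}
        self.documents = []

    def subtopic(self, name):
        """Return the named subtopic, creating it on first use."""
        if name not in self.children:
            self.children[name] = Topic(name)
        return self.children[name]

    def count(self):
        """Number of filings here and below (a document filed under several
        topics is counted once per topic)."""
        return len(self.documents) + sum(child.count() for child in self.children.values())


def file_document(root, doc_id, paths):
    """File one document under every topic path it belongs to."""
    for path in paths:
        node = root
        for name in path:
            node = node.subtopic(name)
        node.documents.append(doc_id)


root = Topic("news")
# A story about a film star buying a football club fits three sections at once.
file_document(root, "story-1", [("business",), ("sports", "football"), ("entertainment",)])
file_document(root, "story-2", [("sports", "tennis")])

top_level = sorted(root.children)
print(top_level)
print(root.children["sports"].count())
```

Listing the top level of the tree is the programmatic counterpart of running your eyes over a portal's main categories before deciding where to search.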

Of course, people do this all the time. The comprehensive Library of Congress cataloging system shows what can be accomplished by human efforts. Applying computer technology to the problem means getting software to approximate what humans do easily when they read: to see connections, recognize context, understand implications and in general derive "meaning."

"Approximate" is the crucial word here because none of these analytical engines really read the target text. Instead, they draw patterns or statistical measurements from it and make inferences about topicality based on the patterns or measurements. The degree of accuracy that can be expected from an entirely automated categorizing application is one of the big questions users must consider when implementing this technology.
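To show what "statistical measurements" and "inferences about topicality" can mean in practice, here is a minimal sketch of one well-known technique, a naive Bayes classifier with add-one smoothing. It is a teaching example rather than any vendor's engine, and the class name, training sentences and topics are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Split a text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> occurrences
        self.doc_counts = Counter()              # topic -> number of training documents
        self.vocabulary = set()

    def train(self, topic, text):
        """Add one training document to the given topic."""
        words = tokens(text)
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocabulary.update(words)

    def probabilities(self, text):
        """Return {topic: likelihood that the text belongs to it}, summing to 1."""
        if not self.doc_counts:
            raise ValueError("train the classifier before using it")
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        log_scores = {}
        for topic, doc_count in self.doc_counts.items():
            total_words = sum(self.word_counts[topic].values())
            score = math.log(doc_count / total_docs)  # prior from the training-set mix
            for word in tokens(text):
                # Add-one smoothing keeps unseen words from zeroing out a topic.
                count = self.word_counts[topic][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            log_scores[topic] = score
        # Convert log scores to normalized probabilities in a numerically safe way.
        top = max(log_scores.values())
        weights = {topic: math.exp(score - top) for topic, score in log_scores.items()}
        total = sum(weights.values())
        return {topic: weight / total for topic, weight in weights.items()}

    def classify(self, text):
        """Return the most likely topic for the text."""
        probs = self.probabilities(text)
        return max(probs, key=probs.get)


model = NaiveBayes()
model.train("sports", "the team won the match in the final minute")
model.train("sports", "the striker scored in the league match")
model.train("business", "the company reported quarterly profit")
model.train("business", "shares rose after the merger bid")

print(model.classify("the league match ended in a draw"))
print(model.probabilities("shares rose after the profit report"))
```

The classifier never reads the text in any human sense. It only counts words, which is why the size and quality of the training set matter so much to the accuracy of the result.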


Taxonomy & Classification Glossary

Bayesian analysis: widely used statistical technique for analyzing text; infers topicality from patterns of words and phrases present in documents; a "probabilistic" method because it returns a likelihood of a document belonging to a topic

Disambiguation: process of determining which of several possible meanings of a word or phrase is the correct one

Document vector: numerical representation of a document, derived from statistical analysis of its content; comparing vectors shows how closely documents are related to each other

Granularity: degree of detail or discrimination in a taxonomy; number of categories or topics available for documents to be assigned to

Manual tagging: humans reading and assigning documents to categories or attaching metadata

Ontology: a data construct that reflects the structure of a body of knowledge by including categories, vocabulary and information about relationships; generally assembled by humans and then used to analyze documents

Rule-based categorization: system of assigning documents to categories according to rules, generally devised or optimized by humans

Statistical text analysis: techniques that make inferences about content by measuring frequency and placement of words and phrases; particular techniques include naive Bayes (see Bayesian analysis), support vector machines and K-nearest neighbor

Taxonomy: a structure of categories or topics to which documents can be assigned

Training set: a body of documents used to exemplify the documents belonging to a given category





A Publication of the Network Computing Enterprise Architecture Group
Brought to you by CMP Media LLC, Copyright © 2005