September 2003
Putting it Together: Taxonomy, Classification & Search
by Jeff Morris
Classification: Manual or Automatic?
Classifying all the information you have into a taxonomy is a potentially painful but necessary task. For large volumes of information, autoclassification is practically a necessity, though not everyone is willing to rely solely on technology.
"The whole field of autoindexing and classification has made great strides," observes West's Dabney, "but the gold standard is still to have a person look at each document, decide what it's about and assign subjects. It's the volume of information that makes automatic classification necessary, but words have some unique meanings in a legal context, and we don't want to rely on a system that doesn't have a way of dealing with our intellectual domain."
West uses classification techniques that were developed internally, but Dabney says the company isn't doing anything mysterious: "Our system works. You have to give it the category initially, then give it a large training collection. This [works] because we have such a huge library of documents that have been classified manually."
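The approach Dabney describes (define the categories, then supply a large collection of manually classified documents) is the general pattern of supervised text classification. The sketch below illustrates that pattern with a small naive Bayes classifier; it is not West's system, whose internals the article does not describe. The category names, the four training sentences and the function names `train` and `classify` are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the category with the highest naive Bayes log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, float("-inf")
    for category in doc_counts:
        # Prior: share of training documents filed under this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing so an unseen word does not zero the score.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best

# Hypothetical, hand-labeled training collection (assumption, not real data).
training = [
    ("the tenant breached the lease agreement", "contracts"),
    ("the parties signed a purchase agreement", "contracts"),
    ("the defendant was charged with burglary", "criminal"),
    ("the jury convicted the defendant of theft", "criminal"),
]
model = train(training)
print(classify("defendant charged with theft", model))
print(classify("lease agreement signed", model))
```

With only four training sentences the result is a toy, which is the point Dabney makes: the method depends on having a very large body of documents that people have already classified.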
Because words may have unique meanings in different contexts, classification is needed to narrow a search. Feldman uses the example "bank," which could mean a bank of fire, a riverbank, bank as a verb, or, of course, a financial institution.
"As soon as I say my domain is financial information, I know 'bank' is not going to mean 'river' or 'fire'," Feldman explains. "Or think of Ford, Ford or ford: the car, a person or a river. By looking for the other words grouped around it, classification stops you from getting a huge dump of documents just because they have the words you've searched for."
Feldman contends that manual classification is not necessarily superior to automatic classification technology. "We know from studies done back in the 1950s that the consistency from one indexer to another is not great; people are not consistent," she says. There are many ways to classify, each with its own strengths and weaknesses. Choosing between automated and manual methods depends on variables such as the need for accuracy versus the need for speed, the availability of labor and whether you have a broad and shallow collection or a deep and narrow one.
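Consistency between two indexers can be put into numbers. The sketch below computes raw percent agreement and Cohen's kappa, a widely used statistic that corrects agreement for chance. Kappa is offered here as one reasonable modern measure, not as the measure used in the studies Feldman mentions, and the two lists of category assignments are invented for illustration.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of documents given the same category by both indexers."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for chance; 1.0 is perfect, 0.0 is chance level.

    Undefined (division by zero) if both indexers use a single category
    for every document.
    """
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical assignments for the same six documents (assumption).
indexer_a = ["contracts", "criminal", "contracts", "tax", "criminal", "tax"]
indexer_b = ["contracts", "criminal", "tax", "tax", "criminal", "contracts"]

print(round(percent_agreement(indexer_a, indexer_b), 3))
print(round(cohens_kappa(indexer_a, indexer_b), 3))
```

In this made-up case the indexers agree on four of six documents, but once chance agreement is removed the kappa is 0.5, which shows why raw agreement can flatter a manual process.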
In Dabney's opinion, automatic classification will always have some limitations. "People have been building separate little [classification] algorithms for the past 20 years or so ... but there's no magic bullet algorithm," he says. "A general algorithm doesn't work. You can't really solve all the world's search problems at once."
Another Way to Look at Search
While many "traditional" search interfaces now display results along with taxonomic categories, some vendors of taxonomy, classification and search technologies are adding visual components to their products. Woods of Ovum says leading examples include using visualization for multidimensional taxonomy navigation and using graphical user interfaces (GUIs) for understanding complex relationships across information sources.
"Having a GUI makes categorization easier," says Tim Bray, a coinventor of XML and founder of Vancouver-based Antarctica Systems (www.antarctica.net), a developer of data visualization technology. "When you're looking for your own data on your desktop, you typically don't type in a search string; you know where things are and you click on folders. What we're doing [with our Visual Net software] is providing a GUI to make it easier to find shared information."
Bray says that emphasizing the GUI is important because "the quality of search technology has not improved since the 1970s; the basic algorithms have not done much better. The classic example is Google: it does well by applying metadata and using it to supplement brute force full-text search."
Bray contends Google's approach won't work well in an enterprise environment. "The only answers are to generate metadata and to have a better user interface; the two are synergistic, one doesn't work so well without the other."
Other vendors adding graphical navigation features to search include iPhrase, which eases navigation with the taxonomic folders or categories integral to the One Step architecture. San Mateo, CA-based Inxight provides a Collection Explorer as part of its SmartDiscovery taxonomy management, categorization and guided retrieval environment. Collection Explorer incorporates a conventional textual search field, but it also offers a Star Tree Taxonomy View (see Figure 1 and Figure 2) that gives users a quick overview of all categories. Dragging the desired category to the center of the Star Tree map reveals deeper subcategories and branches.
Figure 1
Figure 2
Yet another vendor combining taxonomy, classification and visually aided search is Vienna, VA-based Convera. The company's RetrievalWare software relies on semantic searching that narrows results to contextually relevant information, but it also applies a classification engine that organizes search results dynamically. Users can further refine the search by applying an additional axis of classification and visualizing the results in tabular form.
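Refining results along a second axis of classification and viewing them in tabular form can be pictured as a cross-tabulation of the result set. The sketch below is a generic illustration of that idea and makes no claim about how RetrievalWare is implemented; the result records, the axis names `topic` and `doc_type`, and the function `cross_tab` are all hypothetical.

```python
from collections import Counter

# Hypothetical search results, each already classified on two axes.
results = [
    {"title": "Q3 lending report", "topic": "banking", "doc_type": "report"},
    {"title": "Branch closure notice", "topic": "banking", "doc_type": "memo"},
    {"title": "Mortgage risk review", "topic": "banking", "doc_type": "report"},
    {"title": "Claims policy update", "topic": "insurance", "doc_type": "memo"},
]

def cross_tab(results, row_axis, col_axis):
    """Count results for every combination of two classification axes."""
    counts = Counter((r[row_axis], r[col_axis]) for r in results)
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    # A Counter returns 0 for combinations that never occur.
    table = [[counts[(row, col)] for col in cols] for row in rows]
    return rows, cols, table

rows, cols, table = cross_tab(results, "topic", "doc_type")
print("".ljust(12) + "".join(c.ljust(8) for c in cols))
for row, cells in zip(rows, table):
    print(row.ljust(12) + "".join(str(n).ljust(8) for n in cells))
```

Each cell shows how many results fall under a given pair of categories, so a user can see at a glance where the relevant documents cluster before drilling down.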