The rise of the Internet and the World Wide Web is having a dramatic effect on business, technology and users. The most interesting thing about this phenomenon is the way users are reinventing some of yesterday's solutions -- and applying them to tomorrow's problems.
Haven't we been here before? The focus on server-dominated computing (now they're called "Web servers"), dumb terminals (a.k.a. "network PCs") and timesharing (e.g. Notes Domino) brings to my mind the Peter Allen lyric, "Everything old is new again." Especially when it comes to text retrieval. Text retrieval started about two decades ago, but it wasn't until the early '90s that it achieved widespread acceptance. Until recently, text retrieval was seen as a supporting technology, embedded in other applications -- most notably document management. But the Web/intranet has changed all that.
The Web is the ideal application for text retrieval -- a virtual, limitless library, with no real way to catalog its contents. This has shifted the spotlight back to text retrieval. Hyperlinks are one way to find information, but they're created by the author of the site and subject to his or her biases. Text retrieval is a powerful way to supplement the retrieval power of the hyperlinks.
Text retrieval lets users quickly locate Web pages containing information they need. But here's the rub: When this powerful technology is used with the volumes of text on the Web, the result is overwhelming. While tools like AltaVista may seem like the answer, they create problems of their own. Query results such as "3652 documents found" make it difficult to find the most useful and relevant sites.
The challenge of searching and retrieving documents from the Web's voluminous library focuses attention on retrieval speed and breadth. Users often believe the value of a query engine is its ability to find as many documents as possible, from as many sites as possible, in the shortest period of time.
While these challenges are formidable, they can't be met to the exclusion of searching precision. We can no more make sense of "3652 documents found" than we can of the entire Web/intranet site. People have largely focused on the processing power of the search engine. That focus must broaden to include the engine's ability to work out intelligently what you're looking for and provide accurate results.
The evaluation of search engines should begin with an inward focus. Users' evaluations must benchmark not only the engines' processing power but also their approaches to intelligent searching. If users have a keen understanding of the subject matter, are familiar with the subject's lexicon and are looking for very precise results, finite-based query tools are suitable. These tools retrieve exclusively on exact words: documents containing the words specified in the query are identified and retrieved.
A more complex search model is investigative research. Here, the research process is more interactive and dynamic. Users have some idea of what they're looking for, but they're not intimately familiar with the subject or with the availability of relevant information. In these instances, the research process can't be compressed simply by accelerating the search of the document collection.
Tools need to help the user interrogate the document collection in an interactive manner. They must help the user create more intelligent searches. Synonym files, thesaurus listings and proximity searching once again find themselves at the forefront of text search technologies -- and at the top of the list of what to look for when evaluating them.
These tools can broaden a user's search, turning a single insight into a variety of ideas and turning concepts into words and phrases. Retrieval becomes more conceptual. Tool evaluators should determine the exact manner in which intelligence is applied to a query. The various approaches available may appear similar at first, but the way they process logic accounts for the different results you get when you search for the same information with different engines (AltaVista versus OpenText, for example).
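To make synonym files and proximity searching concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `SYNONYMS` table, the `near` function, the five-word window and the two sample documents are hypothetical and do not describe how any particular engine works.

```python
import re

# Hypothetical synonym file: each key maps to terms treated as equivalent.
SYNONYMS = {"car": {"automobile", "vehicle"}}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def expand(term):
    """Return the term plus any synonyms listed for it."""
    return {term} | SYNONYMS.get(term, set())

def near(text, term_a, term_b, window=5):
    """True if any form of term_a occurs within `window` words of term_b."""
    tokens = tokenize(text)
    forms_a, forms_b = expand(term_a), expand(term_b)
    pos_a = [i for i, t in enumerate(tokens) if t in forms_a]
    pos_b = [i for i, t in enumerate(tokens) if t in forms_b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

docs = {
    "a": "The automobile dealer offered a loan on the spot.",
    "b": "A car was parked outside. Much later, after a long and "
         "tedious meeting, the bank approved the loan.",
}

# Query: "car" near "loan". Document "a" matches through the synonym
# "automobile"; document "b" contains both words, but too far apart.
hits = [doc_id for doc_id, text in docs.items() if near(text, "car", "loan")]
print(hits)
```

The synonym table broadens the query and the proximity window narrows it. That pairing is why an exact-word engine and a more "intelligent" one can return different documents for the same request.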
Also consider the lucky search, where the user has no direct interest in either the amount of information available or its relevance to a specific concept. Instead, it's the linkages of concepts and documents that lead to user discovery. This is Web-surfing. While many consider this environment the exclusive domain of Web/intranet hyperlinks, consider tools like rule-of-thumb association and document clustering. Each of these tools can dynamically determine relationships between documents based on a variety of factors -- and can do so independently of the author's subjectivity.
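One simple way to picture links that are computed from content rather than authored by hand is word overlap. The Python sketch below relates documents by Jaccard similarity, a stand-in chosen for illustration. The `related` function, the 0.3 threshold and the sample documents are all assumptions; commercial association and clustering tools use more elaborate measures.

```python
import re

def words(text):
    """Return the set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Shared words divided by total distinct words (0.0 if both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def related(docs, threshold=0.3):
    """For each document, list the others whose word overlap meets the threshold."""
    sets = {doc_id: words(text) for doc_id, text in docs.items()}
    return {
        doc_id: sorted(
            other for other in sets
            if other != doc_id and jaccard(sets[doc_id], sets[other]) >= threshold
        )
        for doc_id in sets
    }

docs = {
    "a": "ranking search results by relevance",
    "b": "relevance ranking of search results",
    "c": "managing scanned paper documents",
}

# "a" and "b" share four of six distinct words, so they are linked;
# "c" shares nothing with either and stays on its own.
links = related(docs)
print(links)
```

A document can appear in several lists at once, so the groupings overlap, and no author had to decide in advance which pages belong together.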
Each environment can be addressed by a variety of searching algorithms. These include:
Inverted word indices. An index of every word in the library is maintained, along with pointers to where the words are located in each document.
n-gram. Every sequence of n characters in the document collection is tracked.
Document clustering. Creates overlapping sets of nodes based on similar content.
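The first two approaches can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any product: the function names, the choice of three-character n-grams and the two sample documents are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each word to {doc_id: [word positions within that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def build_ngram_index(docs, n=3):
    """Map each run of n characters to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        lowered = text.lower()
        for i in range(len(lowered) - n + 1):
            index[lowered[i:i + n]].add(doc_id)
    return index

def ngram_candidates(ngram_index, fragment, n=3):
    """Documents containing every n-gram of the fragment (candidates only)."""
    fragment = fragment.lower()
    grams = [fragment[i:i + n] for i in range(len(fragment) - n + 1)]
    if not grams:
        return set()
    return set.intersection(*(ngram_index.get(g, set()) for g in grams))

docs = {
    1: "Text retrieval on the Web",
    2: "Retrieving text from an intranet",
}
word_index = build_inverted_index(docs)
gram_index = build_ngram_index(docs)

print(word_index["text"])                       # where the exact word occurs
print(ngram_candidates(gram_index, "retriev"))  # matches a word fragment
```

The word index answers exact-word queries and keeps positions, which proximity searching needs. The n-gram index matches fragments such as "retriev" that an exact-word index would miss, at the cost of a larger index and candidate lists that still have to be verified against the text.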
Investigate the applicability of each approach and match it to user requirements. Also consider the availability of auxiliary features.
Among the most powerful, and most overlooked, of these is relevancy ranking. In a world where a query often results in "3652 documents found," relevancy ranking is essential. The premise is that many documents may satisfy a query, but some are more closely linked (or relevant) than others.
A search engine that uses relevancy ranking doesn't return a blind set of documents. It returns an intelligently ranked list in order of perceived value to the query. A ranked document list lets you begin your research with those documents that appear to bear a stronger connection to the query. If you have the interest (and the time), look at other documents that shed light on the subject. You can quickly decide whether a query meets your need for information by viewing the "best" of the documents first. This feature is a critical component in many intranet/third-party engines.
But how do you determine relevancy? Approaches range from simple query term tallies, suitable for finite research, to compound approaches using term weighting, omni-term skewing and word/document density. These approaches (much too subtle to describe here in detail) can meet the demands of investigative and lucky research.
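To give a rough feel for the two ends of that range, the Python sketch below contrasts a simple query term tally with one possible form of term weighting and density. The weighting used here is a TF-IDF-style formula, swapped in as a well-known stand-in; it is an assumption, not the scheme used by any engine named in this article, and the function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tally_score(query, text):
    """Simple tally: count how often the query terms occur in the text."""
    counts = Counter(tokenize(text))
    return sum(counts[term] for term in tokenize(query))

def weighted_scores(query, docs):
    """Weight rarer terms more heavily and divide counts by document length."""
    token_lists = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, tokens in token_lists.items():
        counts = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            # Document frequency: how many documents contain the term.
            df = sum(1 for toks in token_lists.values() if term in toks)
            if df == 0 or not tokens:
                continue
            weight = math.log(1 + n_docs / df)   # rarer term, higher weight
            density = counts[term] / len(tokens)  # share of the document
            score += density * weight
        scores[doc_id] = score
    return scores

def rank(query, docs):
    """Return document ids, best match first."""
    scores = weighted_scores(query, docs)
    return sorted(docs, key=lambda doc_id: scores[doc_id], reverse=True)

docs = {
    "a": "Text retrieval ranks text by relevance",
    "b": "Retrieval of documents from the Web is slow and the Web is large",
    "c": "The Web is large",
}
print(tally_score("text retrieval", docs["a"]))
print(rank("text retrieval", docs))
```

The tally treats every query term alike. The weighted version favors documents in which the rarer term makes up a larger share of the text, which is the kind of distinction that matters once a query returns thousands of hits.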
Other issues to consider are:
Agent-based technology. The basis of Web crawlers.
Query front ends. These make formulating and submitting queries easier.
Automatic document abstraction. Provides thumbnail snapshots of the document's content. Speeds up the user's ability to see a document's relevance.
Compound search engines. Search several engines simultaneously. Dynamically merge the results into a single cohesive relevancy-ranked set of documents.
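How a compound engine might merge results can be sketched as follows. The merging rule shown is reciprocal rank fusion, chosen because it needs only each engine's ordering and not its internal scores; it is an assumption for illustration, not the method of any particular product. The engine result lists and the constant `k=60` are hypothetical.

```python
from collections import defaultdict

def merge_rankings(result_lists, k=60):
    """Merge ranked lists: each appearance of a page adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, page in enumerate(results, start=1):
            scores[page] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically.
    return sorted(scores, key=lambda page: (-scores[page], page))

# Hypothetical ranked results from two engines for the same query.
engine_a = ["page1", "page2", "page3"]
engine_b = ["page2", "page4", "page1"]

merged = merge_rankings([engine_a, engine_b])
print(merged)
```

Pages that several engines rank highly rise to the top of the merged list, while a page found by only one engine is kept but ranked lower.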
Bottom line: Evaluators and users must look beyond the hype about "processing power" that often accompanies new Web-based search engines. Many of these search engines are fast and free, but they lack the intuition and intelligence of many "older" third-party products.
Text retrieval is a critical component of information retrieval in the Web/intranet environment. It separates the insignificant from the critical, based on dynamic determination of page content. Selecting and deploying the appropriate engine for each site is a fundamental step in system design.
Carl Frappaolo, Delphi Consulting Group's executive vice president, spends most of the year designing effective electronic document management strategies for large and small companies. He also sits on several industry standards bodies. Contact him at cf@delphigroup.com.