April 1999
Search & Retrieval
Special Agents Find It For You
By Penny Lunt
New search technologies turn your information quest into a targeted investigation.
Managing massive amounts of information and finding just the piece of "knowledge" you need -- be it a competitor's product launch date, the name of a person who understands a subject you're researching or a document that you read once but forgot the author, title and date -- is a task that even Fox Mulder would find daunting.
You can use a search engine on a corporate intranet, but it's likely to turn up irrelevant pages. Just as an Excite search on the Internet for "portal" cites 46,181 pages, many of which are about toys, luggage or the portal gun used in Quake, an intranet search with a traditional text-search engine will call up every document that contains the word you typed in.
New search tools, including corporate portals and advanced search software, handle the search more intelligently. They rank results based on relevance. They categorize and classify results so that you can search within a narrower group of sources and/or rule out information that doesn't pertain to your search. They let you pose queries as natural-language questions and look for not just the subject but the type of information you need. They let you visualize where information resides. They search not just by words, but by concepts.
How do these search technologies benefit you?
Access to everything: Search and portal software let you access electronic documents, images, emails, the Internet and a corporate intranet. Instead of having to duplicate documents or put them in HTML format to post to an intranet, you can keep them in their different repositories and formats, yet with one query retrieve them all.
In an international company, such a search could allow a developer working on a new product in the U.S. to find files from a similar project handled by a counterpart in Germany. Thus, the company could "know what it knows" instead of reinventing the wheel.
Faster searches: Some companies worry about how much time employees spend surfing the Internet, an intranet and other information sources. Software that conducts narrower and more intelligent searches saves the time of wading through hundreds of questionable results.
Timely new information: Portal software regularly scans information repositories (including news feeds, competitors' Web sites, internal project folders) for new items on subjects employees are interested in and sends them items or notifications.
For example, engineers need to keep up with suppliers' new products and problems with products. One vendor might sell a better product line to your competitor. Another might discontinue a part you use in one of your products. A portal can search the Internet for any information about those vendors and create a daily publication for engineers on what those vendors are up to.
Portals Search and Deliver
A portal, according to my dictionary, is "an entrance, door or gate, especially one that is grand or imposing." Portal software is actually more like a butler at your information doorway. It keeps out unwanted riffraff and ushers in only what you want to see -- newsfeeds, emails, links to Web sites and documents you have an interest in and that pertain to your job.
An advanced corporate portal searches across the range of corporate sources of information, including document management systems, enterprise resource planning systems, groupware systems, intranets and email systems. It then pushes the information through browsers to each individual according to their specific needs. It's like a customized My Yahoo! or My Netscape page applied to internal information sources as well as the Internet.
Portals are an example of what Eric Woods describes as "the more intuitive" search and retrieval technologies coming to the fore. "I think the real secret will be understanding how users want to interact with the system," says Woods, principal consultant at Ovum, an IT consulting firm based in London. "The products that will be successful will be the ones that combine a degree of automation with the intuitive interface people need."
Woods cites Autonomy, which is headquartered in San Francisco, as a vendor offering an intuitive portal product. Autonomy has a background in pattern recognition technology and intelligent agent technology. "Now they're automatically identifying user profiles and recording how people use information on the system," says Woods. "When you do a search, you're not just searching what's in the company, you're searching across profiles of use within the company. You can not only identify sources of information -- documents or an intranet -- but also people, because you find people making common use of information in the company."
Autonomy's Portal-In-A-Box ($50,000+) was introduced in February. It lets intranet developers create an information portal for users that melds information on the Internet with internal information sources such as ODMA-compatible document management systems, a Lotus Notes database, word processing documents, PDF files, email messages, Excel files and PowerPoint presentations. You can set up a directory structure and have agents scan these information repositories for documents that fit into those categories.
Unlike Internet portals like My Yahoo!, Portal-In-A-Box doesn't rely on predefined categories. "In a corporation, when somebody writes a letter or a report, they don't metatag it," points out Michael Lynch, Autonomy's CEO. "The technology has to be a lot cleverer at understanding what something's about. Having preselected channels is also not very useful because different people in the corporation will have different interests. I may be interested in any breaking news on the Malaysian teak industry, but you're unlikely to have an abundance of people all wanting news in that category."
Portal-In-A-Box uses concept searching technology developed by Autonomy. Rather than looking for specific keywords or processing Boolean equations, it uses probability to look at the words in a document, infer what the document is talking about and then match it to other documents. Concept matching avoids the mistakes you get in keyword searches, such as searching for penguins and getting articles on the Pittsburgh Penguins hockey team.
Concept searching also lets you move beyond the formal queries older technologies required. There are two ways of telling the system the type of information you're interested in. One is to type a paragraph of natural language to ask for what you want. The system analyzes that paragraph, works out what the ideas are and then looks for those ideas in the content using a method called Bayesian inference. (This method is also used in speech recognition, object recognition and intelligent character recognition.)
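To give a feel for how probability can separate the two kinds of penguin, here is a minimal sketch of a naive Bayes text classifier in Python. It is a textbook simplification, not Autonomy's actual method, and the training snippets, topic names and function names are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical training snippets, labeled by topic (not from any real system).
TRAINING = {
    "wildlife": [
        "penguins nest on antarctic ice and feed on fish",
        "the penguin colony raises chicks on the ice",
    ],
    "hockey": [
        "the pittsburgh penguins won the hockey game",
        "penguins score late goal in hockey playoff game",
    ],
}


def train(examples):
    """Count how often each word appears under each topic."""
    counts = {topic: Counter() for topic in examples}
    for topic, docs in examples.items():
        for doc in docs:
            counts[topic].update(doc.lower().split())
    return counts


def classify(text, counts):
    """Return the topic with the highest posterior log-probability."""
    vocab = {word for c in counts.values() for word in c}
    scores = {}
    for topic, c in counts.items():
        total = sum(c.values())
        # Uniform prior over topics; add-one smoothing for unseen words.
        score = math.log(1.0 / len(counts))
        for word in text.lower().split():
            score += math.log((c[word] + 1) / (total + len(vocab)))
        scores[topic] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(classify("penguins feed their chicks fish", model))  # wildlife
print(classify("penguins hockey game tonight", model))     # hockey
```

Both queries contain the word "penguins," yet the surrounding words tip the probabilities toward different topics, which is the effect a plain keyword match cannot deliver.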
Another way users tell the system what they're interested in is by giving it examples. You could click on a memo and a news item, both about the Malaysian teak industry, and tell the system to let you know if anything else like those comes in. In both cases, the examples and the results don't have to contain any of the same words -- the software looks for ideas and concepts.
"In a corporate situation that's very important because you'll use different language if you're writing a memo to a work colleague, to the boss or to a customer," Lynch says. "The system has to be able to see that all those messages are about the same thing."
Portal-In-A-Box also performs automatic hypertext linking. If you pull up a letter on your corporate intranet, the system reads it, figures out what it's about and gives you hypertext links to related items, either internal or external.
"You can be working on your word processor and the system will suggest other documents related to the letter you're writing," Lynch says. "It can even suggest people you should talk to; having looked at the sort of things each person is working on, it can automatically identify their areas of expertise. People tend to be far more useful than documents."
So far, Autonomy users have been online publishers such as Rupert Murdoch's News Corp. and The Chronicle Publishing Company. They use the product to automatically aggregate, categorize, hyperlink and personalize articles from different publications, instead of requiring editorial staff to do it. Both companies report that Autonomy does automatically what more than 100 editorial assistants used to do manually. But Portal-In-A-Box is targeted at any company that has an intranet and/or a web site and needs to search across that and other sources.
Another portal provider is two-year-old Plumtree in San Francisco, CA. The Plumtree Server ($50,000+) resides on an intranet. It uses the popular Verity engine for text searching. It can text-search a document repository, the Web and/or a groupware database. It also searches metadata tags on ERP systems, data warehouses, business intelligence systems and OLAP.
"The key advantage to Plumtree is the types of information we handle -- structured as well as unstructured data," says Glenn Kelman, vice president of product management and marketing. "We use the metadata: Who's the author of this content, when was it created, what is it about?"
Unlike Autonomy's technology, the Plumtree Server doesn't attempt to create categories automatically. Rather, it lets the intranet administrator define information categories and how they should be displayed. Then it automatically organizes and maintains access to content in those categories.
"We're confident that the business process is a custom process and that no engine, no matter how intelligent, can ever truly understand your business or how you think of your world," Kelman says.
The Plumtree Server maintains a card-catalog-like index for every document or image file, report, email or Web page encompassed by the portal. Each "card" contains basic information about the document such as author, subject and summary. For example, Plumtree can pull the sender and subject from the metadata of every email within Lotus Notes. The summaries are generated by the Verity search engine.
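The card-catalog idea is easy to picture as a data structure. The Python sketch below is only an illustration of the concept: the field names, the `card_from_email` helper and the sample message are hypothetical, and the "summary" is just the start of the message body rather than anything a real summarizer such as Verity's would produce.

```python
from dataclasses import dataclass


@dataclass
class Card:
    """One catalog entry; the field names are illustrative."""
    location: str
    author: str
    subject: str
    summary: str


def card_from_email(message, location, max_summary=60):
    """Build a card from an email's sender, subject and body.

    The summary is simply the first characters of the body with
    whitespace collapsed; a real portal would call its search
    engine's summarizer instead.
    """
    body = " ".join(message["body"].split())
    return Card(
        location=location,
        author=message["sender"],
        subject=message["subject"],
        summary=body[:max_summary],
    )


# Hypothetical message and address, for demonstration only.
msg = {
    "sender": "a.engineer@example.com",
    "subject": "Hydraulic pump test",
    "body": "Test results for the new pump are attached.\nPlease review.",
}
catalog = {}
card = card_from_email(msg, "notes://mail/1001")
catalog[card.location] = card
print(card)
```

Because the card stores only a pointer to the original plus a few descriptive fields, the document itself can stay wherever it already lives.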
Caterpillar, the tractor manufacturer based in Peoria, IL, has been piloting Plumtree software for a few months with 100 engineer users and hopes to roll it out to all 65,000 employees across the world.
"The thing about information in a big company is you have so much of it, you need to be able to organize it," says Anne Jeanblanc, intranet consultant and manager of this project. "This is a good way to organize it. You can manage information inside the organization and get information from the outside, from news wires, from your competitors' sites, from all different sources. You can create filters so you only get what you want, instead of tons of information." A filter might weed out pages that are more than 30 days old or that have certain words in them, for example. "That eliminates a lot."
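A filter of the kind Jeanblanc describes boils down to a couple of tests per page. The following Python sketch assumes each page is a simple record with a publication date and text; the record layout, the `passes_filter` name and the sample pages are invented for illustration and say nothing about how Plumtree implements its filters.

```python
from datetime import date, timedelta


def passes_filter(page, today, max_age_days=30, blocked_words=()):
    """Keep a page only if it is recent and contains no blocked words."""
    if today - page["published"] > timedelta(days=max_age_days):
        return False
    words = set(page["text"].lower().split())
    return not any(word.lower() in words for word in blocked_words)


# Hypothetical pages gathered by a portal.
today = date(1999, 4, 1)
pages = [
    {"title": "Supplier recall notice", "published": date(1999, 3, 20),
     "text": "Supplier recalls hydraulic valve"},
    {"title": "Old press release", "published": date(1998, 11, 2),
     "text": "New tractor line announced"},
    {"title": "Portal gun FAQ", "published": date(1999, 3, 30),
     "text": "Quake portal gun tips"},
]
kept = [p["title"] for p in pages
        if passes_filter(p, today, blocked_words=("quake",))]
print(kept)  # ['Supplier recall notice']
```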
She likes the card catalog scenario. "That's something people have grown up with and understand," she says. She also likes the robust Oracle back-end. So far, they use Plumtree to access files on file servers, on their intranet and over the Internet. They appreciate being able to access files in their native format instead of having to convert them to HTML and post them to the intranet. "It's hard work to keep your Web site up to date internally," says Jeanblanc.
For example, a security group might need to post procedure documents that need to be tweaked and updated monthly. With Plumtree, all they'd have to do is store the revised documents to a network drive subdirectory and schedule Plumtree to comb that subdirectory for new items, create library cards and notify everyone about the new content.
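The scheduled sweep in that scenario can be sketched in a few lines of Python. This is an assumption-laden illustration, not Plumtree's mechanism: it detects new or changed files by comparing modification times against the previous scan, and `notify` is a stand-in for creating cards and alerting users.

```python
import tempfile
from pathlib import Path


def scan_for_new_files(folder, seen):
    """Return files that are new or modified since the last scan.

    `seen` maps each path to its last known modification time and is
    updated in place, so calling this on a schedule reports only changes.
    """
    new_items = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        mtime = path.stat().st_mtime
        if seen.get(str(path)) != mtime:
            seen[str(path)] = mtime
            new_items.append(path)
    return new_items


def notify(paths):
    """Stand-in for creating library cards and notifying subscribers."""
    return [f"New or updated: {p.name}" for p in paths]


# Demonstration in a throwaway folder; the file name is hypothetical.
with tempfile.TemporaryDirectory() as folder:
    seen = {}
    Path(folder, "badge-procedure.txt").write_text("rev 1")
    first = scan_for_new_files(folder, seen)
    second = scan_for_new_files(folder, seen)
    print(notify(first))
    print(notify(second))
```

The first sweep reports the new procedure document; the second, run with nothing changed, reports nothing.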
One thing new to Caterpillar is that Plumtree runs on Microsoft IIS. The company's Web server runs on Netscape and that's what the operations group has been familiar with. On the other hand, Plumtree uses the NT security and passwords they already have on their network, which is convenient.
Caterpillar hopes that by rolling Plumtree out company-wide, it will be able to break down cultural barriers that separate its 26 different business units. "They work across the world, they need to share information across the world," Jeanblanc says.
The Search Engines
Web-style search engines are also moving beyond mere text searching and working across multiple sources of information. Unlike portals, they don't proactively push information to users. They accept queries and produce results.
Excalibur Technologies of Vienna, VA, began making pattern recognition software in 1980. Their RetrievalWare search software ($9,950+) uses pattern recognition technology and concept searching to search the Internet, intranets, document management systems, word processing files, PDF files, newsfeeds, groupware systems (Exchange and Notes) and email as well as graphical images and video files.
One way Excalibur uses pattern recognition is for "fault-tolerant" searching -- particularly for documents that have been scanned and OCRed. The OCR results of long documents are scattered with errors. The fuzzy search capability of pattern recognition means that if you conducted a search for the word "installation," Excalibur would be able to turn up the words "instahation" and "instamation," recognizing that those words fit the same pattern.
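One common way to approximate this kind of fault tolerance is edit distance: count how many single-character changes separate two words and accept anything under a threshold. The Python sketch below uses the classic Levenshtein algorithm as a stand-in; it is not Excalibur's pattern recognition technology, and the word list and the two-edit threshold are assumptions chosen for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character from b
                prev[j - 1] + cost,  # substitute (or keep a match)
            ))
        prev = curr
    return prev[-1]


def fuzzy_find(query, words, max_edits=2):
    """Return the words that lie within max_edits of the query."""
    return [w for w in words if edit_distance(query, w) <= max_edits]


# Hypothetical words pulled from OCRed pages.
ocr_words = ["instahation", "instamation", "insulation",
             "installation", "institution"]
print(fuzzy_find("installation", ocr_words))
# ['instahation', 'instamation', 'installation']
```

Both OCR misreadings are two edits away from "installation," so they are found, while genuinely different words such as "insulation" fall outside the threshold.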
"It's more than just searching," says Excalibur director of marketing Mark Demers. "It's the ability to overcome errors."
Excalibur's concept searching tries to bridge the gap between the words a searcher uses to find a document and the words the author used. It looks at words in and around a query to determine the context. The software uses a neural network, a semantic network, a 600,000-word dictionary and thesaurus and natural language understanding to help make these matches. They provide Boolean searches as well as "most occurrences" searches.
UCLA uses RetrievalWare campus-wide to filter and distribute information. When a document (such as an email) has words in it that match a user profile, it gets automatically routed to that person. United Airlines uses the software to provide more than 10,000 mechanics and 5,000 operations specialists with instant access to manuals and procedures. They can type a natural-language question and search all documents and scanned images.
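Profile-based routing of the sort UCLA uses can be pictured as a set intersection between a document's words and each user's keywords. This Python sketch is a bare-bones assumption of how such matching might work; the profiles, the `route` function and the sample message are hypothetical, and RetrievalWare's real matching is described above as concept-based rather than exact-word.

```python
def route(document, profiles, min_matches=1):
    """Return the users whose profile keywords appear in the document."""
    words = set(document.lower().split())
    recipients = []
    for user, keywords in profiles.items():
        hits = words & {k.lower() for k in keywords}
        if len(hits) >= min_matches:
            recipients.append(user)
    return sorted(recipients)


# Hypothetical user profiles: each user lists the words they care about.
profiles = {
    "admissions": {"enrollment", "applicants"},
    "facilities": {"parking", "maintenance"},
}
print(route("Parking lot 4 closed for maintenance Friday", profiles))
# ['facilities']
```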
Pattern searching also can be applied to image searching and video searching. If you search for a sunset, the software will deliver any image linked to a sunset. A photo might be called "sun going down" and the software will see that that's related. Yahoo! uses this method to use pictures to find other pictures.
"Excalibur has a niche of searching for images and video," says Woods of Ovum. "That's their strength."
RetrievalWare is client/server technology that enables distributed searching through NT, Windows 95 or any Web browser. Version 6.7 of the product, set to launch in June, will have automatic categorization and enhanced user interface capabilities from within Lotus Notes. There will also be an Experts Directory that's similar to Autonomy's people monitoring.
"Today we're solving the problem of finding the data [and intelligence] in documents," says Demers. "The new version will link people with people."
If you are doing a lot of searches on a particular topic, the software will pick up on that. There won't be a privacy issue -- it will be your choice to save your queries and results and share them with others.
PC Docs/Fulcrum of Toronto, Ontario, is the result of document management vendor PC Docs' acquisition last year of search vendor Fulcrum. At press time, PC Docs was due to be acquired by Hummingbird Communications (also in Toronto). The plan was for the merged company to combine Hummingbird's experience handling structured data with PC Docs' expertise in document management to create an Enterprise Knowledge Portal.
SearchServer is Fulcrum's search and retrieval toolkit, which is sold to software vendors through OEM agreements as well as to resellers. A software vendor might use it in a CD-ROM searching application, for example.
SearchServer 3.7E provides Boolean and proximity word searching. The next release, which will come out in the second half of this year, will have more proximity searching, such as the ability to find two words within the same phrase, two words within the same paragraph and word "A" within three words of word "B".
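Proximity tests like these are straightforward to express once a document is split into words. The Python sketch below is a generic illustration, not SearchServer's API: the function names are made up, "within three words" is assumed to mean the word positions differ by at most three in either order, and paragraphs are assumed to be separated by blank lines.

```python
import re


def within_n_words(text, a, b, n=3):
    """True if word a occurs within n words of word b, in either order."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)


def same_paragraph(text, a, b):
    """True if both words occur in one blank-line-delimited paragraph."""
    for para in re.split(r"\n\s*\n", text.lower()):
        tokens = set(re.findall(r"\w+", para))
        if a.lower() in tokens and b.lower() in tokens:
            return True
    return False


doc = ("The portal indexes every memo.\n\n"
       "Search results are ranked by relevance to the query.")
print(within_n_words(doc, "ranked", "relevance"))  # True
print(within_n_words(doc, "portal", "relevance"))  # False
print(same_paragraph(doc, "portal", "query"))      # False
```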
"The strength of SearchServer versus Verity and Excalibur is its scalability," says Mario Couture, product manager. "The number of documents it can access in a single search is 60 million times 127,000." This makes searching across thousands of documents more efficient.
DocsFulcrum is an application built on top of SearchServer. It provides a single point of access to Microsoft Exchange, Lotus Notes, an intranet or the Internet, a file system, databases and other applications built on SearchServer. It provides a "knowledge map" that visually shows the physical location of documents the same way Windows Explorer does. The current version, 2.8, provides support for Novell Netware, search-term highlighting when viewing any document format in HTML and support for NT local group permissions. The next generation of the product will let users create enterprise tables of contents based on the content of documents rather than the location. It will use neural network technology to automatically classify information. The idea is that if you don't know where to look for information, this electronic table of contents will direct you to the right spot, just as the table of contents in a book does.
DocsFulcrum also provides summarization and distributed searching. The summarization function extracts noun phrases from a document. This lets it determine what's unique about the document by identifying the terms and then the phrases based on other documents.
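One well-known way to find what is unique about a document relative to others is TF-IDF weighting, which favors terms that are frequent in one document but rare across the collection. The Python sketch below uses it as a stand-in: it scores single words rather than the noun phrases DocsFulcrum is said to extract, and the corpus, function names and smoothing formula are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def distinctive_terms(doc, corpus, top_n=3):
    """Rank a document's words by TF-IDF against the whole corpus."""
    tf = Counter(tokenize(doc))
    docs_tokens = [set(tokenize(d)) for d in corpus]
    scores = {}
    for term, count in tf.items():
        df = sum(1 for toks in docs_tokens if term in toks)
        # Smoothed inverse document frequency; common words score near 1.
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1
        scores[term] = count * idf
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


# Hypothetical three-document collection.
corpus = [
    "the report covers quarterly sales of tractors in europe and asia",
    "the report covers engine maintenance schedules and costs in detail",
    "the report covers teak exports and teak prices in malaysia",
]
print(distinctive_terms(corpus[2], corpus))
# ['teak', 'exports', 'malaysia']
```

Words shared by every report ("the," "report," "covers") sink to the bottom, leaving the terms that set the third document apart.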
"Users want the software to do the work," Couture explains. "When I do a search that turns up 20 documents of 50 pages each, I don't want to read each one; I want a summary."
Document management vendor Documentum of Pleasanton, CA, recently bought a search engine vendor called Relevance. The company says it will include their technology in a future release. Similarly, web-based knowledge/document management vendor OpenText of Waterloo, Ontario, recently bought Lava Systems, a knowledge management software company based in Toronto, Ontario.
Verity (Sunnyvale, CA) is the leader of OEM deals. Its search engine is in Lotus Notes, Plumtree and many other products.
Verity launched its newest product, the Knowledge Organizer, in mid-March. It creates precise classifications of documents of particular types -- they might be on a particular subject, written by the same person or generated in the same department.
If you know the category you need, you can quickly search within that category rather than search through all the documents in the company. The software also lets you create categories through a mix of human and computer effort.
"Other companies try to make searching all machine-based," says Hanno Sanders, product manager at Verity. "We believe machines make mistakes and that it's better to have a human plus a machine."
The Knowledge Organizer builds "topics" -- definitions of categories using existing folders and metadata and human input. It facilitates interactive searching by letting you type your search request, get a list of categories, make a choice and drill down.
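The type, pick a category, drill down sequence can be sketched as two small steps: group the hits by category, then narrow to the category the user chooses. The Python below is a generic illustration under assumed data; the document records, category names and function names are hypothetical and do not reflect the Knowledge Organizer's actual interface.

```python
from collections import defaultdict

# Hypothetical documents; categories would come from folders or metadata.
DOCS = [
    {"title": "Teak tariff memo", "category": "Legal",
     "text": "New tariffs on teak imports"},
    {"title": "Teak supplier list", "category": "Purchasing",
     "text": "Approved teak suppliers"},
    {"title": "Teak price forecast", "category": "Purchasing",
     "text": "Forecast of teak prices"},
    {"title": "Travel policy", "category": "Legal",
     "text": "Rules for business travel"},
]


def categories_for(query, docs):
    """Group the titles of matching documents by category."""
    groups = defaultdict(list)
    for doc in docs:
        if query.lower() in doc["text"].lower():
            groups[doc["category"]].append(doc["title"])
    return dict(groups)


def drill_down(query, category, docs):
    """Return only the matches that fall in the chosen category."""
    return categories_for(query, docs).get(category, [])


hits = categories_for("teak", DOCS)
print({category: len(titles) for category, titles in hits.items()})
print(drill_down("teak", "Purchasing", DOCS))
```

The first call shows the searcher how many hits each category holds; the second returns just the Purchasing documents once that category is picked.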
Each of these products has its own strengths. They're all new and evolving. Look for one that has proven integration with your existing software.