The search for the perfect search engine
There's a long road still to travel
Analysis There are several unsolved problems with search engines: they do not index all of the Web; no single engine is likely to be satisfactory for all searching purposes; and their capabilities are often over-hyped. At least the issue of completeness is improving, with Google claiming to have indexed 1.06 billion Web pages - but that's somewhat hyped, since the total is made up of 560 million URLs and another 500 million links that it has recorded but not visited.
The unsolvable problem is the near-impossibility of having a single engine that could perform all types of searches. Sure, hybrid engines could be used, rather as happens with metasearch engines like ProFusion, which combines the results of searches run on a chosen set of engines. But the reality is that with several fundamentally different methods of searching being available (and needed), no single engine could be optimised for all searches.
Keyword searching was the first way of finding information and is still perhaps the most important, but the relevance of the results is often low, even if search terms are truncated or logical operators (including proximity operators) are used. The daddy of all these engines was Dialog, which was developed at Lockheed in the early 1960s as part of a NASA contract. Dialog passed first to Knight Ridder and then to MAID in a highly-leveraged buyout, followed by a fire sale to the Thomson Corporation. Its primary use was for bibliographical reference searching, later extended to full text. Engines like Alta Vista are latter-day derivatives, with some bells like translation as well as some missing whistles like code categories. In many ways, keyword engines have their origins in the likes of card indexes, and as we shall see, there's nothing new under the search engine sun.
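For readers who want to see the mechanics, here is a minimal sketch of keyword searching with truncation and logical operators, written in Python. It is an illustration only, not how Dialog or Alta Vista are built: the function names `build_index` and `lookup`, the trailing `*` for truncation, and the three sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?")].add(doc_id)
    return index


def lookup(index, term):
    """Return matching ids; a trailing '*' means truncation (prefix match)."""
    term = term.lower()
    if term.endswith("*"):
        prefix = term[:-1]
        hits = set()
        for word, ids in index.items():
            if word.startswith(prefix):
                hits |= ids
        return hits
    return set(index.get(term, set()))


# Hypothetical sample documents, invented for the illustration.
docs = {
    1: "Search engines index the Web",
    2: "Card indexes predate computers",
    3: "Searching the Web by keyword",
}
index = build_index(docs)

# Logical operators map directly onto set operations.
both = lookup(index, "search*") & lookup(index, "web")      # AND
either = lookup(index, "card") | lookup(index, "keyword")   # OR
without = lookup(index, "index*") - lookup(index, "web")    # NOT
```

The sketch also shows why relevance is often low: a document either contains the term or it does not, and nothing in the index says what the document is actually about.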
A long way to go
There has been little progress with semantic engines that parse text, because this is a very difficult thing to do successfully. The problems include context and the variety of idiom. Progress has been made in speech recognition and automatic translation engines, and a convergence with search engines is just beginning to be discernible. For an analogy of the present situation, think back to the days of dot matrix printers, daisy wheel printers, and the like: the progress yet to be made in the search/speech recognition/translation convergence is like the shift to laser printing. It's a big leap, but it will happen - although just as there will be small, portable inkjet printers into the foreseeable future, so there will remain a split into text search engines and concept searching engines.
Some search engines allow advertisers or fee payers to have their pages displayed preferentially. Engines like Alta Vista use logical operators (also known incorrectly as Boolean operators) which many users do not fully exploit. The latest rage is for XML search engines that can find content marked up with XML, such as GoXML.com.
Google uses what it calls PageRank to sequence the results of searches, and is applying for a patent on it. It's all based on the number of links to a particular page, with each link weighted by the rank of the linking page. The ideas are similar to those first used many years earlier in Science Citation Index and its ilk. Even among Google's 1.06 billion indexed pages, many no longer exist at their original sites, so the cached version that Google offers is often a blessing for the searcher, and of course a curse for the page creator if the page has been removed because of an error.
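The link-weighting idea can be sketched in a few lines of Python. This is a simplified illustration rather than Google's actual implementation: the function name `pagerank`, the damping value of 0.85, the treatment of pages with no outgoing links, and the four-page link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively rank pages; `links` maps each page to the pages it links to.

    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to C, nobody links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

In this toy graph C comes out on top because three pages point to it, while A, which has only a single incoming link, ranks well above B and D because that link comes from the highly ranked C.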
San Francisco-based Alexa is highly regarded by its users, while Cambridge, UK-based Autonomy has a similar active Web product called Kenjin, which was released about four months ago - free, but Windows only. It analyses concepts in an active window, gives links to more relevant information, and can sift through local content on disk. Kenjin means wise man in Japanese, but at the moment it's more of a wise boy, since it still has some tricks to learn.
How clever is Autonomy?
Autonomy - founded by Mike Lynch and now worth around $5 billion - has achieved renown for its concept matching capability, which is certainly workmanlike, but the success of the company derives more from a solid approach to productisation and marketing. So far Autonomy has encountered little competition, but its technology is not exactly space-age: it is more historically based. The company has what it rather grandly describes as "intellectual foundations". One of these is Bayesian Analysis, which is used to work out probability relationships, and that's been around since the Reverend Thomas Bayes of Tunbridge Wells wrote a pioneering paper about statistical inference in 1763 (it was published posthumously, as he died in 1761).
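To give a flavour of the kind of probability relationship Bayes' rule works out, here is a minimal Python sketch. It is not Autonomy's method, which is not described here: the function name `posterior` and the three probabilities fed into it are invented purely for illustration.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Bayes' rule: P(hypothesis | evidence) from a prior and two likelihoods."""
    evidence = likelihood * prior + likelihood_if_not * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical figures: 10% of documents concern a topic; a given term
# appears in 80% of on-topic documents and in 20% of all the others.
p_on_topic = posterior(prior=0.1, likelihood=0.8, likelihood_if_not=0.2)
```

With these made-up numbers, spotting the term lifts the chance that a document is on-topic from 10 per cent to roughly 31 per cent.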
The other credit is to electrical engineer Claude Shannon, who had an essentially two-bit idea in 1949 that information can be quantified, and that the information content of a word or phrase is inversely related to its frequency of occurrence - and that's about it, because the theory doesn't get you much further.
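In the standard formulation, the information content of an item is the negative base-2 logarithm of its probability, measured in bits, so rarer words carry more information. A minimal Python sketch follows; the function name `information_bits` and the sample sentence are assumptions made for the illustration.

```python
import math
from collections import Counter


def information_bits(word, counts):
    """Information content of a word in bits: -log2 of its relative frequency."""
    total = sum(counts.values())
    return -math.log2(counts[word] / total)


# Hypothetical eight-word sample text.
text = "the cat sat on the mat the end"
counts = Counter(text.split())

common = information_bits("the", counts)  # appears 3 times in 8 words
rare = information_bits("cat", counts)    # appears once in 8 words
```

Here the rare word carries exactly three bits, while the common word carries fewer than one and a half.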
Although few people have probably realised it yet, the convergence we mentioned results from Markov chains and transition probability matrices that are used nowadays in speech recognition engines. These were first introduced by Andrei Andreyevich Markov, a Russian who went into self-imposed exile in 1905 and died in 1922. The theory is quite simple: in many cases, the probability of an event depends only on the immediately preceding event (or events, if you want to get more accurate and use second, third or greater orders). The events could be the pronunciation of the syllables of a word or phrase in speech recognition, for example, and the same approach can serve as an aid to concept analysis.
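A first-order transition probability table of the kind described can be estimated by simple counting. The Python sketch below uses words rather than syllables for readability; the function name `transition_probabilities` and the sample sentence are assumptions made for the example, and real speech recognition engines are far more elaborate.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequence):
    """Estimate P(next event | current event) from adjacent pairs in a sequence."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return {
        state: {
            nxt: count / sum(followers.values())
            for nxt, count in followers.items()
        }
        for state, followers in counts.items()
    }


# Hypothetical sample sequence.
words = "the cat sat on the mat and the cat ran".split()
probs = transition_probabilities(words)
```

In this sample, "the" is followed by "cat" two times out of three and by "mat" once, so the table gives probabilities of two-thirds and one-third. A second-order chain would key the table on pairs of preceding events instead of single ones.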
There is too much hype, and too many extravagant claims are made by search engine vendors. Search engines cannot be all things to all people, and it's a matter of experience to know which is best for a particular requirement. Users who rely on a few favoured Web sites, or just one Web engine, will not be finding all that exists on the Web - or even just the best stuff.
Serious problems remain in Web site design: many Webmasters destroy the date of the original creation of a document and replace it with an automatically-generated date on which the page or site was supposedly last updated. The original date is very valuable information for future searching, because it allows pages to be excluded or included by age.
There is also a need for better indexing design if we're to be able to find things on the Web, but the principles for defining the guidelines should come from information science, rather than from nerd-lore. ®