The search for the perfect search engine

There's a long road still to travel

Analysis There are several unsolved problems with search engines: they do not index all the Web; no single engine is likely to be satisfactory for all searching purposes; and their capabilities are often over-hyped. At least the issue of completeness is improving, with Google claiming to have indexed 1.06 billion Web pages - though that figure is itself somewhat hyped, since only some 560 million of those pages have actually been visited; the other 500 million are URLs Google has recorded but not crawled.

The unsolvable problem is the near-impossibility of having a single engine that could perform all types of searches. Sure, hybrid engines could be used, rather as happens with metasearch engines like ProFusion, which combines results from a chosen set of engines. But the reality is that with several fundamentally different methods of searching available (and needed), no single engine can be optimised for all searches.

Keyword searching was the first way of finding information and is still perhaps the most important, but relevance is often low, even if search terms are truncated or logical operators are used (including nearness). The daddy of all these engines was Dialog, which was developed at Lockheed in the early 1960s as part of a NASA contract. Dialog passed first to Knight Ridder and then to MAID in a highly-leveraged buyout, followed by a fire sale to the Thomson Organisation. Its primary use was for bibliographical reference searching, later extended to full text. Engines like Alta Vista are latter-day derivatives, with some bells like translation as well as some missing whistles like code categories. In many ways, keyword engines have their origins in the likes of card indexes, and as we shall see, there's nothing new under the search engine sun.
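The card-index heritage is easy to see in code: at bottom, a keyword engine is an inverted index queried with logical operators. A minimal sketch (the documents and function names here are invented for illustration, not any real engine's internals):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_and(index, *terms):
    """Documents containing every term (logical AND)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Documents containing any of the terms (logical OR)."""
    result = set()
    for t in terms:
        result |= index.get(t, set())
    return result

docs = {
    1: "search engines index the web",
    2: "card indexes predate the web",
    3: "keyword search with logical operators",
}
index = build_index(docs)
hits_and = search_and(index, "search", "web")   # only document 1 has both terms
hits_or = search_or(index, "card", "keyword")   # documents 2 and 3
```

Truncation and nearness operators are refinements on top of the same structure - which is why relevance stays low: the index knows which words a page contains, not what it means.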

A long way to go

There has been little progress with semantic engines that parse text, because this is a very difficult thing to do successfully. The problems include context and the variety of idiom. Progress has been made in speech recognition and automatic translation engines, and a convergence with search engines is just beginning to be discernible. For an analogy of the present situation, think back to the days of dot matrix printers, daisy wheel printers, and the like: the progress yet to be made in the search/speech recognition/translation convergence is like the shift to laser printing. It's a big leap, but it will happen - although just as there will be small, portable inkjet printers into the foreseeable future, so there will remain a split into text search engines and concept searching engines.

Some search engines allow advertisers or fee payers to have their pages displayed preferentially. Engines like Alta Vista use logical operators (also known incorrectly as Boolean operators) which many users do not fully exploit. The latest rage is for XML search engines that can find content marked up with XML, such as GoXML.com.

Google uses what it calls PageRank to sequence the results of searches, and is applying for a patent for this. It's all based on the number of links to a particular page, which is weighted by the rank of the linking page. Even with Google's 1.06 billion indexed pages, many no longer exist on the site, so the cached version that Google offers is often a blessing for the searcher, and of course a curse for the page creator if it has been removed because of an error. The ideas are similar to those first used many years earlier in Science Citation Index and its ilk.

San Francisco-based Alexa is highly regarded by its users, while Cambridge, UK-based Autonomy has a similar active Web product called Kenjin, which was released about four months ago - free, but Windows only. It analyses concepts in an active window, gives links to more relevant information, and can sift through local content on disk. Kenjin means wise man in Japanese, but at the moment it's more of a wise boy, since it still has some tricks to learn.

How clever is Autonomy?
Autonomy - founded by Mike Lynch and now worth around $5 billion - has achieved renown for its concept matching capability, which is certainly workmanlike, but the success of the company derives more from a solid approach to productisation and marketing. So far Autonomy has encountered little competition, but its technology is not exactly space-age: it is more historically based. The company has what it rather grandly describes as "intellectual foundations". One of these is Bayesian Analysis, which is used to work out probability relationships, and that's been around since the Reverend Thomas Bayes of Tunbridge Wells wrote a pioneering paper about statistical inference in 1763 (it was published posthumously, as he died in 1761).

The other credit is to electrical engineer Claude Shannon, who had an essentially two-bit idea in 1948 that information can be quantified, and that the information content of a word or phrase is inversely proportional to its frequency of occurrence - and that's about it, because the theory doesn't get you much further.
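Two-bit or not, the idea is directly useful for search-term weighting: rare words are informative, common ones are noise. A sketch of Shannon's self-information measure, with invented word frequencies:

```python
import math

def self_information(probability):
    """Shannon's measure in bits: rarer events carry more
    information, on a logarithmic scale."""
    return -math.log2(probability)

# A word found in half of all documents tells you almost
# nothing; one found in 1 document in 1024 tells you a lot.
common = self_information(0.5)       # 1 bit
rare = self_information(1 / 1024)    # 10 bits
```

This is essentially the inverse-document-frequency weighting that keyword engines use to decide which query terms matter most.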

Although few people have probably realised it yet, the convergence we mentioned results from Markov chains and transition probability matrices that are used nowadays in speech recognition engines. These were first introduced by Andrei Andreevich Markov, a Russian who went into self-imposed exile in 1905 and died in 1922. The theory is quite simple: in many cases, the probability of an event depends only on the immediately preceding event (or events, if you want to get more accurate and use second, third or greater orders). The events could be the pronunciation of the syllables of a word or phrase in speech recognition, or the succession of terms in a document as an aid to concept analysis, for example.
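A first-order chain can be estimated from nothing more than counting which token follows which. A minimal sketch (the sample text is invented; real speech recognisers work over far larger models and higher orders):

```python
from collections import defaultdict, Counter

def transition_probabilities(tokens):
    """Estimate first-order Markov transitions, P(next | current),
    by counting adjacent pairs in the sequence."""
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: c / sum(followers.values())
                  for nxt, c in followers.items()}
        for current, followers in counts.items()
    }

tokens = "the cat sat on the mat the cat ran".split()
probs = transition_probabilities(tokens)
# "the" is followed by "cat" twice and "mat" once, so the model
# estimates P(cat | the) = 2/3 and P(mat | the) = 1/3.
```

Whether the events are syllables in a speech engine or terms in a document, the machinery is the same - which is exactly why the two fields are converging.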

There is too much hype, and too many extravagant claims are made by search engine vendors. Search engines cannot be all things to all people, and it's a matter of experience to know which is best for a particular requirement. Users who rely on a few favoured Web sites, or just one Web engine, will not be finding all that exists on the Web - or even just the best stuff.

Serious problems remain in Web site design: many Webmasters destroy the date of the original creation of a document and replace it with an automatically-generated date on which the page or site was last supposedly updated. The original date is very valuable information for future searching, to allow pages to be excluded or included.

There is also a need for better indexing design if we're to be able to find things on the Web, but the principles for defining the guidelines should come from information science, rather than from nerd-lore. ®
