
The search for the perfect search engine

There's a long road still to travel

Analysis There are several unsolved problems with search engines: they do not index all the Web; no single engine is likely to be satisfactory for all searching purposes; and their capabilities are often over-hyped. At least the issue of completeness is improving, with Google claiming to have indexed 1.06 billion Web pages - though that figure is itself somewhat hyped, since only 560 million of those URLs have actually been indexed; the remaining 500 million are links it has recorded but not visited.

The unsolvable problem is the near-impossibility of having a single engine that could perform all types of searches. Sure, hybrid engines could be used, rather as happens with metasearch engines like ProFusion, which combines the results of searches run against a chosen set of engines. But the reality is that with several fundamentally different methods of searching available (and needed), no single engine could be optimised for all searches.

Keyword searching was the first way of finding information and is still perhaps the most important, but its relevance is often low, even if search terms are truncated or logical operators (including nearness) are used. The daddy of all these engines was Dialog, which was developed at Lockheed in the early 1960s as part of a NASA contract. Dialog passed first to Knight Ridder and then to MAID in a highly-leveraged buyout, followed by a fire sale to the Thomson Corporation. Its primary use was for bibliographical reference searching, later extended to full text. Engines like Alta Vista are latter-day derivatives, with some bells like translation as well as some missing whistles like code categories. In many ways, keyword engines have their origins in the likes of card indexes, and as we shall see, there's nothing new under the search engine sun.
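To make the mechanics concrete, here is a minimal sketch - in Python, over an invented three-document corpus - of the sort of matching a keyword engine performs: an inverted index, suffix truncation, and logical operators expressed as set algebra. None of this is any vendor's actual code.

    # Sketch of keyword searching in the Dialog/Alta Vista mould:
    # an inverted index, suffix truncation, and logical operators.
    # Documents and names here are invented for illustration.

    docs = {
        1: "search engines index the web",
        2: "keyword searching with logical operators",
        3: "card indexes were the origin of keyword search",
    }

    # Build the inverted index: term -> set of document ids.
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)

    def lookup(term):
        """Look up a term; a trailing * truncates, matching any suffix."""
        if term.endswith("*"):
            stem = term[:-1]
            return {d for t, ids in index.items() if t.startswith(stem) for d in ids}
        return index.get(term, set())

    # Logical operators reduce to set algebra on the postings.
    print(lookup("keyword") & lookup("search*"))   # AND -> {2, 3}
    print(lookup("web") | lookup("card"))          # OR  -> {1, 3}
    print(lookup("index*") - lookup("web"))        # NOT -> {3}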

A long way to go

There has been little progress with semantic engines that parse text, because this is a very difficult thing to do successfully. The problems include context and the variety of idiom. Progress has been made in speech recognition and automatic translation engines, and a convergence with search engines is just beginning to be discernible. For an analogy of the present situation, think back to the days of dot matrix printers, daisy wheel printers, and the like: the progress yet to be made in the search/speech recognition/translation convergence is like the shift to laser printing. It's a big leap, but it will happen - although just as there will be small, portable inkjet printers into the foreseeable future, so there will remain a split into text search engines and concept searching engines.

Some search engines allow advertisers or fee payers to have their pages displayed preferentially. Engines like Alta Vista use logical operators (also known, incorrectly, as Boolean operators), which many users do not fully exploit. The latest rage is for XML search engines, such as GoXML.com, that can find content marked up with XML.
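For contrast with plain keyword matching, here is a hedged sketch of what an XML engine adds: queries against the structure of the markup rather than the raw text. The document and tags below are invented, and the sketch uses Python's standard xml.etree.ElementTree rather than anything GoXML.com actually runs.

    # Sketch of searching content by its XML markup rather than raw text.
    # The document and tag names are invented for illustration.
    import xml.etree.ElementTree as ET

    doc = ET.fromstring("""
    <article>
      <title>The search for the perfect search engine</title>
      <author>Anonymous</author>
      <body>Keyword engines match text; XML engines match structure.</body>
    </article>
    """)

    # A structural query: find text only inside <title> elements,
    # something a plain keyword engine cannot distinguish from body text.
    for title in doc.iter("title"):
        if "search engine" in title.text:
            print("match in title:", title.text)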

Google uses what it calls PageRank to sequence the results of searches, and is applying for a patent on it. The ranking is based on the number of links to a particular page, with each link weighted by the rank of the linking page. Even with Google's 1.06 billion indexed pages, many no longer exist at their original location, so the cached version that Google offers is often a blessing for the searcher - and, of course, a curse for the page's creator if it was removed because of an error. The ideas are similar to those first used many years earlier in the Science Citation Index and its ilk.
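The scheme is simple enough to sketch. Below is a toy, hedged version of that link-weighted ranking - power iteration with a damping factor over an invented three-page graph - not Google's algorithm as actually deployed.

    # Minimal PageRank-style sketch: a page's rank is the sum of the
    # ranks of the pages linking to it, each divided by that linker's
    # outdegree, with a damping factor. The link graph is invented.

    links = {            # page -> pages it links to
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
    }

    damping = 0.85
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(50):  # power iteration until the ranks settle
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank

    # Higher rank means more heavily linked-to by highly-ranked pages.
    for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))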

San Francisco-based Alexa is highly regarded by its users, while Cambridge, UK-based Autonomy has a similar active Web product called Kenjin, which was released about four months ago - free, but Windows only. It analyses concepts in an active window, gives links to more relevant information, and can sift through local content on disk. Kenjin means wise man in Japanese, but at the moment it's more of a wise boy, since it still has some tricks to learn.

How clever is Autonomy?
Autonomy - founded by Mike Lynch and now worth around $5 billion - has achieved renown for its concept matching capability, which is certainly workmanlike, but the success of the company derives more from a solid approach to productisation and marketing. So far Autonomy has encountered little competition, but its technology is not exactly space-age: it is more historically based. The company has what it rather grandly describes as "intellectual foundations". One of these is Bayesian analysis, used to work out probability relationships, which has been around since the Reverend Thomas Bayes of Tunbridge Wells wrote a pioneering paper on statistical inference; it was published posthumously in 1763, two years after his death.
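Bayes' rule itself fits in a few lines. The sketch below applies it to a made-up retrieval question - how strongly does a single term shift the probability that a document is relevant? - with all the probabilities invented for illustration.

    # Bayes' rule, the 1763 result Autonomy builds on:
    # P(relevant | term) = P(term | relevant) * P(relevant) / P(term).
    # The probabilities below are invented for illustration.

    p_relevant = 0.1             # prior: 10% of documents are relevant
    p_term_given_relevant = 0.8  # the term appears in 80% of relevant docs
    p_term_given_irrelevant = 0.05

    p_term = (p_term_given_relevant * p_relevant
              + p_term_given_irrelevant * (1 - p_relevant))

    posterior = p_term_given_relevant * p_relevant / p_term
    print(round(posterior, 3))  # ~0.64: one term moves the odds a long way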

The other credit is to electrical engineer Claude Shannon, who had an essentially two-bit idea in 1949: that information can be quantified, and that the information content of a word or phrase is inversely related to its frequency of occurrence - and that's about it, because the theory doesn't get you much further.
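The formal statement is that a word's information content is the negative logarithm of its probability, so rare words carry more bits - which is why engines weight rare terms more heavily. A minimal illustration, with invented frequencies:

    # Shannon's measure: the information content of a word is
    # -log2 of its probability, so rare words carry more bits.
    # The frequencies below are invented for illustration.
    import math

    word_freq = {"the": 0.07, "search": 0.001, "kenjin": 0.0000005}

    for word, p in word_freq.items():
        bits = -math.log2(p)
        print(f"{word}: {bits:.1f} bits")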

Although few people have probably realised it yet, the convergence we mentioned results from the Markov chains and transition probability matrices that are used nowadays in speech recognition engines. These were first introduced by Andrei Andreevich Markov, a Russian who went into self-imposed exile in 1905 and died in 1922. The theory is quite simple: in many cases, the probability of an event depends only on the immediately preceding event (or events, if you want greater accuracy and use second, third or higher orders). The events could be the pronunciations of successive syllables of a word or phrase in speech recognition, or the co-occurrence of terms as an aid to concept analysis, for example.
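A first-order chain takes only a few lines to estimate from text. The sketch below builds a transition probability table from a toy corpus; the corpus and all names are invented for illustration.

    # First-order Markov chain sketch: the probability of the next word
    # depends only on the current word. Transition probabilities are
    # estimated from a toy corpus; everything here is illustrative.
    from collections import Counter, defaultdict

    corpus = "the engine indexes the web the engine ranks the web".split()

    counts = defaultdict(Counter)
    for current, nxt in zip(corpus, corpus[1:]):
        counts[current][nxt] += 1

    # Normalise the counts into a transition probability matrix (as dicts).
    transitions = {
        word: {nxt: n / sum(following.values())
               for nxt, n in following.items()}
        for word, following in counts.items()
    }

    print(transitions["the"])     # {'engine': 0.5, 'web': 0.5}
    print(transitions["engine"])  # {'indexes': 0.5, 'ranks': 0.5}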

There is too much hype, and too many extravagant claims are made by search engine vendors. Search engines cannot be all things to all people, and it's a matter of experience to know which is best for a particular requirement. Users who rely on a few favoured Web sites, or just one Web engine, will not be finding all that exists on the Web - or even just the best stuff.

Serious problems remain in Web site design: many Webmasters destroy the date of the original creation of a document and replace it with an automatically-generated date on which the page or site was last supposedly updated. The original date is very valuable information for future searching, to allow pages to be excluded or included.

There is also a need for better indexing design if we're to be able to find things on the Web, but the principles for defining the guidelines should come from information science, rather than from nerd-lore. ®
