Search engines stink, and they're getting worse

An academic report finds them all sadly wanting


The proportion of information on the Internet that is indexed by search engines is declining, according to a recent study by Steve Lawrence and Lee Giles of the NEC Research Institute at Princeton, reported in Nature. The engines do not index sites equally, and new pages may remain unindexed for months. Worst of all, even the best engine reaches only 16 per cent of the Web. The survey was carried out in February, and the situation is getting worse: in December 1997 around 34 per cent of information was indexed.
The problem is multifaceted. The Web has around 15 terabytes of data in some 800 million pages, plus 180 million images, and is growing faster than the ability of the engines to search it. The growth rate is about 3 million pages a day. There is apparently no coordination between search engine operators that would let the cognoscenti turn to a particular engine for a particular subject area. The dark side is, of course, that many pages that have made it to a search engine disappear without trace.
So how well are the best-known engines doing? Dismally, is the answer. The best, according to Lawrence and Giles, is Northern Light, which covers a mere 16 per cent of the Web, just pipping Alta Vista's 15.5 per cent (although that 0.5 per cent difference adds up to around 4 million pages). Microsoft can only manage 8.5 per cent, Yahoo 7.4 per cent and Excite 5.6 per cent, and Lycos is the dunce at 2.5 per cent. This should make people think twice about defaulting to the engines on portals. Of course, enlightened searchers use meta-engines that query several engines and combine the results, but we have found that each has its inconveniences and idiosyncrasies.
The researchers found that 83 per cent of web sites have commercial content, with only a vociferous 1.5 per cent of sites being pornographic: they, at least, have found the secrets of tweaking their sites to get them indexed, it would appear.
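To put the coverage percentages quoted above into absolute terms, here is a minimal sketch that converts each engine's share into an approximate page count. It assumes the study's estimate of 800 million pages as the total; the names `TOTAL_PAGES`, `COVERAGE_PERCENT`, `pages_indexed` and `coverage_table` are illustrative and do not come from the study.

```python
# Figures quoted in the article; the page total is the study's estimate.
TOTAL_PAGES = 800_000_000

# Coverage per engine, in per cent of the Web, as reported in the article.
COVERAGE_PERCENT = {
    "Northern Light": 16.0,
    "Alta Vista": 15.5,
    "Microsoft": 8.5,
    "Yahoo": 7.4,
    "Excite": 5.6,
    "Lycos": 2.5,
}


def pages_indexed(percent: float, total: int = TOTAL_PAGES) -> int:
    """Convert a coverage percentage into an approximate page count."""
    return round(total * percent / 100)


def coverage_table() -> dict:
    """Approximate number of pages indexed by each engine."""
    return {name: pages_indexed(pct) for name, pct in COVERAGE_PERCENT.items()}


if __name__ == "__main__":
    for name, pages in coverage_table().items():
        print(f"{name:15s} {pages / 1e6:6.1f} million pages")
    # The half-point gap between the two leaders, expressed in pages.
    gap = pages_indexed(16.0) - pages_indexed(15.5)
    print(f"Gap between the top two: {gap / 1e6:.0f} million pages")
```

On these figures the half-point gap between Northern Light and Alta Vista works out to 4 million pages, which matches the number given in the article.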
It looks as though archivists will not be out of a job for a long time, in view of this failure of the Web as a reliable and comprehensive online library. It reminds us of the persistent story that the French Bibliotheque Nationale used to store its books by size and colour in its old building. That's effectively what's happening on the Web: we don't know how much is unindexed, because it is hard to study the overlap between engines, but the odds are that half the information on the Web cannot be found with search engines at all. ®
