My spam-filled search index is bigger than yours!

NCSA study backfires

Last week Yahoo! claimed it had sailed past Google by indexing 20 billion web pages.

Because as much as a third of the wild wild web consists of artificially-generated pages of spam designed to promote commercial web sites, this isn't much to boast about. Many of the fake pages are 'splogs', or spam blogs, or phoney catalogs, or simply pages of dictionary words. You can meet one of the perps here, in our story Interview with a link spammer.

And because few users have the patience to find the gem returned at position #12,711 in the search results, the size of an index is meaningless.

More is never a substitute for better.

Unofficial Google spokesperson John Battelle has disputed the claim, and now the supercomputer center NCSA has poured cold water on the boast, too.

In a sample comparison, three researchers at the University of Illinois at Urbana-Champaign found that Google returned more search results than Yahoo! They conclude:

"It is the opinion of this study that Yahoo!'s claim to have a web index of over twice as many documents as Google's index is suspicious. Unless a large number of the documents Yahoo! has indexed are not yet available to its search engine, we find it puzzling that Yahoo!'s search engine consistently returned fewer results than Google."

A bold claim, for sure. There's a problem, however - NCSA's results are a little too incredible.

As Seth Finkelstein points out on his weblog, the NCSA researchers fed random dictionary words into both search engines - and, not surprisingly, the results reflect pages containing random dictionary words.

"By sampling random words, they biased the samples to files of large word lists! And this effect applies, to a great or lesser extent, to every sample."

Here's an example. For the spam-friendly gibberish words "carbolization clambers", Google returned 7 pages, all from a dictionary, and Yahoo! returned none. For the words "alkaloid's observance", Google returned 30 pages and Yahoo! none.

In other words, the methodology is geared not to measure who has the most useful documents, but who has the most spam. To be more precise, in these examples, Google returns a number of copies of a dictionary file. It's a different frequency of noise.

"Just to show the problem, imagine that Google had returned results of three dictionary files, and Yahoo one dictionary file," explains Finkelstein. "Do this *10,000* times, and you get Google returning 30,000 results, and Yahoo! returning 10,000 results. So, wow, Google has 3x the size! But, in fact, it's just the same little quirk being counted 10,000 times over!"
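Finkelstein's arithmetic can be spelled out in a few lines of Python. This is only a sketch of his hypothetical, not the NCSA methodology: the per-query counts (three dictionary files for Google, one for Yahoo!) and the 10,000 queries come from his example, while the function and variable names are invented here for illustration.

```python
# Sketch of Finkelstein's hypothetical: every sampled query matches nothing
# but copies of the same dictionary file. All names here are illustrative.

def total_results(hits_per_query: int, num_queries: int) -> int:
    """Add up the result counts reported across a run of sampled queries."""
    return sum(hits_per_query for _ in range(num_queries))

NUM_QUERIES = 10_000           # "Do this *10,000* times"
GOOGLE_DICTIONARY_COPIES = 3   # assumed: Google holds three copies of the file
YAHOO_DICTIONARY_COPIES = 1    # assumed: Yahoo! holds one copy

google_total = total_results(GOOGLE_DICTIONARY_COPIES, NUM_QUERIES)
yahoo_total = total_results(YAHOO_DICTIONARY_COPIES, NUM_QUERIES)
ratio = google_total / yahoo_total

print(google_total, yahoo_total, ratio)  # 30000 10000 3.0
```

The 3x ratio comes entirely from two extra copies of one file. Adding more queries makes both totals bigger but tells you nothing more about either index.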

The intrepid NCSA researchers appear to be blissfully unaware that they've marched into a swamp.

"We feel that small, randomly selected search queries gives us the best chance to locate some of the most obscure web documents," they insist. Nevertheless, they plough bravely on.

"By counting the presence of these obscure documents in either search engine, we can measure the comprehensiveness of each search engine to determine the relative size of each search engine's index," they maintain.

Oh, dear.

The Argentine writer Borges described the Falklands War as "two bald men fighting over a comb", and it's an apt description for the spectacle of the search engines, and their rival supporters, duking it out to boast who has collected the most garbage. ®
