Pimplier Batgirls and Sawdusty Barmen

Whacking Google's wordlist woes

Earlier this week the NCSA released a study that attempted to compare the respective merits of Google and Yahoo!'s search engines. (See My spam-filled search index is bigger than yours!). Unfortunately, the only thing it proved was which search engine was publishing the most gibberish it had collected - a fact apparently lost on the researchers. The three academics insisted that because Google was returning more gibberish, it must be doing a better job.

Doh!

The phenomenon, we discover, is relatively recent, and it's an unintended consequence of both search engines trying to make their searches more comprehensive. The trouble is that Google is returning pages which are nothing but great long lists of words as valid search results, when rarely, if ever, is this what the searcher is looking for. Unless you have a thing for strange combinations of words.

But strangely enough, some people do have a thing for strange combinations of words - and this week, it's these very hobbyists who have been able to shed more light on the search giants' internal operations than the academics. It's Google Whacking - the art of finding two words that produce just a single result from the search engine - which comedy writer Dave Gorman turned into a book and a stage show. You can see a list of the most recently discovered Google Whacks here. Current Whacks at the time of writing include "rhubarb underkill", "oxymoronically flakier", "overpaid brainworkers" - somewhat surprisingly - and, to our relief, there's only one web page in the world with the words "subhuman stepsiblings" close enough together to merit a hit.

Don't expect these Whacks to work for long - as soon as they're repeated on a public forum, and Google finds them, they cease to be Whacks. Today's most delightful Whacks, like "pimplier batgirls" and "sawdusty barmen", are sure to be gone tomorrow.

Search guru Gary Stock, CTO of Nexcerpt, who coined the phrase, says things have been very strange indeed recently in the world of Google and Yahoo!

What the Whackers have highlighted is how the major search engines are struggling to cope with word lists - like this one - with Yahoo! doing a rather better job than Google. It's a consequence, he says, of both search engines removing the cap from the length of documents that they index.

You may have marvelled at how so many web pages appeared to be precisely 101kb long. Well, for some years Google only indexed the first 101kb of a web page, and that's all you could see. A year ago Yahoo! lifted the cap, and more recently Google followed suit. The result has been to open the door to more gibberish.

"The contribution was distinct and dramatic," says Stock. "Googlewhackers were commenting on the change well before it was noted in most search forums. We're exposed to a lot of raw, peculiar stuff around the edges."

Google was already pretty good at recognizing duplicates, he points out. But not quite good enough. Surely this isn't rocket science to fix, we wondered?

"Google has the advantage of knowing precisely what is in their corpus, and being able to call upon vast research and statistical data about what is - and what is not - legitimate text. Wordlists look exactly like … lists of words!

"Here's a clue for low hanging fruit: legitimate text won't include a sequence of 1,000 words in alphabetical order - with no punctuation!"

"If Google wants to exclude other machine-generated texts (search 'hipcrime' or 'sporgeries' for many scary instances), that'll require a bit more effort. Again though, they have plenty of data to create useful profiles, and plenty of CPU to apply them."

"If Googlewhack's code can identify wordlists with a high degree of accuracy, then Google could readily make it part of The Machine."

That's a point made by reader Jeremy Pickens, although it may sound counter-intuitive on the first pass.

"Just because a page never gets returned in the top 1,000 doesn't mean there is no use in having that page in your index," he writes. "For example, if Yahoo! were to do some sort of statistical or structural analysis of those additional 10 billion spam pages, they might be able to use all that information to better detect/classify search spam in the future."

"So, kudos to Yahoo! for actually taking the disk space to store all that extra information, instead of just throwing it away."

Reader David notes, "The bottom line seems to be that Google is better at including obscure pages or Yahoo is better at rejecting useless pages and the NCSA excels at keeping incompetent PhDs funded."®

Bootnote: Martin Torzewski earns himself the last word with this thought. "Your reference to Borges prompted the thought that his story 'The Library of Babel' might be relevant here!" Too true, alas.
