Google bug blocks thousands of sites

Choking on spam, noise - 'Bots are 'ready to give up'

Google, like the rest of us, seems to be fighting a losing battle to make sense of a rising tide of Internet garbage. But a programming error by the search engine has compounded the problem by inadvertently blocking thousands of sites from Google users.

It's been called a "Google-NACK": you enter a particular search term and Google tells you that there are thousands of matching results, but fails to return many, or any, of them.

For example, a search for keyboard bracelet returns just five sites out of "about 49,900". (Your mileage may vary, as Google results differ depending on where you are, and which way the Segway scooters are pointing - but it's a fairly typical figure.)

What's happening? Award-winning researcher Seth Finkelstein has a theory. Google's own spam filters, designed to weed out link farms created by pornographers, spammers and Scientologists, are crude, and are blocking many innocent sites.

"Technical solutions may have unintended consequences," he says.

"When Google searches for combinations of terms, pages with the terms close to each other are ranked highly. Such pages are also unfortunately often search spam pages, using a mishmash of keywords. Thus, an unusual combination of words (and a dedicated spammer) will bring spam pages near the top of the results for certain keyword searches."
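Finkelstein's point about proximity can be illustrated with a toy scorer. The sketch below is not Google's algorithm, which is unpublished; it assumes a deliberately naive rule in which a page scores higher the smaller the window of words containing every query term. The function names `min_span` and `proximity_score` and both sample pages are invented for illustration.

```python
import re

def min_span(text, terms):
    """Smallest window, in words, that contains every query term (None if a term is missing)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    wanted = {t.lower() for t in terms}
    # Positions of every word that matches one of the query terms
    hits = [(i, w) for i, w in enumerate(words) if w in wanted]
    best = None
    for start, (first, _) in enumerate(hits):
        seen = set()
        for pos, word in hits[start:]:
            seen.add(word)
            if seen == wanted:
                span = pos - first + 1
                if best is None or span < best:
                    best = span
                break
    return best

def proximity_score(text, terms):
    """Naive score (an assumption, not a real engine's formula): term count divided by window size."""
    span = min_span(text, terms)
    return 0.0 if span is None else len(set(terms)) / span

query = ["keyboard", "bracelet"]
# Hypothetical pages: an ordinary one, and a keyword mishmash
honest = "This keyboard review covers layout and key travel. Unrelated, we also sell a silver bracelet."
spam = "cheap keyboard bracelet casino pills keyboard bracelet loans"

# The mishmash page puts the terms side by side, so it outscores the ordinary page
print(proximity_score(spam, query) > proximity_score(honest, query))  # True
```

Under this rule a page stuffed with adjacent keywords wins on any unusual word pair, which is why a filter aimed at such pages can easily catch innocent ones too.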

Perfect storm

One such spammer is Elwyn Jenkins, a former e-currency evangelist now based in Australia, who touted a pamphlet called "Make Money Online" - which boasted that "Dr. Jenkins has pioneered a unique approach to using Google and blogs to build traffic." Jenkins ran a link farm spanning the domains www.microdoc-news.info, www.microdocs-news.info, smoogle.info, googlevillage.info, blogging-news.info, googlology.info, microdoc.bloki.com, www.question-factory.com, meeting-mentor.blogspot.com, radio.weblogs.com/0111745, verityintellectualproperties.com, textchunk.info, personalbrain.info, technacy.info, verity-ip.com, bloggers-news.info and ...

well, you get the picture. His Googlephilia was returned in kind by bloggers, who pumped up his PageRank™ (PageRank™'s fatal flaw was incestuous linking) by linking to him approvingly, so creating a perfect storm - and an almighty headache - for Google's algorithm overlords.
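Why incestuous linking pays can be seen in a toy version of the published PageRank formula. The sketch below uses the textbook power-iteration form with a damping factor of 0.85; it makes no claim about what Google runs in production, and the page names (`hub`, `target`, `rival`, `farm0` and so on) are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over a dict of page -> list of outbound links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share
        new = {p: (1 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outbound links spreads its rank evenly
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank

# Two pages in identical positions: one inbound link each, from the same hub
base = {"hub": ["target", "rival"], "target": ["hub"], "rival": ["hub"]}

# The same web, plus five farm pages that exist only to link to "target"
farmed = dict(base)
for i in range(5):
    farmed[f"farm{i}"] = ["target"]

before = pagerank(base)
after = pagerank(farmed)
print(abs(before["target"] - before["rival"]) < 1e-12)  # True: the two start level
print(after["target"] > after["rival"])                 # True: the farm lifts only its owner
```

The farm pages have no inbound links of their own, yet each still passes along its random-jump share, so pages that do nothing but link to one another are enough to tilt the result.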

The term GoogleNACK ('Negative ACKnowledgement') was coined by Gary Stock, CTO of Nexcerpt, a web clipping service that monitors thousands of news sources. Stock, who also coined the phrase Googlewhack, has shared his research with Google.

In an effort to weed out the noise, Google constantly refines its weighting algorithm, which it says is a combination of a hundred different factors. In an attempt to thwart deliberate gaming by link farms and blog noise (exacerbated by lossy software gimmicks such as 'Trackbacks', which generate reams of content-free pages for Google's crawlers), Google has stepped back from its trademarked PageRank™ method and has instead emphasized more traditional factors such as anchor text.

"I'd say the people to *whack* here are those search-spammers who are causing the problem and requiring Google's defense," says Finkelstein.

But all factors, once known, are susceptible to gaming, and perhaps no one search engine can ever hope to win an arms race against unscrupulous and determined spammers. Although calls have increased for Google to be regulated, perhaps the best defense is simply common sense: other search engines deliver surprising results that Google can't, and a wise browser will use a combination of tools. It certainly helps to shop around.

Or ask a librarian. ®
