
Google to fix blog noise problem

-noblog

Google is to create a search tool specifically for weblogs, most likely giving material generated by the self-publishing tools its own tab.

CEO Eric Schmidt made the announcement on Monday, at the JP Morgan Technology and Telecom conference. 'Soon the company will also offer a service for searching Web logs, known as "blogs,"' reported Reuters.

It isn't clear if weblogs will be removed from the main search results, but precedent suggests they will be. After Google acquired the Usenet archive from Deja.com, it developed a unique user interface and a refined search engine, and removed the groups from the main index. After a sticky start, Usenet veterans welcomed the new interface. Google recently acquired Blogger, and sources suggest a similar separation is the most likely option for weblogs.

Bloggers too are likely to welcome their very own tab as a legitimization of the publishing format. But many others will breathe a sigh of relief as blogs disappear from the main index.

"I just want a search engine that works," laments Chris Roddy, a politics and linguistics undergraduate at Emory University.

"I can get a Google search with porn turned off; why can't I get blogs turned off too?" he asked on Slashdot.

Google has striven in vain to maintain the quality of its search results in the face of a blizzard of links generated by a small number of sources. (Google searches 3,083,324,652 pages as of 4PM PT today. Assuming there are one million bloggers, and generously assuming they have a hundred pages each, that amounts to about 3.2 per cent of web content indexed by Google. Recent research by Pew put the number of blog readers, as opposed to writers, as "statistically insignificant".)
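
The back-of-envelope estimate above can be checked in a few lines. This is only a sketch using the figures quoted in the paragraph; the blogger count and the pages-per-blogger number are the article's own assumptions, not measurements. Worked through, the share comes to a fraction of about 0.032, which is roughly 3.2 per cent of the index.

```python
# Figures quoted in the article. The blogger count and pages-per-blogger
# values are the article's assumptions, not measured data.
indexed_pages = 3_083_324_652
bloggers = 1_000_000
pages_per_blogger = 100

blog_pages = bloggers * pages_per_blogger   # total pages attributed to blogs
share = blog_pages / indexed_pages          # share of the index, as a fraction
share_percent = share * 100                 # the same share, as a percentage

print(f"{share:.3f} as a fraction, {share_percent:.1f} per cent")
```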

However, through dense and incestuous linking, results from blogs can drown out other sources.

"The main problem with blogs is that, as far as Google is concerned, they masquerade as useful information when all they contain is idle chatter," wrote Roddy. "And through some fluke of their evil software, they seem to get indexed really fast, so when a major political or social event happens, Google is noised to the brim with blogs and you have to start at result number 40 or so before you get past the blogs." We'd noticed.

"Taking Usenet out of the general search was great, because it is not really interfering with general Internet searching," Roddy told us. "Usenet was a public forum in the first place."

A Slashdot discussion prompted a suggestion that Google add a -noblog option, which it effectively appears to be introducing by default.

Gary Stock, chief technology officer for Nexcerpt, Inc., agrees.

"A year or two ago you could hit 'I'm Feeling Lucky' and there was a good chance that you could find a good and authoritative page," he told us.

"It is less the case today. More and more people have more text to type, and may not have anything authoritative to say - they just throw up characters on the screen."

He says that the link-based algorithm called PageRank™ was designed, at Stanford University, with very different assumptions about the quality of information.

"They didn't foresee a tightly-bound body of writers," reckons Stock. "They presumed that technicians at USC would link to the best papers from MIT, to the best local sites from a land trust or a river study - rather than a clique, a small group of people writing about each other constantly. They obviously bump the rankings system in a way for which it wasn't prepared."
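
The effect Stock describes can be illustrated with a toy link graph. What follows is a minimal sketch of the simplified, textbook form of PageRank (power iteration with a damping factor), not Google's production system. The page names, the 0.85 damping value and the five-page graph are all hypothetical, chosen only to show how a small group of pages linking to each other accumulates score.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Pages with no outgoing links share their score evenly with every page.
    """
    nodes = list(links)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by pages with no outgoing links, spread uniformly.
        dangling = sum(ranks[node] for node in nodes if not links[node])
        new_ranks = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        # Each page splits its score equally among the pages it links to.
        for node in nodes:
            targets = links[node]
            for target in targets:
                new_ranks[target] += damping * ranks[node] / len(targets)
        ranks = new_ranks
    return ranks


# Hypothetical graph: one outside link to a primary source, and a
# three-page clique whose members link only to each other.
links = {
    "usc": ["primary_source"],
    "primary_source": [],
    "blog_a": ["blog_b", "blog_c"],
    "blog_b": ["blog_a", "blog_c"],
    "blog_c": ["blog_a", "blog_b"],
}

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:15s} {score:.3f}")
```

In this toy graph each clique page ends up with a higher score than the primary source, even though nothing outside the clique links to it. Real rankings depend on many more signals than this sketch models.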

Information Quality

For Stock and Roddy, the problem is that the resulting degradation in the quality of information makes it even harder to find primary source material. Roddy said the realization came after searching through 500 blog entries to find a primary source.

Exacerbating the problem, says Stock - who devised 'Googlewhacking', or the art of producing a search query that returns just one result - is the frequency with which the sites are indexed.

"If they are really spidering all 3 billion pages, then they must have changed some law of physics," he explains.

"Someone has made a choice whether to go to a site every hour or every three years. That begs the question - if I know something to be a high traffic site and I train my robots to visit often, do I discount it when I feed my information to PageRank?"

He offers a hypothetical example.

"Suppose turtle-rescue.org has authoritative information about turtles. And it changes every month. Then BoingBoing puts up a page about turtles and that becomes a big deal.

"Each of us gets a vote," jokes Stock. "And someone votes every day and I vote once every four years."

"The blogs push very quickly up to the top of the search results."

Databases

"To me the power of what Dave Winer and Ev Williams have done, and it's great, is that I can easily publish ResourceShelf in seconds, giving me time to do other things," says respected author and librarian Gary Price. Price doesn't regard his site as a weblog, even though he uses Blogger tools, now owned by Google. Price co-authored The Invisible Web, a guide to little-known public resources on the Internet.

"But what happens when the weblog fad dies down?" he asks.

"The public think that they can put 2.1 words into Google and the best answer will appear; they don't ask how long it is taking them to get it. For the average person it's very good, but there are choices out there, and a lot of people aren't aware of them and don't know."

"You have to realize there are other information sources, and that information costs money."

"This is why New York Public Libraries has a sign 'Here's where you find the stuff that isn't in Google.' and much of this is publicly accessible," Price points out.

Or as Seth Finkelstein reminds us, "Google is good, but not God."

(We'll follow up soon on what those sources are, and how to get at them.)

Unearned Reputations

Ironically, the low information quality of blog-infested Google results is a consequence of bloggers' attempts to introduce community aspects to what remains a solitary activity. The auto-citation feature 'Trackback' is frequently fingered as the culprit: many search results Google returns are trackbacks.

And yet dealing with Trackback noise can be as much an opportunity as a challenge for Google's user interface designers. Just as the standalone Usenet tab allowed sophisticated metadata searching and threading, so could the Google Blog tab.

Granting roughly three per cent of the web its own Google tab might rankle with some, but others could argue it produces the best of both worlds, for general Search users and webloggers.

One group is likely to protest long and hard, however: people who have taken advantage of this quirk to use Google as their primary promotion channel or reputation creator. Folk whose reputations were forged before the dawn of the blogroll will not be affected and need not worry, but the reaction of the rest may be predictable.

It's a bit like challenging a monarch with the viability of the hereditary principle: you can guess what they'll say.

Just as one-man one-vote democracy terrifies the bejesus out of some people, so surely will a fairer Google. ®
