Google to fix blog noise problem

-noblog

Google is to create a search tool specifically for weblogs, most likely giving material generated by the self-publishing tools its own tab.

CEO Eric Schmidt made the announcement on Monday, at the JP Morgan Technology and Telecom conference. 'Soon the company will also offer a service for searching Web logs, known as "blogs,"' reported Reuters.

It isn't clear whether weblogs will be removed from the main search results, but precedent suggests they will be. After Google acquired the Usenet groups archive from Deja.com, it developed a unique user interface and a refined search engine, and removed the groups from the main index. After a sticky start, Usenet veterans welcomed the new interface. Google recently acquired Blogger, and sources suggest a similar treatment for weblogs is the most likely outcome.

Bloggers too are likely to welcome their very own tab as a legitimization of the publishing format. But many others will breathe a sigh of relief as blogs disappear from the main index.

"I just want a search engine that works," laments Chris Roddy, a politics and linguistics undergraduate at the University of Emory.

"I can get a Google search with porn turned off; why can't I get blogs turned off too?" he asked on Slashdot.

Google has striven in vain to maintain the quality of its search results in the face of a blizzard of links generated by a small number of sources. (Google searches 3,083,324,652 pages as of 4PM PT today. Assuming there are one million bloggers, and generously assuming they have a hundred pages each, that amounts to roughly 3.2 per cent of the web content indexed by Google. Recent research by Pew put the number of blog readers, as opposed to writers, at "statistically insignificant".)
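For those checking our arithmetic, the back-of-the-envelope sum runs as follows (the blogger count and pages-per-blogger figures are assumptions, not measurements):

```python
# Back-of-the-envelope check of the figures above (the blogger count and
# pages-per-blogger are the assumptions stated in the text, not measurements).
indexed_pages = 3_083_324_652    # pages Google reports indexing
bloggers = 1_000_000             # assumed number of bloggers
pages_per_blogger = 100          # generous assumption

blog_share = bloggers * pages_per_blogger / indexed_pages
print(f"{blog_share:.1%}")       # -> 3.2%
```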

However, through dense and incestuous linking, results from blogs can drown out other sources.

"The main problem with blogs is that, as far as Google is concerned, they masquerade as useful information when all they contain is idle chatter," wrote Roddy. "And through some fluke of their evil software, they seem to get indexed really fast, so when a major political or social event happens, Google is noised to the brim with blogs and you have to start at result number 40 or so before you get past the blogs." We'd noticed.

"Taking Usenet out of the general search was great, because it is not really interfering with general Internet searching," Roddy told us. "Usenet was a public forum in the first place."

A Slashdot discussion prompted a suggestion that Google add a -noblog option, which it effectively appears to be introducing by default.

Gary Stock, chief technology officer of Nexcerpt, Inc., agrees.

"A year or two ago you could hit 'I'm Feeling Lucky' and there was a good chance that you could find a good and authoritative page," he told us.

"It is less the case today. More and more people have more text to type, and may not have anything authoritative to say - they just throw up characters on the screen."

He says that the link-based PageRank™ algorithm was designed at Stanford University with very different assumptions about the quality of information.

"They didn't foresee a tightly-bound body of wirers," reckons Stock. "They presumed that technicians at USC would link to the best papers from MIT, to the best local sites from a land trust or a river study - rather than a clique, a small group of people writing about each other constantly. They obviously bump the rankings system in a way for which it wasn't prepared."

Information Quality

For Stock and Roddy, the problem is that the resulting degradation in the quality of information makes it even harder to find primary source material. Roddy said the realization came after searching through 500 blog entries to find a primary source.

Exacerbating the problem, says Stock - who devised 'Googlewhacking', or the art of producing a search query that returns just one result - is the frequency with which the sites are indexed.

"If they are really spidering all 3 billion pages, then they must have changed some law of physics," he explains.

"Someone has made a choice whether to go to a site ever hour or every three years. That begs the question - if I know something to be a high traffic site and I train my robots to visit often, do I discount it when I feed my information to PageRank?"

He offers a hypothetical example.

"Suppose turtle-rescue.org has authoritative information about turtles. And it changes every month. Then BoingBoing puts up a page about turtles and that becomes a big deal.

"Each of us gets vote," jokes Stock. "And someone votes every day and I vote once every four years."

"The blogs push up very quickly up to the top of the search results."

Databases

"To me the power of what Dave Winer and Ev Williams have done, and it's great, is that I can easily publish ResourceShelf in seconds, giving me time to do other things," says respected author and librarian Gary Price. Price doesn't regard his site as a weblog, even though he uses Blogger tools, now owned by Google. Price co-authored The Invisible Web, a guide to little-known about public resources on the Internet [Amazon - review].

"But what happens when the weblog fad dies down?" he asks.

"The public think that they can put 2.1 words into Google and the best answer will appear, they don't ask how long is it taking them to get it. For the average person - its very good, but there are choices out there; and a lot of people aren't aware of them and don't know."

"You have to realize there are other information sources, and that information costs money."

"This is why New York Public Libraries has a sign 'Here's where you find the stuff that isn't in Google.' and much of this is publicly accessible," Price points out.

Or as Seth Finkelstein reminds us, "Google is good, but not God."

(We'll follow up on what that is, and how to get it, soon.)

Unearned Reputations

Ironically, the low information quality of blog-infested Google results is a consequence of bloggers' attempts to introduce community aspects to what remains a solitary activity. The auto-citation feature 'Trackback' is frequently fingered as the culprit: many search results Google returns are trackbacks.

And yet dealing with Trackback noise can be as much an opportunity as a challenge for Google's user interface designers. Just as the standalone Usenet tab allowed sophisticated metadata searching and threading, so could the Google Blog tab.

Granting roughly 3 per cent of the web its own Google tab might rankle with some, but others could argue it produces the best of both worlds for general search users and webloggers alike.

One group is likely to protest long and hard, however: people who have taken advantage of this quirk to use Google as their primary promotion channel or reputation creator. Folk whose reputations were forged before the dawn of the blogroll will not be affected, and need not worry; for the rest, the reaction may be predictable.

It's a bit like challenging a monarch on the viability of the hereditary principle: you can guess what they'll say.

Just as one-man one-vote democracy terrifies the bejesus out of some people, so surely will a fairer Google. ®
