Search pioneers join Yahoo! - but is the web beyond search?

Clever move - too late?

Few visitors to IBM's Almaden research lab in 1999 and 2000 can fail to have been impressed by its lead in web search. IBM's Clever project both predated and informed what became Google: Brin and Page cited the Almaden work in their 1998 paper The Anatomy of a Large-Scale Hypertextual Web Search Engine. Google drew on the same concept, which it was to trademark and market as PageRank™: using the link structure of the web to infer quality and authority.
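
For readers who have not met the idea, here is a minimal sketch of how link structure can be turned into a score. It is a simplified PageRank-style power iteration, not Google's production system and not the Clever team's own algorithm. The function name link_rank, the toy four-page graph, and the iteration count are all assumptions made for illustration; the damping value of 0.85 is a commonly cited default.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by who links to them (simplified PageRank-style sketch).

    links maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly across all pages.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share on each round.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages point at "c", and "c" points back at "a".
toy_web = {"a": ["c"], "b": ["c"], "d": ["c"], "c": ["a"]}
ranks = link_rank(toy_web)
top_page = max(ranks, key=ranks.get)
```

In this toy graph the most linked-to page comes out on top, and the page it links to inherits much of that standing. That is the intuition behind inferring authority from links, and it also hints at why manufacturing links became worth a spammer's time.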

But the Clever team was already thinking way beyond PageRank™. Your reporter was one such visitor more than five years ago and was struck by the scope and depth of the work. For example, in 1998 the Clever team was publishing its research into hierarchical topic taxonomies, and inferring web communities. Today, such subjects are presented to conferences of former HTML coders (today's wiki-fiddlers) who appear to be hearing the topics for the first time, such is their wide-eyed wonderment.

Working within IBM also allowed the team to draw on its rich history of database research and linguistic analysis, and at IBM you try not to lose your customers' data.

Google's fate is well known. After last year's IPO it became one of the wealthiest technology companies on the planet, and its founders are billionaires.

And Clever?

Well, IBM appeared to have some inkling that the project was valuable to it. A spin-off was discussed, but never followed through, and IBM officially welcomed licensees at one stage. But Clever was never allowed the opportunity to compete directly with the commercial search rivals, so we never really saw its potential.

Clever's trajectory in some ways mirrors that of IBM's relational database work. With its System R project, IBM had built the first implementation of the relational database in the early 70s, but bureaucratic infighting hampered the researchers' desire to turn it into a real product for IBM's customers. Ingres was first to get an RDBMS out of the door and Oracle's single-minded marketing won it big inroads into the new market in the 1980s.

"We were convinced IBM would never ship," Jim Gray later recalled (in one of the best oral histories of a computer project on the net).

Now, however, Yahoo! has hired several of the Clever team and plans to recruit more.

Last week the New York Times reported that Prabhakar Raghavan, one-time project leader, had been recruited from Verity, where he was chief scientist and CTO. Another staffer, Andrew Tomkins, is also on his way to Yahoo!, the Times reported.

These guys have their work cut out.

Web chaff beyond sorting?

"The World Wide Web of today is dramatically different from that of just five years ago," the team noted in 1999. "Predicting what it will be like in another five years (2004) seems futile. Will even the basic act of indexing the Web soon become infeasible?"

For a few years, it looked an improbably pessimistic question. But pessimists make the best engineers in the long run, and this now seems prescient.

Google's link-based algorithms were soon imitated by rivals, and as a consequence all of today's search engines must now mine a web stuffed with synthetic documents of little relevance to anyone, many of which are generated by machines on behalf of the customers of the more unscrupulous SEOs (Search Engine Optimizers).

It's an algorithm arms race, and the SEOs themselves know the scale of the problem they nurtured. Some estimate that as much as a third of the web consists of fake, machine-generated pages, and Google can't really tell which third it is. Meanwhile, neither Yahoo!, Google nor MSN can yet offer one of the most basic features AltaVista had in 1996: queries sorted by date. Want a listing of Tony Blair's comments about Iraq published between June and August 2003? Forget it. AltaVista could do this then, and still can, but none of the big three can match this most basic of requests.

Because rigging the search engines is so profitable, the junk web, or "Web 2.0" as it's called, proliferates and mutates like a superbug. Each new solution to the problem is quickly co-opted by spammers and gamers. For example, last year's "tagging" craze is becoming this year's mortgage and Viagra scam.

Some maintain the web's problems can't be solved technically, but only politically or economically, for example by the application of compensation models which allow the really good data hoarded by database holders to be opened to the public at last. That may prove to be true: there are many flavors of private and public networks, we use a mixture every day, and that mixture will change over time.

The reassembled Clever team at Yahoo! may not even be offered a chance to answer the question.

The Times reports that the team itself is being directed to searching digital media, and hints that some areas of their earlier work remain IBM's intellectual property.

By some irony, we note that one of Sergey Brin's student projects was also searching digital media, only as a kind of RIAA enforcer. The system he developed was for the "automated detection of copyright violations", and was unfortunately called COPS (the COpyright Protection System). Fortunately, Sergey was more interested in developing a general purpose data mining application.

Would he make the same choice today?

Surely something must be done to renew the original raison d'etre behind both Google and Yahoo! - finding good stuff. The world in which an "I'm Feeling Lucky" button was even conceivable seems to belong to a distant past.

Google would rather sell you a shirt on Froogle, and Yahoo! would rather show you the way to the Coliseum, offering you a package tour that includes the ticket admission. And the former search leader's priorities seem to be elsewhere. In recent months Google has patented a widely used business method and beefed up its DC lobbying muscle, and last week's legal dispute over the hiring of a "search expert" by Google from Microsoft sounded thoroughly phoney and synthetic on both sides.

The Clever team that Yahoo! is reassembling are the genuine article. Perhaps if the management permits them, they'll be able to answer the question -

Whatever happened to search?®
