Teoma preps relaunch, wants to be Google-beater

From quantity to quality

ComputerWire: IT Industry Intelligence

Six months after its acquisition by Ask Jeeves Inc, Teoma Technologies Inc is about to relaunch its web site, home to a search engine it hopes will eventually overtake Google in usage. But the firm has some way to go before that is achievable, and admits its offering is not fully finished.

Paul Gardi, who founded Teoma and is now VP of search technology, said recent statistics prove Teoma is well-liked by searchers. When Teoma web search results started supplementing Ask Jeeves' results (replacing DirectHit's hits), Ask.com clickthrough rates went up 25%.

But the company also has high hopes for its standalone Google-alike site, Teoma.com, as well as the potential for providing search services to unaffiliated portals. Revenue generating features such as paid URL submission and sponsored links (courtesy of Overture Services Inc) are already live on the beta site.

"We think about Ask Jeeves as a customer," said Gardi. "They're our first customer for syndicated search." In the third quarter, Teoma will start offering web search to other portals as a hosted service, moving into a market dominated by Google Inc, Inktomi Corp and others.

Teoma's special sauce is its ability to identify, on the fly, "communities" of interlinked web sites, which in turn help categorize results by topic areas rather than just keywords. The idea, sometimes known as "Kleinberg theory", uses a system of "hubs" and "authorities" to identify communities.

"Pages tend to point at one page more than any other page," said Gardi.

Authorities, which appear as the regular search hits in Teoma, are ranked according to scores generated by text analysis, link analysis and, soon, popularity (courtesy of clickthrough data from sister company DirectHit).

This is fairly similar to existing search technologies. Hubs, to be known as "resources" in the relaunched service, are a little different. Hubs are essentially sites containing topic-specific link lists, where scores of relevant pages are linked. Teoma finds hubs by looking to see which sites link to lots of authorities.

Keywords still play a part. For example, if you search for "Britney Spears", the first authority is britneyspears.com. Teoma realizes, on the fly, that the phrase is quite often accompanied by the word "pictures". You get the option to refine your search to "Britney Spears Pictures", which brings up a whole new community of sites, with worldofbritney.com as the leading authority.
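The refinement step can be illustrated with a toy co-occurrence count. This is only a sketch under stated assumptions, not Teoma's actual method, which the company has not detailed here: the function name `suggest_refinements`, the sample `documents` and the simple word-splitting are all hypothetical.

```python
from collections import Counter

def suggest_refinements(query, documents, top_n=3):
    """Return the words that most often accompany the query phrase.

    Hypothetical helper for illustration only: it counts, per document,
    each extra word appearing alongside the exact query phrase.
    """
    query_terms = query.lower().split()
    phrase = " ".join(query_terms)
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        if phrase not in lowered:
            continue  # only documents containing the phrase count
        # Count each accompanying word once per document.
        counts.update(set(lowered.split()) - set(query_terms))
    return [word for word, _ in counts.most_common(top_n)]

# Made-up sample data, not real search results.
documents = [
    "Britney Spears pictures gallery",
    "Britney Spears pictures and news",
    "Britney Spears tour dates",
    "Golf club reviews",
]

print(suggest_refinements("Britney Spears", documents, top_n=1))
```

A real system would at least need to filter out common words and restrict counting to pages within the relevant community, which this sketch does not attempt.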

But it's not a perfect system. The same pop starlet search brings up "Free Email" as an option to refine your search, and Justin Timberlake doesn't get a look in.

"In my mind we're not completely out of beta," said Gardi. "The dataset needs to be of a size that represents the web." Currently, Teoma is aware of about 900 million URLs, crawls about 400 million, but cuts out duplicates and spam to end up with a dataset of about 200 million fully indexed pages (only double the size it was last July when the service launched).

By comparison, Google has over 2 billion URLs, of which not all are full-text indexed. Gardi said Teoma's ideal index size will be between 350 million and 500 million pages. "There are not 2 billion useful pages on the web," he said.

"The conversation needs to move from quantity to quality."

The technology is based on the proven assumption that the web is not fully interconnected. With the possible exception of catch-all sites such as Yahoo!, sites tend to link to their own kind. Golf sites are unlikely to link to football sites, but they will link to other golf sites. The more golf sites that link to golf site X, the more likely X is to be an authority. The more golf sites that site Y links to, the more likely Y is to be a golf hub.
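The hub and authority idea can be sketched with the iterative scoring scheme usually associated with Kleinberg's work. The code below is a minimal illustration, not Teoma's implementation: the function names, the iteration count and the `.example` site names are assumptions made for the example, and it ignores the text analysis and real-time constraints the article describes.

```python
from math import sqrt

def normalize(scores):
    """Scale scores so their squares sum to 1 (avoids unbounded growth)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(links, iterations=20):
    """Compute hub and authority scores.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        new_auth = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # A page's hub score is the sum of the authority scores it links to.
        new_hub = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

# Hypothetical link graph: three golf sites and one unrelated football list.
links = {
    "golf-directory.example": ["golf-x.example", "golf-z.example", "golf-w.example"],
    "golf-fan.example": ["golf-x.example", "golf-z.example"],
    "golf-blog.example": ["golf-x.example"],
    "football-list.example": ["football-a.example"],
}

hub, auth = hits(links)
print("top authority:", max(auth, key=auth.get))
print("top hub:", max(hub, key=hub.get))
```

In this toy graph the golf site linked from all three golf pages ends up as the leading authority, and the directory linking to the most golf sites ends up as the leading hub, while the isolated football pages score close to zero.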

The only previous project to sort information this way was IBM Corp's "Clever" project. According to Gardi, this project failed and the results were never published. "The problem with [Clever] was that it took weeks to process the community for one word," he said. "200 million words would need a server farm the size of New York state. The sheer volume processing offline was just not practical. We're doing [it] in real time."

©ComputerWire.com. All rights reserved.
