Teoma preps relaunch, wants to be Google-beater

From quantity to quality

ComputerWire: IT Industry Intelligence

Six months after its acquisition by Ask Jeeves Inc, Teoma Technologies Inc is about to relaunch its web site, home to a search engine it hopes will eventually be more widely used than Google. But the firm has some way to go before that is achievable, and admits its offering is not fully finished.

Paul Gardi, who founded Teoma and is now VP of search technology, said recent statistics prove Teoma is well-liked by searchers. When Teoma web search results started supplementing Ask Jeeves' results (replacing DirectHit's hits), Ask.com clickthrough rates went up 25%.

But the company also has high hopes for its standalone Google-alike site, Teoma.com, as well as the potential for providing search services to unaffiliated portals. Revenue generating features such as paid URL submission and sponsored links (courtesy of Overture Services Inc) are already live on the beta site.

"We think about Ask Jeeves as a customer," said Gardi. "They're our first customer for syndicated search." In the third quarter, Teoma will start offering web search to other portals as a hosted service, moving into a market dominated by Google Inc, Inktomi Corp and others.

Teoma's special sauce is its ability to identify, on the fly, "communities" of interlinked web sites, which in turn help categorize results by topic areas rather than just keywords. The idea, sometimes known as "Kleinberg theory", uses a system of "hubs" and "authorities" to identify communities.

"Pages tend to point at one page more than any other page," said Gardi.

Authorities, which appear as the regular search hits in Teoma, are ranked according to scores generated by text analysis, link analysis and, soon, popularity (courtesy of clickthrough data from sister company DirectHit).

This is fairly similar to existing search technologies. Hubs, to be known as "resources" in the relaunched service, are a little different. Hubs are essentially sites containing topic-specific lists of links to scores of relevant pages. Teoma finds hubs by looking to see which sites link to lots of authorities.

Keywords still play a part. For example, if you search for "Britney Spears", the first authority is britneyspears.com. Teoma realizes, on the fly, that the phrase is quite often accompanied by the word "pictures". You get the option to refine your search to "Britney Spears Pictures", which brings up a whole new community of sites, with worldofbritney.com as the leading authority.

But it's not a perfect system. The same pop starlet search brings up "Free Email" as an option to refine your search, and Justin Timberlake doesn't get a look in.
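The refinement step described above can be illustrated with a simple co-occurrence count. This is only a minimal sketch, not Teoma's method, which the company has not detailed here: the function name `suggest_refinements`, the stopword list and the sample documents are all hypothetical. It counts which words most often appear in documents that also contain every query term, and offers the most frequent as refinements. It also hints at why an unrelated phrase can surface: any term that happens to co-occur often enough will be suggested.

```python
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = frozenset({"the", "a", "of", "and"})

def suggest_refinements(query, documents, top_n=3):
    """Suggest refined queries from words that co-occur with the query terms."""
    query_terms = query.lower().split()
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        # Only documents containing every query term contribute.
        if all(term in words for term in query_terms):
            # Count each co-occurring word once per document.
            counts.update(set(words) - set(query_terms) - STOPWORDS)
    return [f"{query} {word}" for word, _ in counts.most_common(top_n)]

# Made-up sample documents, not real crawl data.
documents = [
    "britney spears pictures gallery",
    "britney spears pictures and news",
    "britney spears tour dates",
    "free email signup",
]
suggestions = suggest_refinements("Britney Spears", documents, top_n=1)
print(suggestions)
```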

"In my mind we're not completely out of beta," said Gardi. "The dataset needs to be of a size that represents the web." Currently, Teoma is aware of about 900 million URLs and crawls about 400 million, but cuts out duplicates and spam to end up with a dataset of about 200 million fully indexed pages (only double the size it was last July when the service launched).

By comparison, Google has over 2 billion URLs, of which not all are full-text indexed. Gardi said Teoma's ideal index size will be between 350 million and 500 million pages. "There are not 2 billion useful pages on the web," he said.

"The conversation needs to move from quantity to quality."

The technology is based on the proven assumption that the web is not fully interconnected. With the possible exception of catch-all sites such as Yahoo!, sites tend to link to their own kind. Golf sites are unlikely to link to football sites, but they will link to other golf sites. The more golf sites that link to golf site X, the more likely it is to be an authority. The more golf sites that site Y links to, the more likely that site is a golf hub.
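The golf example can be sketched with the textbook hubs-and-authorities iteration associated with Kleinberg's work. This is a minimal illustration under stated assumptions, not Teoma's real-time implementation: the function `hits`, the fixed iteration count and the tiny link graph are all hypothetical. A page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to; the two are recomputed in turn and normalized.

```python
from math import sqrt

def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link here.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}
        # Hub: sum of authority scores of the pages linked from here.
        new_hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}
    return hub, auth

# Made-up link graph: a small golf community and an unrelated football pair.
links = {
    "golf-directory": ["golf-x", "golf-a", "golf-b"],
    "golf-a": ["golf-x"],
    "golf-b": ["golf-x"],
    "football-fan": ["football-news"],
}
hub, auth = hits(links)
top_auth = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
print(top_auth, top_hub)
```

In this toy graph the page most linked to by golf sites comes out as the top authority, and the page linking to the most golf sites comes out as the top hub, matching sites X and Y in the description above.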

The only previous project to sort information with this type of system was IBM Corp's "Clever" project. According to Gardi, this project failed and the results were never published. "The problem with [Clever] was that it took weeks to process the community for one word," he said. "200 million words would need a server farm the size of New York state. The sheer volume processing offline was just not practical. We're doing in real time."

©ComputerWire.com. All rights reserved.
