Teoma preps relaunch, wants to be Google-beater

From quantity to quality

ComputerWire: IT Industry Intelligence

Six months after its acquisition by Ask Jeeves Inc, Teoma Technologies Inc is about to relaunch its web site, home to a search engine it hopes will eventually be more widely used than Google. But the firm has some way to go before that is achievable, and admits its offering is not fully finished.

Paul Gardi, who founded Teoma and is now VP of search technology, said recent statistics prove Teoma is well-liked by searchers. When Teoma web search results started supplementing Ask Jeeves' results (replacing DirectHit's hits), Ask.com clickthrough rates went up 25%.

But the company also has high hopes for its standalone Google-alike site, Teoma.com, as well as the potential for providing search services to unaffiliated portals. Revenue generating features such as paid URL submission and sponsored links (courtesy of Overture Services Inc) are already live on the beta site.

"We think about Ask Jeeves as a customer," said Gardi. "They're our first customer for syndicated search." In the third quarter, Teoma will start offering web search to other portals as a hosted service, moving into a market dominated by Google Inc, Inktomi Corp and others.

Teoma's special sauce is its ability to identify, on the fly, "communities" of interlinked web sites, which in turn help categorize results by topic areas rather than just keywords. The idea, sometimes known as "Kleinberg theory", uses a system of "hubs" and "authorities" to identify communities.

"Pages tend to point at one page more than any other page," said Gardi.

Authorities, which appear as the regular search hits in Teoma, are ranked according to scores generated by text analysis, link analysis and, soon, popularity (courtesy of clickthrough data from sister company DirectHit).

This is fairly similar to existing search technologies. Hubs, to be known as "resources" in the relaunched service, are a little different. Hubs are essentially sites containing topic-specific link lists, where scores of relevant pages are linked. Teoma finds hubs by looking to see which sites link to lots of authorities.

Keywords still play a part. For example, if you search for "Britney Spears", the first authority is britneyspears.com. Teoma realizes, on the fly, that the phrase is quite often accompanied by the word "pictures". You get the option to refine your search to "Britney Spears Pictures", which brings up a whole new community of sites, with worldofbritney.com as the leading authority.

But it's not a perfect system. The same pop starlet search brings up "Free Email" as an option to refine your search, and Justin Timberlake doesn't get a look in.
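The refinement suggestion described above can be illustrated with a simple co-occurrence count. This is only a sketch: the article does not say how Teoma actually picks its suggestions, and the function name `suggest_refinements` and the sample page titles are invented for illustration. It also hints at why an unrelated phrase can surface if it happens to appear often alongside the query in the indexed text.

```python
from collections import Counter

def suggest_refinements(query, documents, top_n=1):
    """Return the words that most often accompany the query terms.

    Hypothetical helper: counts, per document, each word that appears
    alongside every query term, then returns the most frequent ones.
    """
    query_terms = query.lower().split()
    counts = Counter()
    for text in documents:
        words = text.lower().split()
        # Only consider documents that contain every query term.
        if all(term in words for term in query_terms):
            # Count each accompanying word once per document.
            counts.update(w for w in set(words) if w not in query_terms)
    return [word for word, _ in counts.most_common(top_n)]

# Invented sample data standing in for indexed page titles.
titles = [
    "Britney Spears pictures gallery",
    "Britney Spears pictures and news",
    "Britney Spears official site",
    "free email signup",
]

suggestions = suggest_refinements("Britney Spears", titles, top_n=1)
print(suggestions)
```

With this sample data, "pictures" is the only word that accompanies the query in more than one title, so it is the top suggestion.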

"In my mind we're not completely out of beta," said Gardi. "The dataset needs to be of a size that represents the web." Currently, Teoma is aware of about 900 million URLs, crawls about 400 million, but cuts out duplicates and spam to end up with a dataset of about 200 million fully indexed pages (only double the size it was last July when the service launched).

By comparison, Google has over 2 billion URLs, of which not all are full-text indexed. Gardi said Teoma's ideal index size will be between 350 million and 500 million pages. "There are not 2 billion useful pages on the web," he said.

"The conversation needs to move from quantity to quality."

The technology is based on the proven assumption that the web is not fully interconnected. With the possible exception of catch-all sites such as Yahoo!, sites tend to link to others of their own kind. Golf sites are unlikely to link to football sites, but they will link to other golf sites. The more golf sites that link to golf site X, the more likely it is to be an authority. The more golf sites that are linked from site Y, the more likely that site is a golf hub.
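The hub and authority idea can be sketched with the textbook iteration from Kleinberg's work (usually called HITS). This is not Teoma's implementation, which the article does not describe; the site names and link graph below are made up, and the function names `hits` and `normalize` are chosen for illustration.

```python
from math import sqrt

def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        new_hub = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

# Invented golf "community": three sites link to golf-x, two to golf-z.
links = {
    "golf-directory": ["golf-x", "golf-z"],
    "golf-club": ["golf-x", "golf-z"],
    "golf-blog": ["golf-x"],
    "golf-x": [],
}

hub, auth = hits(links)
print(max(auth, key=auth.get))  # the most linked-to site in the community
```

In this toy graph, golf-x collects the most inbound links from golf sites and ends up with the highest authority score, while the sites that link to both golf-x and golf-z score higher as hubs than the blog that links to only one.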

The only previous project to sort information in this way was IBM Corp's "Clever" project. According to Gardi, this project failed and the results were never published. "The problem with [Clever] was that it took weeks to process the community for one word," he said. "200 million words would need a server farm the size of New York state. The sheer volume processing offline was just not practical. We're doing [it] in real time."

©ComputerWire.com. All rights reserved.
