Teoma preps relaunch, wants to be Google-beater

From quantity to quality

ComputerWire: IT Industry Intelligence

Six months after its acquisition by Ask Jeeves Inc, Teoma Technologies Inc is about to relaunch its web site, home to a search engine it hopes will eventually be more widely used than Google. But the firm has some way to go before that is achievable, and admits its offering is not yet finished.

Paul Gardi, who founded Teoma and is now VP of search technology, said recent statistics prove Teoma is well-liked by searchers. When Teoma web search results started supplementing Ask Jeeves' results (replacing DirectHit's hits), Ask.com clickthrough rates went up 25%.

But the company also has high hopes for its standalone Google-alike site, Teoma.com, as well as for providing search services to unaffiliated portals. Revenue-generating features such as paid URL submission and sponsored links (courtesy of Overture Services Inc) are already live on the beta site.

"We think about Ask Jeeves as a customer," said Gardi. "They're our first customer for syndicated search." In the third quarter, Teoma will start offering web search to other portals as a hosted service, moving into a market dominated by Google Inc, Inktomi Corp and others.

Teoma's special sauce is its ability to identify, on the fly, "communities" of interlinked web sites, which in turn help categorize results by topic areas rather than just keywords. The idea, sometimes known as "Kleinberg theory", uses a system of "hubs" and "authorities" to identify communities.

"Pages tend to point at one page more than any other page," said Gardi.

Authorities, which appear as the regular search hits in Teoma, are ranked according to scores generated by text analysis, link analysis and, soon, popularity (courtesy of clickthrough data from sister company DirectHit).

This is fairly similar to existing search technologies. Hubs, to be known as "resources" in the relaunched service, are a little different. Hubs are essentially sites containing topic-specific link lists that point to large numbers of relevant pages. Teoma finds hubs by looking for sites that link to many authorities.

Keywords still play a part. For example, if you search for "Britney Spears", the first authority is britneyspears.com. Teoma realizes, on the fly, that the phrase is quite often accompanied by the word "pictures". You get the option to refine your search to "Britney Spears Pictures", which brings up a whole new community of sites, with worldofbritney.com as the leading authority.
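One way to picture this kind of refinement is a simple co-occurrence count: among documents that contain every query term, tally which other words appear most often. The sketch below is only an illustration of that idea; the function name `suggest_refinements` and the toy documents are assumptions, and the article does not say how Teoma actually generates its suggestions.

```python
from collections import Counter

def suggest_refinements(query, documents, top_n=3):
    """Return the words that most often accompany the query terms.

    Hypothetical helper for illustration only, not Teoma's method.
    """
    query_terms = query.lower().split()
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        # Only consider documents that contain every query term.
        if all(term in words for term in query_terms):
            # Count each accompanying word once per document.
            counts.update(set(words) - set(query_terms))
    return [word for word, _ in counts.most_common(top_n)]

# Toy corpus (invented for this example).
docs = [
    "britney spears pictures gallery",
    "britney spears pictures and news",
    "britney spears official site",
    "golf clubs pictures",
]
suggestions = suggest_refinements("Britney Spears", docs, top_n=1)
```

A raw count like this also shows why an irrelevant suggestion can surface: any word that happens to be common on matching pages will score well, whether or not it relates to the topic.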

But it's not a perfect system. The same pop starlet search brings up "Free Email" as an option to refine your search, and Justin Timberlake doesn't get a look in.

"In my mind we're not completely out of beta," said Gardi. "The dataset needs to be of a size that represents the web." Currently, Teoma is aware of about 900 million URLs and crawls about 400 million, but cuts out duplicates and spam to end up with a dataset of about 200 million fully indexed pages (only double the size it was last July, when the service launched).

By comparison, Google has over 2 billion URLs, of which not all are full-text indexed. Gardi said Teoma's ideal index size will be between 350 million and 500 million pages. "There are not 2 billion useful pages on the web," he said.

"The conversation needs to move from quantity to quality."

The technology is based on the observation that the web is not fully interconnected. With the possible exception of catch-all sites such as Yahoo!, sites tend to link to others of their own kind. Golf sites are unlikely to link to football sites, but they will link to other golf sites. The more golf sites that link to golf site X, the more likely it is to be an authority. The more golf sites that site Y links to, the more likely site Y is to be a golf hub.
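The hub and authority idea described above can be sketched with the iterative scoring from Kleinberg's published hubs-and-authorities (HITS) algorithm: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. This is a minimal sketch on an invented link graph, not Teoma's implementation; the page names and the `hits` function are assumptions made for illustration.

```python
import math

def hits(links, iterations=50):
    """Score pages as hubs and authorities.

    links: dict mapping a page to the list of pages it links to.
    Illustrative sketch only, not Teoma's actual code.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Invented graph: a golf community and a separate football community.
links = {
    "golf-links-Y": ["golf-X", "golf-A", "golf-B"],
    "golf-A": ["golf-X"],
    "golf-B": ["golf-X"],
    "football-hub": ["football-1", "football-2"],
}
hub, auth = hits(links)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

In this toy graph the golf page that other golf pages point to ends up as the top authority, and the link list that points to the most golf pages ends up as the top hub. The more densely linked golf community also dominates the scores, which is why such scoring is usually run on the subset of pages relevant to one query.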

The only previous project to sort information this way was IBM Corp's "Clever" project. According to Gardi, this project failed and the results were never published. "The problem with [Clever] was that it took weeks to process the community for one word," he said. "200 million words would need a server farm the size of New York state. The sheer volume processing offline was just not practical. We're doing [it] in real time."

©ComputerWire.com. All rights reserved.
