Teoma preps relaunch, wants to be Google-beater

From quantity to quality

ComputerWire: IT Industry Intelligence

Six months after its acquisition by Ask Jeeves Inc, Teoma Technologies Inc is about to relaunch its web site, home to a search engine it hopes will eventually be more widely used than Google. But the firm has some way to go before that is achievable, and admits its offering is not fully finished.

Paul Gardi, who founded Teoma and is now VP of search technology, said recent statistics prove Teoma is well-liked by searchers. When Teoma web search results started supplementing Ask Jeeves' results (replacing DirectHit's hits), Ask.com clickthrough rates went up 25%.

But the company also has high hopes for its standalone Google-alike site, Teoma.com, as well as the potential for providing search services to unaffiliated portals. Revenue-generating features such as paid URL submission and sponsored links (courtesy of Overture Services Inc) are already live on the beta site.

"We think about Ask Jeeves as a customer," said Gardi. "They're our first customer for syndicated search." In the third quarter, Teoma will start offering web search to other portals as a hosted service, moving into a market dominated by Google Inc, Inktomi Corp and others.

Teoma's special sauce is its ability to identify, on the fly, "communities" of interlinked web sites, which in turn help categorize results by topic areas rather than just keywords. The idea, sometimes known as "Kleinberg theory", uses a system of "hubs" and "authorities" to identify communities.

"Pages tend to point at one page more than any other page," said Gardi.

Authorities, which appear as the regular search hits in Teoma, are ranked according to scores generated by text analysis, link analysis and, soon, popularity (courtesy of clickthrough data from sister company DirectHit).

This is fairly similar to existing search technologies. Hubs, to be known as "resources" in the relaunched service, are a little different. Hubs are essentially sites containing topic-specific link lists, where scores of relevant pages are linked. Teoma finds hubs by looking to see which sites link to lots of authorities.

Keywords still play a part. For example, if you search for "Britney Spears", the first authority is britneyspears.com. Teoma realizes, on the fly, that the phrase is quite often accompanied by the word "pictures". You get the option to refine your search to "Britney Spears Pictures", which brings up a whole new community of sites, with worldofbritney.com as the leading authority.
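As a rough illustration of the refinement idea, the sketch below counts which words most often accompany a query in a set of matching documents and offers the most frequent ones as refinements. This is not Teoma's method, which the article does not describe in detail; the function name `suggest_refinements` and the tiny sample corpus are hypothetical.

```python
from collections import Counter


def suggest_refinements(query, documents, top_n=3):
    """Suggest refined queries from words that co-occur with the query terms.

    Hypothetical helper for illustration only: a plain co-occurrence count,
    not the technique Teoma actually uses.
    """
    query_terms = query.lower().split()
    counts = Counter()
    for text in documents:
        words = text.lower().split()
        # Only documents containing every query term contribute.
        if all(term in words for term in query_terms):
            # Count each companion word once per document.
            counts.update({w for w in words if w not in query_terms})
    return [f"{query} {word}" for word, _ in counts.most_common(top_n)]


# Made-up sample corpus (assumption, not real crawl data).
docs = [
    "britney spears pictures gallery",
    "britney spears pictures and news",
    "britney spears concert pictures",
    "britney spears tour news",
    "free email signup",
]
suggestions = suggest_refinements("britney spears", docs, top_n=2)
print(suggestions)
```

A count this naive would also surface unrelated but frequent terms, which is one plausible way an odd suggestion could creep into a list of refinements.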

But it's not a perfect system. The same pop starlet search brings up "Free Email" as an option to refine your search, and Justin Timberlake doesn't get a look in.

"In my mind we're not completely out of beta," said Gardi. "The dataset needs to be of a size that represents the web." Currently, Teoma is aware of about 900 million URLs, crawls about 400 million, but cuts out duplicates and spam to end up with a dataset of about 200 million fully indexed pages (only double the size it was last July when the service launched).

By comparison, Google has over 2 billion URLs, of which not all are full-text indexed. Gardi said Teoma's ideal index size will be between 350 million and 500 million pages. "There are not 2 billion useful pages on the web," he said.

"The conversation needs to move from quantity to quality."

The technology is based on the proven assumption that the web is not fully interconnected. With the possible exception of catch-all sites such as Yahoo!, sites tend to link to their own kind. Golf sites are unlikely to link to football sites, but they will link to other golf sites. The more golf sites that link to golf site X, the more likely it is to be an authority. The more golf sites that are linked from site Y, the more likely that site is a golf hub.
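The golf example can be sketched with the textbook hubs-and-authorities iteration from Kleinberg's work: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The code below is a minimal sketch of that published idea, not Teoma's implementation; the function names, the site names and the link graph are all made up for illustration.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length so they stay bounded."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Textbook hubs-and-authorities iteration.

    links maps each page to the list of pages it links to.
    Returns (hub_scores, authority_scores).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to this page.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub: sum of authority scores of the pages this page links to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical link graph: a golf community plus one unrelated football link.
links = {
    "golf-directory": ["golf-x", "golf-a", "golf-b"],
    "golf-a": ["golf-x"],
    "golf-b": ["golf-x"],
    "football-fan": ["football-z"],
}
hub, auth = hits(links)
print("top authority:", max(auth, key=auth.get))
print("top hub:", max(hub, key=hub.get))
```

In this toy graph the site that most golf pages point to ends up as the leading authority, and the directory that links to many golf sites ends up as the leading hub, while the isolated football link carries almost no weight.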

The only previous project to sort information this way was IBM Corp's "Clever" project. According to Gardi, this project failed and the results were never published. "The problem with [Clever] was that it took weeks to process the community for one word," he said. "200 million words would need a server farm the size of New York state. The sheer volume processing offline was just not practical. We're doing [it] in real time."

©ComputerWire.com. All rights reserved.
