Bing finds meaning in Powerset

Well, at least a little

Powerset's semantic obsession is already working its way into Bing's primary search engine, helping to suss out the meaning behind end-user queries, generate captions for query results, and suggest related queries.

Microsoft acquired the San Francisco-based Powerset last summer in a deal worth a reported $100m, nearly a year before Bing's much-ballyhooed debut. At the time, the startup offered a semantic search engine that indexed nothing but Wikipedia, and this Wikicontraption was eventually bolted to the side of Bing's primary search engine and rechristened as a "Reference" vertical.

But the ultimate goal is to meld Powerset's semantic indexing with Bing proper, and according to Scott Prevost, who oversees Powerset's interplay with Redmond, the melding is well underway.

"We're taking pieces of our technology and integrating it throughout the Bing stack," Prevost tells The Reg. "So things like helping on some of the query processing. And we're now working on some of the caption generation - the text that occurs under the blue link on search results. This is part of a longer-term, deeper integration of our technologies throughout all of Bing."

He also says the outfit is "doing some work with related searches" - i.e. helping to suggest additional queries the user may be interested in.

Still based in San Francisco - several hundred miles away from Bing's Redmond base - the 65-person-strong Powerset is "diving very deeply" into the task of caption generation. "It's one of the things that helps users understand the relevance of a particular search result to their query," Prevost says. "If you have good captions, it helps users not waste time looking through pages.

"One of the challenges in developing captions is finding the right pieces of text on a page to represent that link, so semantic processing really helps. It helps pick the right sentences, sentences that may have the right concepts but not necessarily the keywords from [the user's query]. It helps us pick the piece of the sentence that's most relevant and not chop it off in places that makes it unreadable...

"You see things in Powerset captions such as whole phrases being highlighted, phrases where the words don't match all the keywords but the meaning of the words matches. Sometimes, you get a great sentence in an article and it doesn't have all the keywords but it's really the thing that best explains what the sentence is about."

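To make the caption idea concrete, here is a minimal sketch of picking a whole sentence by meaning rather than by keyword. It is not Powerset's method: the `CONCEPTS` table, the `pick_caption` function, and the sample page are all hypothetical, and a hand-written dictionary stands in for real linguistic analysis.

```python
import re

# Hypothetical concept table: maps surface words to a shared concept label.
# A real system would derive this from linguistic analysis, not a dict.
CONCEPTS = {
    "born": "birth", "birth": "birth", "birthplace": "birth",
    "physicist": "scientist", "scientist": "scientist",
}

def concepts(text):
    """Return the set of concept labels for the words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {CONCEPTS.get(word, word) for word in words}

def pick_caption(page_text, query):
    """Pick the whole sentence whose concepts overlap most with the query's."""
    parts = re.split(r"(?<=[.!?])\s+", page_text)
    sentences = [s.strip() for s in parts if s.strip()]
    wanted = concepts(query)
    # Whole sentences are returned, so the caption is never chopped mid-phrase.
    return max(sentences, key=lambda s: len(concepts(s) & wanted))

page = ("Albert Einstein was a theoretical physicist. "
        "Einstein was born in Ulm in 1879. "
        "He received the Nobel Prize in 1921.")
caption = pick_caption(page, "einstein birthplace")
print(caption)
```

The chosen sentence never contains the keyword "birthplace"; it wins because "born" and "birthplace" map to the same concept in the assumed table.
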
Using its own back-end infrastructure, Powerset works to build a semantic index for at least a portion of the web. "When we index a document, we do much heavier processing," Prevost explains. "We do deep linguistic processing, everything from morphological analysis - scanning the words for our speech patterns - to full-on syntactic parsing of sentences.

"Then we have a component that extracts semantic relationships from those parses." For instance, the outfit's proprietary tech works to recognize synonyms or associate generic pronouns with particular names. Then, after doing a similar analysis on an end-user query, Powerset can match semantic data between query and index.

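The matching step can be sketched in a few lines, assuming the parser has already produced subject-verb-object triples. Everything here is illustrative rather than Powerset's actual design: the `CANONICAL` synonym table, the triple format, and the `build_index` and `search` helpers are hypothetical.

```python
# Hypothetical synonym table: each verb maps to one canonical relation.
CANONICAL = {
    "defeated": "defeat", "beat": "defeat", "conquered": "defeat",
    "wrote": "write", "authored": "write",
}

def normalize(triple):
    """Lower-case a (subject, verb, object) triple and canonicalize its verb."""
    subject, verb, obj = triple
    verb = verb.lower()
    return (subject.lower(), CANONICAL.get(verb, verb), obj.lower())

def build_index(docs):
    """Map each normalized triple to the set of document ids that contain it.

    docs maps a document id to a list of triples, assumed to come from a parser.
    """
    index = {}
    for doc_id, triples in docs.items():
        for triple in triples:
            index.setdefault(normalize(triple), set()).add(doc_id)
    return index

def search(index, query_triple):
    """Run the same normalization on the query, then look it up."""
    return sorted(index.get(normalize(query_triple), set()))

docs = {
    "doc1": [("Wellington", "defeated", "Napoleon")],
    "doc2": [("Tolstoy", "wrote", "War and Peace")],
}
index = build_index(docs)
hits = search(index, ("Wellington", "beat", "Napoleon"))
print(hits)
```

The query says "beat" while the document says "defeated", yet both normalize to the same relation, so the lookup succeeds without a shared keyword.
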
Yes, Powerset's back-end runs on Hadoop, the open-source distributed-computing platform based on Google's proprietary infrastructure. Powerset originated Hadoop's HBase project, a mirror of Google's distributed database, Bigtable. And yes, that means open-source code is juicing at least a portion of Bing proper. "What we provide Bing with is data, and data can be produced using various open-source tools in Powerset's data center," Prevost says.

Famously, Microsoft spent years treating open source like a pariah, and even now it seems that relatively few of the company's shipping products embrace open code. But according to Prevost, Microsoft was always open to the idea of retaining Powerset's HBase base.

"We obviously had a lot of conversations [with Microsoft] about what we were doing and why it was important," Prevost says. "Microsoft was very open to the idea of open source. Obviously, Microsoft has a lot of IP concerns with software in so many different domains, so they want to be very careful about these things...but it was really just a matter of working out the details."

After the acquisition, while these conversations played out, Powerset's two full-time HBase committers took leave from the project. But by October, they were approved to resume contributing patches.

As you might expect, Microsoft has no plans to migrate Bing proper onto the platform. "We haven't done anything to the Bing code base that explicitly uses HBase," Prevost says.

But whether it's underpinned by Hadoop or not, Powerset intends to build a semantic index for the entire web. It just needs some time - and some cheaper, faster processing power. "Where we are right now is that it's still very expensive. We spend a lot more time indexing a page and that takes a lot more processing power. And that creates a much larger index, which is more expensive to serve. It wouldn't make sense for us to index the entire web, because it would be highly expensive, and for certain kinds of pages, we might not see the value."

So, for the moment, Powerset is indexing Wikipedia. But there's more to come. It may add other, contained datasets to Bing's Reference vertical, before attempting to embrace the web as a whole. And yes, it will take that Reference tab out of hiding. As it stands, Powerset's Wikisearch is limited to a relatively small number of queries, including the search for "Albert Einstein."

Bootnote

How does Powerset avoid Wikinonsense? According to Prevost, it re-indexes the "free encyclopedia anyone can edit" every two hours or so. "We look for changes and re-index those articles," Prevost explains. "That helps to make sure we don't have pages that are vandalized...the vandalized pages get fixed pretty quickly." Or so it seems.

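Prevost doesn't say how changed articles are spotted, so the following is only a sketch of one common approach: keep a content hash per article and re-index whatever no longer matches. The `changed_articles` helper, the `seen` store, and the sample articles are hypothetical.

```python
import hashlib

def fingerprint(text):
    """Return a SHA-256 digest of an article's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_articles(current, seen):
    """Return titles whose text differs from the last indexed version.

    current maps title -> article text for this pass.
    seen maps title -> digest from earlier passes and is updated in place.
    """
    changed = []
    for title, text in current.items():
        digest = fingerprint(text)
        if seen.get(title) != digest:
            changed.append(title)  # new or edited: queue for re-indexing
            seen[title] = digest
    return changed

seen = {}
first_pass = changed_articles(
    {"Albert Einstein": "Physicist.", "Bing": "Search engine."}, seen)
second_pass = changed_articles(
    {"Albert Einstein": "Physicist.", "Bing": "Search engine!!!"}, seen)
print(first_pass, second_pass)
```

On the first pass every article is new; on the second, only the edited one is queued, which is what keeps a two-hourly refresh cheap under this assumption.
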