Bing finds meaning in Powerset

Well, at least a little

Powerset's semantic obsession is already working its way into Bing's primary search engine, helping to suss out the meaning behind end-user queries, generate captions for query results, and suggest related queries.

Microsoft acquired the San Francisco-based Powerset last summer in a deal worth a reported $100m, nearly a year before Bing's much-ballyhooed debut. At the time, the startup offered a semantic search engine that indexed nothing but Wikipedia, and this Wikicontraption was eventually bolted to the side of Bing's primary search engine and rechristened as a "Reference" vertical.

But the ultimate goal is to meld Powerset's semantic indexing with Bing proper, and according to Scott Prevost, who oversees Powerset's interplay with Redmond, the melding is well underway.

"We're taking pieces of our technology and integrating it throughout the Bing stack," Prevost tells The Reg. "So things like helping on some of the query processing. And we're now working on some of the caption generation - the text that occurs under the blue link on search results. This is part of a longer-term, deeper integration of our technologies throughout all of Bing."

He also says the outfit is "doing some work with related searches" - i.e. helping to suggest additional queries the user may be interested in.

Still based in San Francisco - several hundred miles away from Bing's Redmond base - the 65-person-strong Powerset is "diving very deeply" into the task of caption generation. "It's one of the things that helps users understand the relevance of a particular search result to their query," Prevost says. "If you have good captions, it helps users not waste time looking through pages.

"One of the challenges in developing captions is finding the right pieces of text on a page to represent that link, so semantic processing really helps. It helps pick the right sentences, sentences that may have the right concepts but not necessarily the keywords from [the user's query]. It helps us pick the piece of the sentence that's most relevant and not chop it off in places that makes it unreadable...

"You see things in Powerset captions such as whole phrases being highlighted, phrases where the words don't match all the keywords but the meaning of the words matches. Sometimes, you get a great sentence in an article and it doesn't have all the keywords but it's really the thing that best explains what the sentence is about."

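As a rough illustration of the idea Prevost describes, the toy Python sketch below picks a caption sentence by concept overlap rather than exact keyword overlap. It is not Powerset's method: the `CONCEPTS` table and the `to_concepts` and `pick_caption` functions are hypothetical names invented for this example, and a real system would derive concepts from linguistic analysis rather than a hand-written lookup.

```python
import re

# Hypothetical hand-written concept table (an assumption for illustration only).
CONCEPTS = {
    "born": "birth", "birthplace": "birth", "birth": "birth",
    "physicist": "scientist", "scientist": "scientist",
}

def to_concepts(text):
    """Lowercase the text, split it into words, and map each word to a concept label."""
    words = re.findall(r"[a-z]+", text.lower())
    return {CONCEPTS.get(w, w) for w in words}

def pick_caption(query, page_text):
    """Return the whole sentence that shares the most concepts with the query."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", page_text) if s.strip()]
    query_concepts = to_concepts(query)
    # Whole sentences are returned, so the caption is never chopped mid-phrase.
    return max(sentences, key=lambda s: len(to_concepts(s) & query_concepts))

page = ("The physicist published many papers. "
        "He was born in Ulm in 1879. "
        "The theory of relativity made him famous.")
caption = pick_caption("einstein birthplace", page)
print(caption)
```

The chosen sentence contains neither "einstein" nor "birthplace", yet it wins because "born" and "birthplace" map to the same concept in the toy table.
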
Using its own back-end infrastructure, Powerset works to build a semantic index for at least a portion of the web. "When we index a document, we do much heavier processing," Prevost explains. "We do deep linguistic processing, everything from morphological analysis - scanning the words for our speech patterns - to full-on syntactic parsing of sentences.

"Then we have a component that extracts semantic relationships from those parses." For instance, the outfit's proprietary tech works to recognize synonyms or associate generic pronouns with particular names. Then, after doing a similar analysis on an end-user query, Powerset can match semantic data between query and index.

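The matching step can be sketched in miniature, assuming a drastically simplified pipeline: canonicalize synonyms, replace pronouns with the most recently seen name, then compare query terms against indexed terms. Everything here (`SYNONYMS`, `PRONOUNS`, `NAMES`, `normalize`, `search`) is a hypothetical stand-in and not Powerset's proprietary tech, which relies on full syntactic parsing.

```python
import re

# Hypothetical lookup tables (assumptions for illustration only).
SYNONYMS = {"defeated": "defeat", "beaten": "defeat"}
PRONOUNS = {"he", "she", "they"}
NAMES = {"napoleon", "wellington"}

def normalize(sentences):
    """Return (sentence, term set) pairs with synonyms canonicalized and
    pronouns replaced by the most recently mentioned name."""
    last_name = None
    result = []
    for sentence in sentences:
        terms = set()
        for tok in re.findall(r"[a-z]+", sentence.lower()):
            if tok in NAMES:
                last_name = tok
            elif tok in PRONOUNS and last_name:
                tok = last_name  # crude pronoun-to-name association
            terms.add(SYNONYMS.get(tok, tok))
        result.append((sentence, terms))
    return result

def search(query, index):
    """Run the same analysis on the query, then match term sets against the index."""
    query_terms = normalize([query])[0][1]
    return [sentence for sentence, terms in index if query_terms <= terms]

index = normalize(["Napoleon invaded Russia in 1812.", "He was beaten at Waterloo."])
hits = search("napoleon defeated", index)
print(hits)
```

The matching sentence shares no literal keyword with the query. It is found only because "He" resolves to the name in the previous sentence and "beaten" and "defeated" collapse to one term.
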
Yes, Powerset's back-end runs on Hadoop, the open-source distributed-computing platform modeled on Google's proprietary MapReduce and GFS infrastructure. Powerset originated Hadoop's Hbase project, an open-source counterpart to Google's distributed database, BigTable. And yes, that means open-source code is juicing at least a portion of Bing proper. "What we provide Bing with is data, and data can be produced using various open-source tools in Powerset's data center," Prevost says.

Famously, Microsoft spent years treating open source like a pariah, and even now it seems that relatively few of the company's shipping products embrace open code. But according to Prevost, Microsoft was always open to the idea of retaining Powerset's Hbase base.

"We obviously had a lot of conversations [with Microsoft] about what we were doing and why it was important," Prevost says. "Microsoft was very open to the idea of open source. Obviously, Microsoft has a lot of IP concerns with software in so many different domains, so they want to be very careful about these things...but it was really just a matter of working out the details."

After the acquisition, while these conversations played out, Powerset's two full-time Hbase committers took leave from the project. But by October, they were approved to resume contributing patches.

As you might expect, Microsoft has no plans to migrate Bing proper onto the platform. "We haven't done anything to the Bing code base that explicitly uses Hbase," Prevost says.

But whether it's underpinned by Hadoop or not, Powerset intends to build a semantic index for the entire web. It just needs some time - and some cheaper, faster processing power. "Where we are right now is that it's still very expensive. We spend a lot more time indexing a page and that takes a lot more processing power. And that creates a much larger index, which is more expensive to serve. It wouldn't make sense for us to index the entire web, because it would be highly expensive, and for certain kinds of pages, we might not see the value."

So, for the moment, Powerset is indexing Wikipedia. But there's more to come. It may add other, contained datasets to Bing's Reference vertical, before attempting to embrace the web as a whole. And yes, it will take that Reference tab out of hiding. As it stands, Powerset's Wikisearch is limited to a relatively small number of queries, including the search for "Albert Einstein."

Bootnote

How does Powerset avoid Wikinonsense? According to Prevost, it re-indexes the "free encyclopedia anyone can edit" every two hours or so. "We look for changes and re-index those articles," Prevost explains. "That helps to make sure we don't have pages that are vandalized...the vandalized pages get fixed pretty quickly." Or so it seems.
