
Bing finds meaning in Powerset

Well, at least a little

Powerset's semantic obsession is already working its way into Bing's primary search engine, helping to suss out the meaning behind end-user queries, generate captions for query results, and suggest related queries.

Microsoft acquired the San Francisco-based Powerset last summer in a deal worth a reported $100m, nearly a year before Bing's much-ballyhooed debut. At the time, the startup offered a semantic search engine that indexed nothing but Wikipedia, and this Wikicontraption was eventually bolted to the side of Bing's primary search engine and rechristened as a "Reference" vertical.

But the ultimate goal is to meld Powerset's semantic indexing with Bing proper, and according to Scott Prevost, who oversees Powerset's interplay with Redmond, the melding is well underway.

"We're taking pieces of our technology and integrating it throughout the Bing stack," Prevost tells The Reg. "So things like helping on some of the query processing. And we're now working on some of the caption generation - the text that occurs under the blue link on search results. This is part of a longer-term, deeper integration of our technologies throughout all of Bing."

He also says the outfit is "doing some work with related searches" - i.e. helping to suggest additional queries the user may be interested in.

Still based in San Francisco - several hundred miles away from Bing's Redmond base - the 65-person-strong Powerset is "diving very deeply" into the task of caption generation. "It's one of the things that helps users understand the relevance of a particular search result to their query," Prevost says. "If you have good captions, it helps users not waste time looking through pages.

"One of the challenges in developing captions is finding the right pieces of text on a page to represent that link, so semantic processing really helps. It helps pick the right sentences, sentences that may have the right concepts but not necessarily the keywords from [the user's query]. It helps us pick the piece of the sentence that's most relevant and not chop it off in places that makes it unreadable...

"You see things in Powerset captions such as whole phrases being highlighted, phrases where the words don't match all the keywords but the meaning of the words matches. Sometimes, you get a great sentence in an article and it doesn't have all the keywords but it's really the thing that best explains what the sentence is about."

Using its own back-end infrastructure, Powerset works to build a semantic index for at least a portion of the web. "When we index a document, we do much heavier processing," Prevost explains. "We do deep linguistic processing, everything from morphological analysis - scanning the words for our speech patterns - to full-on syntactic parsing of sentences.

"Then we have a component that extracts semantic relationships from those parses." For instance, the outfit's proprietary tech works to recognize synonyms or associate generic pronouns with particular names. Then, after doing a similar analysis on an end-user query, Powerset can match semantic data between query and index.
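To make the idea concrete, here is a toy sketch of concept-level matching between a query and an index. It is not Powerset's technology: the synonym table, the stopword list, and every function name below are hypothetical, and a simple word-to-concept lookup stands in for the deep linguistic processing Prevost describes.

```python
# Toy illustration only. SYNONYMS, STOPWORDS and all function names are
# hypothetical stand-ins; real semantic indexing involves full parsing.
SYNONYMS = {
    "physicist": "scientist",
    "researcher": "scientist",
    "created": "develop",
    "developed": "develop",
    "invented": "develop",
}
STOPWORDS = {"the", "a", "which"}


def concepts(text):
    """Lowercase, strip punctuation, drop stopwords, map words to concepts."""
    words = (w.strip(".,!?\"'").lower() for w in text.split())
    return {SYNONYMS.get(w, w) for w in words if w and w not in STOPWORDS}


def build_index(docs):
    """Map each document id to the set of concepts found in its text."""
    return {doc_id: concepts(text) for doc_id, text in docs.items()}


def search(index, query):
    """Return ids of documents sharing concepts with the query, best first."""
    q = concepts(query)
    scored = [(len(q & c), doc_id) for doc_id, c in index.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ranked if score > 0]


docs = {
    "a": "The physicist developed a new theory.",
    "b": "The chef cooked dinner.",
}
index = build_index(docs)
results = search(index, "which scientist created the theory")
```

The query shares only the word "theory" with document "a", yet it still matches because "physicist" and "scientist", and "developed" and "created", collapse to the same concepts on both the index side and the query side.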

Yes, Powerset's back-end runs on Hadoop, the open-source distributed-computing platform modeled on Google's proprietary infrastructure. Powerset originated Hadoop's Hbase project, an open-source counterpart to Google's distributed database, BigTable. And yes, that means open-source code is juicing at least a portion of Bing proper. "What we provide Bing with is data, and data can be produced using various open-source tools in Powerset's data center," Prevost says.

Famously, Microsoft spent years treating open source like a pariah, and even now it seems that relatively few of the company's shipping products embrace open code. But according to Prevost, Microsoft was always open to the idea of retaining Powerset's Hbase base.

"We obviously had a lot of conversations [with Microsoft] about what we were doing and why it was important," Prevost says. "Microsoft was very open to the idea of open source. Obviously, Microsoft has a lot of IP concerns with software in so many different domains, so they want to be very careful about these things...but it was really just a matter of working out the details."

After the acquisition, while these conversations played out, Powerset's two full-time Hbase committers took leave from the project. But by October, they were approved to resume contributing patches.

As you might expect, Microsoft has no plans to migrate Bing proper onto the platform. "We haven't done anything to the Bing code base that explicitly uses Hbase," Prevost says.

But whether it's underpinned by Hadoop or not, Powerset intends to build a semantic index for the entire web. It just needs some time - and some cheaper, faster processing power. "Where we are right now is that it's still very expensive. We spend a lot more time indexing a page and that takes a lot more processing power. And that creates a much larger index, which is more expensive to serve. It wouldn't make sense for us to index the entire web, because it would be highly expensive, and for certain kinds of pages, we might not see the value."

So, for the moment, Powerset is indexing Wikipedia. But there's more to come. It may add other, contained datasets to Bing's Reference vertical, before attempting to embrace the web as a whole. And yes, it will take that Reference tab out of hiding. As it stands, Powerset's Wikisearch is limited to a relatively small number of queries, including the search for "Albert Einstein."

Bootnote

How does Powerset avoid Wikinonsense? According to Prevost, it re-indexes the "free encyclopedia anyone can edit" every two hours or so. "We look for changes and re-index those articles," Prevost explains. "That helps to make sure we don't have pages that are vandalized...the vandalized pages get fixed pretty quickly." Or so it seems.
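Prevost does not say how changes are detected. As a minimal sketch of "look for changes and re-index those articles", the snippet below assumes a content hash per article and a caller-supplied indexing function. Both are illustrative assumptions, and the names `fingerprint` and `reindex_changed` are hypothetical.

```python
import hashlib


def fingerprint(text):
    """Return a stable digest of an article's text (assumed change signal)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def reindex_changed(articles, seen, index_fn):
    """Re-index only articles whose digest differs from the last pass.

    articles: dict of title -> current text
    seen:     dict of title -> digest from the previous pass (updated in place)
    index_fn: hypothetical callback that performs the expensive indexing
    Returns the list of titles that were re-indexed.
    """
    changed = []
    for title, text in articles.items():
        digest = fingerprint(text)
        if seen.get(title) != digest:
            index_fn(title, text)
            seen[title] = digest
            changed.append(title)
    return changed


seen = {}
indexed = []


def record(title, text):
    # Stand-in for real indexing work: just note which title was processed.
    indexed.append(title)


first = reindex_changed({"Albert Einstein": "v1", "Bing": "v1"}, seen, record)
second = reindex_changed({"Albert Einstein": "v2", "Bing": "v1"}, seen, record)
```

On the first pass both articles are new, so both are indexed. On the second pass only the edited article is touched, which keeps a frequent refresh cycle cheap relative to re-indexing everything.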
