How the Yahoo! homepage predicts your clicks

We! Know! You! Want! To! Click! Here!

In the summer of 2008, at an artificial intelligence confab deep in Silicon Valley, Yahoo! senior research scientist Deepak Agarwal revealed that the web giant was using automated algorithms to select news stories on its famous front page. These algorithms, he said, had boosted click-through rates by 25 to 30 per cent, driving millions of additional dollars in ad revenue.

When we approached Agarwal after his presentation to discuss the new technology and identified ourselves as The Register, he promptly buttoned his lip. We could only hope he would have done the same with The Wall Street Journal or The New York Times or Mom and Pop's Shoestring Guide to All Things Artificial Intelligence, and three years later, our hope is still alive.

This week, Raghu Ramakrishnan – Yahoo!'s chief scientist for search and cloud platforms – sat down with The Register to explain the technology in detail, boasting that click-through rates have now risen more than 270 per cent on the "Today" news module at the heart of the Yahoo! home page. Known as CORE – short for Content Optimization and Relevance Engine – the system doesn't replace human editors. It works alongside them, making many but not all of the decisions, and at the same time, it feeds editors information that can inform their very human thinking.

"To be honest, the entire increase can't be attributed just to the algorithmic aspects. A big part of the increase is due to editors working much more effectively, with some of the interactive data and real-time feedback that they're getting," Ramakrishnan tells The Register. Yes, The Register. "That was one of the key decisions we made early on. We wouldn't try to replace editors. There would be certain things that always came from them."

When Ramakrishnan arrived at Yahoo! in the summer of 2006 – after nearly twenty years as professor of computer sciences at the University of Wisconsin-Madison – the Yahoo! homepage was arranged almost entirely by human editors. But he was soon approached by two other Yahoo! bigwigs – executive vice president Jeff Weiner and his engineering counterpart Venkat Panchapakesan – with the idea of moving to a more automated setup.

This gave rise to a Yahoo! Research project dubbed the Content Optimization Knowledge Engine – COKE, for short – and those behind the project were affectionately known as Cokeheads. With this system, Deepak Agarwal said at the time, human editors still chose the pool of stories that were eligible for the Today module, but then automated algorithms decided which stories got placed where – and for how long.

"The goal was to use data mining and machine learning to optimize the content shown to users on web portals," Agarwal said.

The system was first tested in late 2007 or early 2008, and it was soon selecting Today module stories for all Yahoo! users. According to Ramakrishnan, it immediately boosted click-through rates 40 per cent. That's a little higher than the figure Agarwal gave three years ago. But they seem to be in agreement that the effect was rather significant. "I still recall Venkat [Panchapakesan] accusing me of sandbagging things," Ramakrishnan says.

Before the system launched, Ramakrishnan predicted a 20 per cent boost, so he too was impressed by the initial spike. But it was merely a start. The initial system, Agarwal said, was based on the Kalman algorithm, a filtering method developed in the early 1960s. In essence, COKE determined where stories should be placed by analyzing millions of user clicks on the fly. "We track user responses," Agarwal said, "and then we respond ourselves - in real-time."
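For readers curious what a Kalman-style click tracker might look like, here is a minimal sketch in Python. It is not Yahoo!'s code: the function names, the starting estimate, the random-walk drift term, and the binomial noise model are all assumptions made for illustration. It tracks one story's click-through rate as batches of (clicks, views) arrive.

```python
def kalman_update(estimate, variance, observed_ctr, obs_variance,
                  process_variance=1e-6):
    """One step of a scalar Kalman filter with a random-walk state model."""
    # Predict: the true rate may drift, so uncertainty grows slightly.
    prior_variance = variance + process_variance
    # Update: weigh the new observation against the running estimate.
    gain = prior_variance / (prior_variance + obs_variance)
    new_estimate = estimate + gain * (observed_ctr - estimate)
    new_variance = (1.0 - gain) * prior_variance
    return new_estimate, new_variance


def track_ctr(batches, initial_estimate=0.05, initial_variance=0.01):
    """Return the filtered click-through rate after each (clicks, views) batch.

    The starting estimate and variance are illustrative guesses.
    """
    estimate, variance = initial_estimate, initial_variance
    history = []
    for clicks, views in batches:
        observed = clicks / views
        # Approximate the observation noise with the binomial variance
        # of a rate, clamping the rate away from 0 and 1.
        p = min(max(estimate, 1e-3), 1.0 - 1e-3)
        obs_variance = p * (1.0 - p) / views
        estimate, variance = kalman_update(
            estimate, variance, observed, obs_variance)
        history.append(estimate)
    return history
```

Calling `track_ctr([(100, 1000)] * 5)` pulls the estimate from the 0.05 starting guess toward the observed rate of 0.10. Bigger batches mean less observation noise, so the filter trusts them more.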

Back then, Yahoo! had tried to personalize story placement for individual users, but this didn't have a positive effect on the click-through rate. Since then, however, the company has settled on a personalization method that makes that initial 20 to 40 per cent spike look rather small. In January and then again in March, Yahoo! says, the Today module received over one billion clicks, with the click-through rate rising 270 per cent in the US since the automated setup first debuted.

About a year ago, Yahoo! changed the name of the project, dropping COKE for CORE. This is a tad unfortunate, but the system has grown up, spreading beyond the Today module to other Yahoo! services, including its primary news site. And Ramakrishnan says the company plans to plug it into countless other services.

The system still works to predict clicks for Yahoo! users as a whole. But at the same time, it predicts clicks for individual users or segments of users, leaning on information such as their sex and their age (which users supply when they sign up for a Yahoo! account) or even what browser they use.

When a story enters the system, CORE draws on scads of existing data to generate an a priori estimate of how well it will perform based on "intrinsic features", meaning what words are in the headline and the body of the story. Then, the system tests stories – in real-time – to get a better idea of how they will perform, and – in a matter of minutes – it uses these tests to adjust the way stories will be presented to everyone.
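One common way to combine an up-front guess with live test results is a Beta-Binomial update, sketched below in Python. This is a swapped-in textbook technique, not a description of CORE's internals. The per-word click rates, the `strength` parameter (how many pseudo-impressions the prior is worth), and both function names are hypothetical.

```python
def prior_from_headline(headline, word_ctr, default_ctr=0.05, strength=200):
    """Turn headline words into a Beta(alpha, beta) prior on click-through rate.

    word_ctr maps a word to a historical click rate (hypothetical data).
    strength is how many pseudo-impressions the prior counts for.
    """
    words = headline.lower().split()
    known = [word_ctr[w] for w in words if w in word_ctr]
    # Average the rates of known words; fall back to a default otherwise.
    mean = sum(known) / len(known) if known else default_ctr
    return mean * strength, (1.0 - mean) * strength


def posterior_mean(alpha, beta, clicks, views):
    """Blend the prior with observed clicks and views from live testing."""
    return (alpha + clicks) / (alpha + beta + views)
```

With a prior worth 200 impressions centred on 5 per cent, a live test showing 30 clicks in 200 views moves the estimate to about 10 per cent: halfway between the guess and the observed 15 per cent, because prior and test carry equal weight.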

"We don't do this for every story," Ramakrishnan said. "I want to show a few things to a few people and, based on that, have a good estimate for everything in my pool. And then exploit the ones that are the most promising."

"The game here is that I never actually predict true popularity, but I use parsimonious exploration. We have billions of impressions. A tiny fraction of them I'm willing to spend to explore and get an assessment of how popular a story is likely to be – for a given user or for a given segment of users."

Ramakrishnan says the methods used are similar to the "multi-armed bandit" algorithms used in the world of slot machines, but then go much further. "The difference is that we have an extremely dynamic pool of data. We've essentially developed extensions of this statistical approach," he says. "With these, we can come up with a very accurate estimate with how popular a story is likely to be – across the entire population; across a segment, such as males 40 years old and older; with [a particular user]; with [a particular user] while they're having their morning coffee."
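The basic explore-and-exploit idea can be sketched with an epsilon-greedy bandit that keeps separate statistics per user segment. This is the plain textbook version, not the extensions Ramakrishnan describes. The class, the story names, the 5 per cent exploration rate, and the "true" click rates in the simulation are all invented for illustration.

```python
import random
from collections import defaultdict


class SegmentedBandit:
    """Epsilon-greedy bandit with separate click statistics per segment."""

    def __init__(self, stories, explore_rate=0.05, seed=0):
        self.stories = list(stories)
        self.explore_rate = explore_rate  # fraction of impressions spent exploring
        self.rng = random.Random(seed)
        self.views = defaultdict(int)
        self.clicks = defaultdict(int)

    def estimate(self, segment, story):
        # Smoothed click rate; an unseen story starts at an optimistic 0.5.
        key = (segment, story)
        return (self.clicks[key] + 1) / (self.views[key] + 2)

    def best(self, segment):
        return max(self.stories, key=lambda s: self.estimate(segment, s))

    def choose(self, segment):
        # Explore on a small fraction of impressions, exploit on the rest.
        if self.rng.random() < self.explore_rate:
            return self.rng.choice(self.stories)
        return self.best(segment)

    def record(self, segment, story, clicked):
        key = (segment, story)
        self.views[key] += 1
        self.clicks[key] += int(clicked)


# Invented click rates, used only to drive the simulation.
TRUE_CTR = {
    "men": {"pitt_drama": 0.05, "pitt_sports_movie": 0.30},
    "women": {"pitt_drama": 0.30, "pitt_sports_movie": 0.10},
}


def simulate(rounds=20000, seed=1):
    """Serve stories to alternating segments and learn from simulated clicks."""
    env = random.Random(seed)
    segments = sorted(TRUE_CTR)
    bandit = SegmentedBandit(["pitt_drama", "pitt_sports_movie"], seed=seed + 1)
    for i in range(rounds):
        segment = segments[i % len(segments)]
        story = bandit.choose(segment)
        clicked = env.random() < TRUE_CTR[segment][story]
        bandit.record(segment, story, clicked)
    return bandit
```

After `bandit = simulate()`, calling `bandit.best("men")` and `bandit.best("women")` shows which story the sketch has learned to favour for each segment. A real pool changes constantly, which is the "extremely dynamic" part this simple version ignores.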

But even as the system exploits this information on the fly, it feeds data to a web-based dashboard used by Yahoo!'s human editors. These editors use this information to tweak the system's overall "business rules" and choose the pool of stories from which the system chooses. They also have the power to manually override the system at any time.

The system, Yahoo! says, helps editors create over 13 million different combinations of stories on the homepage each day – or 45,000 variations every five minutes. In that time, CORE processes 100GB of user feedback, including clicks, comments, Facebookian "likes", and links from other sites. The result is a system that reaches a middle ground between the human and the inhuman.

The system knows that women generally favor stories about Brad Pitt, but after some real-time analysis, it can quickly realize that men are far more likely to click on a Brad Pitt story that involves a sports movie. It can realize that aging Baby Boomers enjoy reading about Justin Bieber as much as the teenage set.

On some level, Ramakrishnan says, people are predictable. But on another, they are not. CORE tries to predict the unpredictable. And if statistics are a reliable judge, it has some success. ®
