How the Yahoo! homepage predicts your clicks

We! Know! You! Want! To! Click! Here!


In the summer of 2008, at an artificial intelligence confab deep in Silicon Valley, Yahoo! senior research scientist Deepak Agarwal revealed that the web giant was using automated algorithms to select news stories on its famous front page. These algorithms, he said, had boosted click-through rates by 25 to 30 per cent, driving millions of additional dollars in ad revenue.

When we approached Agarwal after his presentation to discuss the new technology and identified ourselves as The Register, he promptly buttoned his lip. We could only hope he would have done the same with The Wall Street Journal or The New York Times or Mom and Pop's Shoestring Guide to All Things Artificial Intelligence, and three years later, our hope is still alive.

This week, Raghu Ramakrishnan – Yahoo!'s chief scientist for search and cloud platforms – sat down with The Register to explain the technology in detail, boasting that click-through rates have now risen more than 270 per cent on the "Today" news module at the heart of the Yahoo! homepage. Known as CORE – short for Content Optimization and Relevance Engine – the system doesn't replace human editors. It works alongside them, making many but not all of the decisions, and at the same time, it feeds editors information that can inform their very human thinking.

"To be honest, the entire increase can't be attributed just to the algorithmic aspects. A big part of the increase is due to editors working much more effectively, with some of the interactive data and real-time feedback that they're getting," Ramakrishnan tells The Register. Yes, The Register. "That was one of the key decisions we made early on. We wouldn't try to replace editors. There would be certain things that always came from them."

When Ramakrishnan arrived at Yahoo! in the summer of 2006 – after nearly twenty years as professor of computer sciences at the University of Wisconsin-Madison – the Yahoo! homepage was arranged almost entirely by human editors. But he was soon approached by two other Yahoo! bigwigs – executive vice president Jeff Weiner and his engineering counterpart Venkat Panchapakesan – with the idea of moving to a more automated setup.

This gave rise to a Yahoo! Research project dubbed the Content Optimization Knowledge Engine – COKE, for short – and those behind the project were affectionately known as Cokeheads. With this system, Deepak Agarwal said at the time, human editors still chose the pool of stories that were eligible for the Today module, but then automated algorithms decided which stories got placed where – and for how long.

"The goal was to use data mining and machine learning to optimize the content shown to users on web portals," Agarwal said.

The system was first tested in late 2007 or early 2008, and it was soon selecting Today module stories for all Yahoo! users. According to Ramakrishnan, it immediately boosted click-through rates 40 per cent. That's a little higher than the figure Agarwal gave three years ago. But they seem to be in agreement that the effect was rather significant. "I still recall Venkat [Panchapakesan] accusing me of sandbagging things," Ramakrishnan says.

Before the system launched, Ramakrishnan predicted a 20 per cent boost, so he too was impressed by the initial spike. But it was merely a start. The initial system, Agarwal said, was based on the Kalman algorithm, a filtering method developed in the early 1960s. In essence, COKE determined where stories should be placed by analyzing millions of user clicks on the fly. "We track user responses," Agarwal said, "and then we respond ourselves - in real-time."
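As a rough illustration of the idea, and not Yahoo!'s actual implementation, a one-dimensional Kalman filter can track a single story's click-through rate as fresh batches of clicks arrive. The class name, the random-walk drift model, and the `process_var` value below are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class CtrKalmanFilter:
    """Scalar Kalman filter tracking one story's click-through rate (hypothetical)."""

    estimate: float     # current CTR estimate
    variance: float     # uncertainty in that estimate
    process_var: float  # assumed drift of the true CTR between updates

    def update(self, clicks: int, impressions: int) -> float:
        # Predict step: the true CTR may have drifted, so uncertainty grows.
        prior_var = self.variance + self.process_var
        if impressions == 0:
            self.variance = prior_var
            return self.estimate
        observed = clicks / impressions
        # Observation noise: binomial variance of the observed rate,
        # approximated with the current estimate (clamped away from 0 and 1).
        p = min(max(self.estimate, 1e-6), 1 - 1e-6)
        obs_var = p * (1 - p) / impressions
        # Correct step: blend prediction and observation by the Kalman gain.
        gain = prior_var / (prior_var + obs_var)
        self.estimate += gain * (observed - self.estimate)
        self.variance = (1 - gain) * prior_var
        return self.estimate


# Start from a vague guess of 5 per cent, then feed in batches of impressions.
tracker = CtrKalmanFilter(estimate=0.05, variance=0.01, process_var=1e-6)
for clicks, impressions in [(30, 1000), (28, 1000), (33, 1000)]:
    tracker.update(clicks, impressions)
```

The larger a batch, the smaller its observation noise, so the filter leans harder on big batches and barely moves on small ones.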

Back then, Yahoo! had tried to personalize story placement for individual users, but this didn't have a positive effect on the click-through rate. Since then, however, the company has settled on a personalization method that makes that initial 20 to 40 per cent spike look rather small. In January and then again in March, Yahoo! says, the Today module received over one billion clicks, with the click-through rate rising 270 per cent in the US since the automated setup first debuted.

About a year ago, Yahoo! changed the name of the project, dropping COKE for CORE. This is a tad unfortunate, but the system has grown up, spreading beyond the Today module to other Yahoo! services, including its primary news site. And Ramakrishnan says the company plans to plug it into countless other services.

The system still works to predict clicks for Yahoo! users as a whole. But at the same time, it predicts clicks for individual users or segments of users, leaning on information such as their sex and their age (which users supply when they sign up for a Yahoo! account) or even what browser they use.
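One simple way to combine whole-population and per-segment predictions is to keep click counts at both levels and fall back to the global figure when a segment has too little data. The sketch below does that for a single story. The class, the segment tuples, and the `min_impressions` threshold are hypothetical, and nothing in Yahoo!'s description says CORE works this way.

```python
from collections import defaultdict


class SegmentedCtr:
    """Per-segment click-through estimates for one story, with a global fallback."""

    GLOBAL = None  # key of the bucket that aggregates every user

    def __init__(self, min_impressions: int = 100):
        self.min_impressions = min_impressions
        # key -> [clicks, impressions]
        self.counts = defaultdict(lambda: [0, 0])

    def record(self, segment, clicked: bool) -> None:
        # Every impression counts toward its segment and the global bucket.
        for key in (segment, self.GLOBAL):
            self.counts[key][0] += int(clicked)
            self.counts[key][1] += 1

    def estimate(self, segment) -> float:
        clicks, shown = self.counts.get(segment, (0, 0))
        if shown < self.min_impressions:
            # Too little data for this segment: use the whole population.
            clicks, shown = self.counts.get(self.GLOBAL, (0, 0))
        return clicks / shown if shown else 0.0


story = SegmentedCtr(min_impressions=2)
story.record(("male", "40+", "firefox"), clicked=True)
story.record(("male", "40+", "firefox"), clicked=False)
story.record(("female", "18-24", "chrome"), clicked=False)
```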

Drawing on scads of existing data, when a story enters the system, CORE will generate an a priori estimate of how well a story will perform based on "intrinsic features", meaning what words are in the headline and the body of the story. Then, the system tests stories – in real-time – to get a better idea of how they will perform, and – in a matter of minutes – it uses these tests to adjust the way stories will be presented to everyone.
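The two-stage idea, a prior guess from the words in a story followed by a correction from live tests, can be sketched with a Beta prior over the click-through rate. This is an illustrative assumption, not a description of CORE's model. The per-word rates, the `prior_strength` weight, and both function names are made up for the example.

```python
def prior_ctr(headline: str, word_ctr: dict, default_ctr: float) -> float:
    """A priori estimate: average the historical CTR of each headline word."""
    words = headline.lower().split()
    if not words:
        return default_ctr
    return sum(word_ctr.get(w, default_ctr) for w in words) / len(words)


def posterior_ctr(prior_mean: float, prior_strength: float,
                  clicks: int, impressions: int) -> float:
    """Blend the prior with live test results (mean of a Beta posterior)."""
    # prior_strength acts like a number of imaginary impressions
    # that the prior estimate is worth.
    alpha = prior_mean * prior_strength + clicks
    beta = (1 - prior_mean) * prior_strength + (impressions - clicks)
    return alpha / (alpha + beta)


# Hypothetical historical click-through rates for individual words.
history = {"pitt": 0.08, "sports": 0.04}
guess = prior_ctr("Pitt sports movie", history, default_ctr=0.02)
# After a small live test of 100 impressions with 10 clicks:
revised = posterior_ctr(guess, prior_strength=100, clicks=10, impressions=100)
```

With no live data the estimate is just the prior, and as impressions pile up the observed rate takes over.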

"We don't do this for every story," Ramakrishnan said. "I want to show a few things to a few people and, based on that, have a good estimate for everything in my pool. And then exploit the ones that are the most promising."

"The game here is that I never actually predict true popularity, but I use parsimonious exploration. We have billions of impressions. A tiny fraction of them I'm willing to spend to explore and get an assessment of how popular a story is likely to be – for a given user or for a given segment of users."

Ramakrishnan says the methods used are similar to the "multi-armed bandit" algorithms used in the world of slot machines, but then go much further. "The difference is that we have an extremely dynamic pool of data. We've essentially developed extensions of this statistical approach," he says. "With these, we can come up with a very accurate estimate with how popular a story is likely to be – across the entire population; across a segment, such as males 40 years old and older; with [a particular user]; with [a particular user] while they're having their morning coffee."
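The explore-then-exploit trade-off Ramakrishnan describes can be illustrated with epsilon-greedy, the simplest textbook bandit strategy. It is a stand-in, not one of the extensions Yahoo! says it has developed. The class name and the `epsilon` value are assumptions for this sketch.

```python
import random


class EpsilonGreedy:
    """Spend a small fraction of impressions exploring, exploit the rest."""

    def __init__(self, n_stories: int, epsilon: float, rng=None):
        self.epsilon = epsilon            # fraction of impressions used to explore
        self.rng = rng or random.Random()
        self.shows = [0] * n_stories      # impressions per story
        self.clicks = [0] * n_stories     # clicks per story

    def estimate(self, story: int) -> float:
        # Stories never shown score 0.0, so only exploration surfaces them.
        if self.shows[story] == 0:
            return 0.0
        return self.clicks[story] / self.shows[story]

    def choose(self) -> int:
        if self.rng.random() < self.epsilon:
            # Explore: show a story picked uniformly at random.
            return self.rng.randrange(len(self.shows))
        # Exploit: show the story with the best estimate so far.
        return max(range(len(self.shows)), key=self.estimate)

    def record(self, story: int, clicked: bool) -> None:
        self.shows[story] += 1
        self.clicks[story] += int(clicked)


bandit = EpsilonGreedy(n_stories=3, epsilon=0.05)
story = bandit.choose()
bandit.record(story, clicked=True)
```

A small `epsilon` mirrors the "tiny fraction" of impressions spent on exploration. A dynamic story pool would also need a way to add and retire stories, which this sketch leaves out.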

But even as the system exploits this information on the fly, it feeds data to a web-based dashboard used by Yahoo!'s human editors. These editors use this information to tweak the system's overall "business rules" and choose the pool of stories from which the system chooses. They also have the power to manually override the system at any time.

The system, Yahoo! says, helps editors create over 13 million different combinations of stories on the homepage each day – or 45,000 variations every five minutes. In that time, CORE processes 100GB of user feedback, including clicks, comments, Facebookian "likes", and links from other sites. The result is a system that reaches a middle ground between the human and the inhuman.
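The two figures are close to consistent, as a quick check shows. The only assumption here is that the five-minute rate holds steady around the clock.

```python
FIVE_MINUTE_SLOTS_PER_DAY = 24 * 60 // 5  # 288 slots in a day
VARIATIONS_PER_SLOT = 45_000              # figure quoted by Yahoo!

variations_per_day = VARIATIONS_PER_SLOT * FIVE_MINUTE_SLOTS_PER_DAY
print(f"{variations_per_day:,}")  # prints 12,960,000
```

That comes to just under 13 million, so the daily figure looks like the five-minute figure rounded up.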

The system knows that women generally favor stories about Brad Pitt, but after some real-time analysis, it can quickly realize that men are far more likely to click on a Brad Pitt story that involves a sports movie. It can realize that aging Baby Boomers enjoy reading about Justin Bieber as much as the teenage set.

On some level, Ramakrishnan says, people are predictable. But on another, they are not. CORE tries to predict the unpredictable. And if statistics are a reliable judge, it has some success. ®
