How the Yahoo! homepage predicts your clicks
We! Know! You! Want! To! Click! Here!
In the summer of 2008, at an artificial intelligence confab deep in Silicon Valley, Yahoo! senior research scientist Deepak Agarwal revealed that the web giant was using automated algorithms to select news stories on its famous front page. These algorithms, he said, had boosted click-through-rates by 25 to 30 per cent, driving millions of additional dollars in ad revenue.
When we approached Agarwal after his presentation to discuss the new technology and identified ourselves as The Register, he promptly buttoned his lip. We could only hope he would have done the same with The Wall Street Journal or The New York Times or Mom and Pop's Shoestring Guide to All Things Artificial Intelligence, and three years later, our hope is still alive.
This week, Raghu Ramakrishnan – Yahoo!'s chief scientist for search and cloud platforms – sat down with The Register to explain the technology in detail, boasting that click-through rates have now risen more than 270 per cent on the "Today" news module at the heart of the Yahoo! home page. Known as CORE – short for Content Optimization and Relevance Engine – the system doesn't replace human editors. It works alongside them, making many but not all of the decisions, and at the same time, it feeds editors information that can inform their very human thinking.
"To be honest, the entire increase can't be attributed just to the algorithmic aspects. A big part of the increase is due to editors working much more effectively, with some of the interactive data and real-time feedback that they're getting," Ramakrishnan tells The Register. Yes, The Register. "That was one of the key decisions we made early on. We wouldn't try to replace editors. There would be certain things that always came from them."
When Ramakrishnan arrived at Yahoo! in the summer of 2006 – after nearly twenty years as professor of computer sciences at the University of Wisconsin-Madison – the Yahoo! homepage was arranged almost entirely by human editors. But he was soon approached by two other Yahoo! bigwigs – executive vice president Jeff Weiner and his engineering counterpart Venkat Panchapakesan – with the idea of moving to a more automated setup.
This gave rise to a Yahoo! Research project dubbed the Content Optimization Knowledge Engine – COKE, for short – and those behind the project were affectionately known as Cokeheads. With this system, Deepak Agarwal said at the time, human editors still chose the pool of stories that were eligible for the Today module, but then automated algorithms decided which stories got placed where – and for how long.
"The goal was to use data mining and machine learning to optimize the content shown to users on web portals," Agarwal said.
The system was first tested in late 2007 or early 2008, and it was soon selecting Today module stories for all Yahoo! users. According to Ramakrishnan, it immediately boosted click-through-rates 40 per cent. That's a little higher than the figure Agarwal gave three years ago. But they seem to be in agreement that the effect was rather significant. "I still recall Venkat [Panchapakesan] accusing me of sandbagging things," Ramakrishnan says.
Before the system launched, Ramakrishnan predicted a 20 per cent boost, so he too was impressed by the initial spike. But it was merely a start. The initial system, Agarwal said, was based on the Kalman algorithm, a filtering method developed in the early 1960s. In essence, COKE determined where stories should be placed by analyzing millions of user clicks on the fly. "We track user responses," Agarwal said, "and then we respond ourselves - in real-time."
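As a rough illustration of what tracking clicks with a Kalman filter can look like, here is a minimal scalar sketch in Python. It is not Yahoo!'s code: the function name, the random-walk model for a story's click-through rate, and the noise settings are all assumptions made for the example.

```python
def kalman_ctr_update(mean, var, clicks, views, process_var=1e-6):
    """One step of a scalar Kalman filter tracking a drifting click-through rate.

    Hypothetical sketch: `mean` and `var` are the current CTR estimate and its
    uncertainty; `clicks` and `views` are counts from the latest time window.
    """
    # Predict: assume the true CTR drifts as a random walk, so uncertainty grows.
    var = var + process_var
    if views == 0:
        return mean, var  # nothing observed in this window

    observed = clicks / views
    # Measurement noise: binomial variance of the observed rate,
    # evaluated at the current estimate (clamped away from 0 and 1).
    p = min(max(mean, 1e-6), 1 - 1e-6)
    obs_var = p * (1 - p) / views

    # Update: blend the prediction and the observation by their uncertainties.
    gain = var / (var + obs_var)
    mean = mean + gain * (observed - mean)
    var = (1 - gain) * var
    return mean, var


# Usage: start with a vague guess of 5 per cent, then feed in five windows
# in which a story draws 100 clicks per 1,000 views.
estimate, uncertainty = 0.05, 0.01
for _ in range(5):
    estimate, uncertainty = kalman_ctr_update(estimate, uncertainty, 100, 1000)
```

Each window pulls the estimate toward the observed rate, and the more views a window contains, the harder it pulls.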
Back then, Yahoo! had tried to personalize story placement for individual users, but this didn't have a positive effect on the click-through rate. But since then, the company has settled on a personalization method that makes that initial 20 to 40 per cent spike look rather small. In January and then again in March, Yahoo! says, the Today module received over one billion clicks, with the click-through rate rising 270 per cent in the US since the automated setup first debuted.
About a year ago, Yahoo! changed the name of the project, dropping COKE for CORE. This is a tad unfortunate, but the system has grown up, spreading beyond the Today module to other Yahoo! services, including its primary news site. And Ramakrishnan says the company plans to plug it into countless other services.
The system still works to predict clicks for Yahoo! users as a whole. But at the same time, it predicts clicks for individual users or segments of users, leaning on information such as their sex and their age (which users supply when they sign up for a Yahoo! account) or even what browser they use.
Drawing on scads of existing data, when a story enters the system, CORE will generate an a priori estimate of how well a story will perform based on "intrinsic features", meaning what words are in the headline and the body of the story. Then, the system tests stories – in real-time – to get a better idea of how they will perform, and – in a matter of minutes – it uses these tests to adjust the way stories will be presented to everyone.
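The two-stage idea described here, a first guess from the words in a story followed by a correction from live tests, can be sketched in a few lines of Python. Everything in this example is an assumption for illustration: the per-word click rates, the averaging rule, and the Beta-style update are stand-ins, not details Yahoo! has disclosed.

```python
def prior_ctr(headline, word_ctr, default=0.05):
    """A priori CTR guess from headline words (hypothetical scoring rule).

    `word_ctr` maps lowercase words to historical click-through rates;
    the guess is the average over the words we have history for.
    """
    known = [word_ctr[w] for w in headline.lower().split() if w in word_ctr]
    return sum(known) / len(known) if known else default


def posterior_ctr(prior, prior_strength, clicks, views):
    """Refine the prior with live test data using a Beta-Bernoulli update.

    `prior_strength` is how many pseudo-views the prior is worth: the larger
    it is, the more real traffic is needed to move the estimate.
    """
    alpha = prior * prior_strength + clicks
    beta = (1 - prior) * prior_strength + (views - clicks)
    return alpha / (alpha + beta)


# Usage with made-up numbers: the words suggest 5 per cent, but a live test
# of 200 views yields 30 clicks, so the estimate moves to roughly 11.7 per cent.
history = {"pitt": 0.08, "budget": 0.02}
guess = prior_ctr("Brad Pitt budget", history)
refined = posterior_ctr(guess, prior_strength=100, clicks=30, views=200)
```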
"We don't do this for every story," Ramakrishnan said. "I want to show a few things to a few people and, based on that, have a good estimate for everything in my pool. And then exploit the ones that are the most promising."
"The game here is that I never actually predict true popularity, but I use parsimonious exploration. We have billions of impressions. A tiny fraction of them I'm willing to spend to explore and get an assessment of how popular a story is likely to be – for a given user or for a given segment of users."
Ramakrishnan says the methods used are similar to the "multi-armed bandit" algorithms used in the world of slot machines, but then go much further. "The difference is that we have an extremely dynamic pool of data. We've essentially developed extensions of this statistical approach," he says. "With these, we can come up with a very accurate estimate with how popular a story is likely to be – across the entire population; across a segment, such as males 40 years old and older; with [a particular user]; with [a particular user] while they're having their morning coffee."
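For readers unfamiliar with bandit algorithms, the simplest textbook variant is epsilon-greedy: spend a small fraction of impressions on random exploration and show the best-known story the rest of the time. The Python sketch below applies that plain version per user segment. It is a stand-in for illustration only; Ramakrishnan describes Yahoo!'s methods as extensions that go well beyond this, and the class name, segment labels, and 5 per cent exploration rate are assumptions.

```python
import random
from collections import defaultdict


class SegmentBandit:
    """Epsilon-greedy story picker that keeps separate statistics per segment."""

    def __init__(self, stories, epsilon=0.05, seed=None):
        self.stories = list(stories)
        self.epsilon = epsilon            # fraction of impressions spent exploring
        self.rng = random.Random(seed)
        self.views = defaultdict(int)     # (segment, story) -> impressions
        self.clicks = defaultdict(int)    # (segment, story) -> clicks

    def estimate(self, segment, story):
        """Observed click-through rate for this story within this segment."""
        views = self.views.get((segment, story), 0)
        return self.clicks.get((segment, story), 0) / views if views else 0.0

    def choose(self, segment):
        """Explore with probability epsilon, otherwise exploit the best estimate."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.stories)
        return max(self.stories, key=lambda s: self.estimate(segment, s))

    def record(self, segment, story, clicked):
        """Feed one impression and its outcome back into the statistics."""
        self.views[(segment, story)] += 1
        self.clicks[(segment, story)] += int(clicked)


# Usage: two hypothetical segments learn different favorites.
bandit = SegmentBandit(["celebrity", "sports"], epsilon=0.05, seed=1)
bandit.record("men 40+", "sports", clicked=True)
bandit.record("men 40+", "celebrity", clicked=False)
bandit.record("women", "celebrity", clicked=True)
bandit.record("women", "sports", clicked=False)
```

The trade-off sits in `epsilon`: too low and a promising new story may never be tried, too high and impressions are wasted on stories already known to be weak.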
But even as the system exploits this information on the fly, it feeds data to a web-based dashboard used by Yahoo!'s human editors. These editors use this information to tweak the system's overall "business rules" and choose the pool of stories from which the system chooses. They also have the power to manually override the system at any time.
The system, Yahoo! says, helps editors create over 13 million different combinations of stories on the homepage each day – or 45,000 variations every five minutes. In that time, CORE processes 100GB of user feedback, including clicks, comments, Facebookian "likes", and links from other sites. The result is a system that reaches a middle ground between the human and the inhuman.
The system knows that women generally favor stories about Brad Pitt, but after some real-time analysis, it can quickly realize that men are far more likely to click on a Brad Pitt story that involves a sports movie. It can realize that aging Baby Boomers enjoy reading about Justin Bieber as much as the teenage set.
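The two figures Yahoo! quotes are consistent with each other, as a quick back-of-the-envelope check in Python shows. The only assumption is that the five-minute rate holds steady around the clock.

```python
MINUTES_PER_DAY = 24 * 60
intervals_per_day = MINUTES_PER_DAY // 5      # five-minute windows in a day
variations_per_interval = 45_000              # figure quoted by Yahoo!

variations_per_day = variations_per_interval * intervals_per_day
print(f"{variations_per_day:,}")  # 12,960,000, i.e. roughly 13 million
```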
On some level, Ramakrishnan says, people are predictable. But on another, they are not. CORE tries to predict the unpredictable. And if statistics are a reliable judge, it has some success. ®
skipped the entire article so I could write this...
Who the hell still uses Yahoo?
No-one's done this bit yet?
"...it predicts clicks [for users based on] what browser they use."
I can do that:
Opera users: ( well, we're too smart to fall for that ; P )
FireFox users: prefer anything with "Open Source" in it,
Safari users: go for anything with a picture of a partially-eaten fruit in it,
Chrome users: click on anything that asks for your personal information,
and [all together now]
IE users: Click On ALL The Malware Links!
Their homepage click prediction algorithm isn't that complicated in my case
I always go straight into Classic Mail and ignore all the bollocks.
"But at the same time, it predicts clicks for individual users or segments of users, leaning on information such as their sex and their age (which users supply when they sign up for a Yahoo! account)"
You mean there are people who don't fill those fields up with rubbish?