Original URL: http://www.theregister.co.uk/2011/08/23/ebay_cassini_search_rewrite/

'The most ambitious project at eBay for a long, long time'

Inside the auction leviathan's search megaproject

By Gavin Clarke

Posted in Cloud, 23rd August 2011 11:00 GMT

"It's one of the most ambitious projects that's been undertaken at eBay for a very, very long time," vice president of search and experience technology Hugh Williams tells us on the phone. "I'd argue it's on the scale of any major search engine re-write a company has done."

That's a meaty claim. eBay is one of the biggest names in tech, a survivor of the late 1990s dot-com bubble which has come to dominate the market for online auctions.

Not only that, but this isn't the first mega-search project Williams has worked on. Two years ago Williams was an actor in possibly the internet's greatest comeback story – the jury's still out – building Microsoft's Bing. Williams was a development manager working on Microsoft's multi-year, multi-million-dollar effort to steal Google's crown as web number 1.

And you don't need to be a search algorithm nerd to appreciate Williams' Bing work: it's right in front of your eyes. He introduced endless scrolling and playing video thumbnails, breaking new ground and winning a patent with others in the process.

Williams' eBay project is a complete rewrite of the piece of software that's been hardwired into eBay's annual $9.2bn business for nearly 10 years: its search engine.

eBay claims nearly 100 million active users with global search in 41 markets; more than half of all purchases made on eBay by users in those 41 markets start with the simple act of the customer conducting a search query. The rest are prompted to bid through things like email alerts. There are "thousands" of search queries conducted every second on more than 200 million live listings in eBay's inventory – 10 million are added or deleted daily.

Announced in October last year, the search-engine project – codenamed Cassini – is due to enter A/B testing in late summer with plans for full launch in 2012, Williams told me during a recent interview. I spoke to Williams before eBay pressed the button on testing.

Cassini will replace eBay's current search engine, called Voyager – yes, eBay's techies have a slight NASA fixation – which is knocking on for 10 years old and which Williams believes has been surpassed by advances in search technology in recent years.

Williams' vision for Cassini is a "world class" search engine. He wants something more intuitive for eBay's nearly 97 million active users, something that is itself smarter: something capable of understanding precisely what users want, and that doesn't shower them with heaps of tenuous results.

eBay also wants something that is easier and cheaper for its IT department to run and that can be easily extended without the need for another massive reset in 10 years' time.

SPARC-ing out

So big is the Cassini project, it has sucked in a quarter of the 400 eBay engineers under Williams from three of his teams: search back-end, search science, and search front-end. Williams says these are three of eBay's "largest and deepest" technology teams. They are building Cassini using thousands of lines of C++ and Java code, and massively expanding eBay's existing use of the open-source HBase and Hadoop to crunch terabytes of data.

Search currently runs on "thousands" of SPARC servers in eBay's distributed data centres around the world but, for added excitement, eBay plans to uproot from SPARC and head for "commodity servers" – industry shorthand for cheaper Intel or AMD x86 hardware.

The scale of the task is all the more remarkable when you consider that Williams is relatively new to leading such big, battle-ready, commercial software projects.

'A pretty applied guy'

The majority of Williams' employment history has actually been in academia: according to his CV he spent 10 years of his career in teaching and research at Australia's School of Computer Science and IT, RMIT University, where he also graduated. That's longer than Voyager has actually been in action at eBay. This is not to detract from Williams' search smarts: he's earned two degrees in computer science at RMIT, and has 101 published works to his name, works that include three books, 20 journals and five patents – including the work on Bing – on programming and/or search. Among his top publications: Compression of inverted indexes for fast query evaluation.

Williams reckons his work at Microsoft has helped on Cassini.

Williams joined eBay in 2009 after four years at Microsoft; he joined Redmond right after RMIT. At Microsoft, Williams was a development manager in charge of 150 software engineers working on Microsoft Live Search and Bing, a project that has received billions of dollars in extra funding in the last half decade from a company trying to catch Google from a standing start.

Williams managed the legendary Powerset team bought by Microsoft for $100m in an attempt to add to Bing the ability to understand the full meaning of the phrases that are typed into its search engine in order to deliver results based on that understanding.

It wasn't all Powerset and patents glory at Microsoft, though, as Williams worked on a feature that actually left Microsoft a little red-faced. He helped develop Bing's video search smart motion thumbnails, which let you play a 30-second clip of a video in your search returns when you moused over them. A good idea in a noble world, but in the impure world of reality it became a nifty way to dodge porn filters and grab a saucy 30-second eyeful. Microsoft rushed out a fix so wide-eyed parents could re-assert control over their precious PCs.

I asked Williams to comment on the incident, but he politely declined.

In a relatively short time, then, Williams has landed himself prominent roles leading all-or-nothing search projects for two of the business world's best-known names: Microsoft and eBay.

The Bing thing

Williams reckons he made the transition from academia at RMIT to working on full-blown, industrial-size software search projects because his research work was applied. "I'd say the transition from academic to industry setting wasn't much of a jolt because I've always been a pretty applied guy. I was never a blue-sky academic," he says.

He concedes, though, that while he knew a lot about things like ranking detection, the challenge was the learning curve in taking the ideas behind Cassini and running them on a distributed search system rather than on a single node.

And that's where Bing seems to have helped. "The big thing I took away from Microsoft was how to build commercial search to scale – that was one of my big takeaways," Williams said.

Cassini is adopting systems management through the software layer: from set-up and testing of a server to the monitoring of its health and taking action in the event of problems. It is also building in fault-tolerance and beefing up reliability so that eBay can move off SPARC – famed for its solid performance – onto the ever-so-slightly less renowned, but cheaper, x86.

Search has changed a lot over the past few years, and Voyager has been bypassed, Williams said. "The 1990s version of search was: you bought expensive machines and ran them with humans. The 2010s version is: you buy cheap machines and run them with software, so it all runs as a reliable system."

More data, more Hadoop, more patterns

Cassini exploits Williams' background in finding ways to improve search in order to hugely expand eBay's use of Hadoop in search. The secret to success is not a crushingly hyped idea like natural-language query or committing yourself to some mystical algorithm alchemy. Rather, it is making use of the data you have and will continue to amass on users' searches and search behavior – mining it to see what customers want. Doing this will make Cassini more intuitive than Voyager at working out what it is eBay's customers want, Williams reckons.

He gives the example of searching for the Snowboard Kids game for the Wii. Type "snowboard kids" into eBay today and, yes, you'll get the game, but you'll also get snowboards, goggles, boots, gloves, jackets and much, much more. For kids.
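One way a search engine could tell the game from the gear is to look at what earlier searchers actually clicked on. The following is a minimal Python sketch of that idea, not eBay's method: the click log, the category names and the `category_shares` and `rerank` helpers are all hypothetical, invented here for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical click log: (query, category of the listing the user clicked).
CLICK_LOG = [
    ("snowboard kids", "Video Games"),
    ("snowboard kids", "Video Games"),
    ("snowboard kids", "Video Games"),
    ("snowboard kids", "Sporting Goods"),
    ("kids snowboard boots", "Sporting Goods"),
]


def category_shares(click_log):
    """Return, per query, the fraction of past clicks that went to each category."""
    counts = defaultdict(Counter)
    for query, category in click_log:
        counts[query][category] += 1
    shares = {}
    for query, counter in counts.items():
        total = sum(counter.values())
        shares[query] = {cat: n / total for cat, n in counter.items()}
    return shares


def rerank(query, listings, shares):
    """Sort listings so categories that past users clicked for this query come first."""
    prior = shares.get(query, {})
    # sorted() is stable, so listings with equal priors keep their original order.
    return sorted(
        listings,
        key=lambda item: prior.get(item["category"], 0.0),
        reverse=True,
    )


SHARES = category_shares(CLICK_LOG)
LISTINGS = [
    {"title": "Kids snowboard 110cm", "category": "Sporting Goods"},
    {"title": "Snowboard Kids game", "category": "Video Games"},
    {"title": "Kids snowboard goggles", "category": "Sporting Goods"},
]
RANKED = rerank("snowboard kids", LISTINGS, SHARES)
```

In this toy data three of the four past clicks for "snowboard kids" went to video games, so the game listing is moved to the top while the snowboarding gear keeps its original relative order.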

"Voyager doesn't understand the past behaviors of users and intent behind users' queries," Williams said.

Cassini will mine data from eBay's 97 million active users using Hadoop in a massively parallel and distributed architecture in order to rank different items. Data patterns will be identified by crunching information on – among other things – corrections made by users to searches, contracted acronyms, expanded acronyms and words that are in different languages.
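To make the "corrections made by users" signal concrete, here is a small Python sketch of how such patterns might be pulled out of search sessions. It is an illustration under stated assumptions rather than a description of Cassini: the session data, the `mine_corrections` function and both thresholds are hypothetical, and string similarity comes from the standard library's `difflib.SequenceMatcher`.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical sessions: the queries one user typed, in order.
SESSIONS = [
    ["nitendo wii", "nintendo wii"],
    ["nitendo wii", "nintendo wii", "wii games"],
    ["ipod", "ipod nano"],
    ["nitendo wii", "nintendo wii"],
]


def mine_corrections(sessions, min_similarity=0.85, min_count=2):
    """Find query rewrites that look like spelling corrections.

    A pair (before, after) counts as a candidate when the two queries are
    consecutive in a session and nearly identical as strings. Only pairs seen
    at least min_count times are kept, to filter out one-off noise.
    """
    pair_counts = Counter()
    for queries in sessions:
        for before, after in zip(queries, queries[1:]):
            if before == after:
                continue
            similarity = SequenceMatcher(None, before, after).ratio()
            if similarity >= min_similarity:
                pair_counts[(before, after)] += 1

    corrections = {}
    # most_common() yields the most frequent rewrite of each query first.
    for (before, after), count in pair_counts.most_common():
        if count >= min_count and before not in corrections:
            corrections[before] = after
    return corrections


CORRECTIONS = mine_corrections(SESSIONS)
```

Here "nitendo wii" is rewritten to "nintendo wii" in three sessions, so it is kept as a correction. "ipod" followed by "ipod nano" is a refinement rather than a fix, and falls below the similarity threshold.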

"Understanding the user intent is a data rather than an algorithm task," Williams asserts. "We have been using Hadoop for a while – it has been around for two years. We haven't used it extensively for the things we are talking about now, but it was a component in the Voyager system. We will throw more data at it – more data and mining of that data to create richer tasks."

Hadoop is the open-source architecture inspired by Google's MapReduce – and initially championed by Yahoo! – to process huge sets of data by harnessing the power of large numbers of clustered servers. Hadoop's code is available under an Apache Software Foundation licence but it has received commercial support from Cloudera, the start-up that is home to Hadoop founder Doug Cutting. In June Yahoo! spun out the remaining members of its Hadoop engineering team to create Hortonworks, backed by the venture capitalist Rob Bearden from Benchmark Capital. Today Hadoop is used by Facebook and Twitter among other web-scale giants besides eBay.
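For readers who have not met the MapReduce model, the sketch below walks through its three steps (map, shuffle, reduce) in a single Python process, counting terms in a handful of search queries. It does not use Hadoop's API and says nothing about eBay's jobs: the function names and sample queries are hypothetical, and on a real cluster each phase would be spread across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input record (a query string) into (key, value) pairs."""
    for term in record.lower().split():
        yield term, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


QUERIES = ["snowboard kids", "kids snowboard boots", "Snowboard Kids wii"]
TERM_COUNTS = run_job(QUERIES)
```

Because each call to `map_phase` sees only one record and each call to `reduce_phase` sees only one key, both steps can run in parallel on separate servers, which is what lets the approach scale to very large data sets.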

At eBay, Hadoop is used by the search science team. Williams' engineers are working with a Hadoop engineering team and a Hadoop product team elsewhere at eBay on eBay's implementation. The engineering team works on the changes to Hadoop and HBase, the Hadoop database modelled on Google's BigTable for distributed storage, which eBay also uses. The team refines things like scheduling and makes sure the right jobs have the right priorities on eBay's Hadoop set-up, while delivering as much concurrency as possible across the thousands of servers running Hadoop.

The product team, meanwhile, deals with what comes out of the Hadoop changes that the first group have built. They clean the data spat out, manage the grid, and work with the 41 eBay marketplaces that rely on the search service and, by extension, Hadoop.

Real-time challenge

The search science team is working with the other two groups under Williams' control: the search back-end and search front-end teams. The back-end team is taking items from customers; processing and tagging them; constructing a product and shipping index; working with the caching layer; and handling updates. Twenty per cent of the goods in eBay leave the system each day, making this a full-time process. "It's probably the most challenging real-time search environment," Williams said.
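The churn is what makes this hard: an index that is rebuilt only occasionally would keep returning items that have already gone. As a rough illustration, and assuming nothing about Cassini's real design, the Python sketch below keeps a tiny in-memory inverted index that supports adding and removing listings on the fly. The `ListingIndex` class and its methods are hypothetical.

```python
from collections import defaultdict


class ListingIndex:
    """A toy inverted index whose listings can be added and removed at any time."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> ids of listings containing it
        self._terms = {}                   # listing id -> its terms, needed for removal

    def add(self, listing_id, title):
        """Index a listing, replacing any earlier version with the same id."""
        self.remove(listing_id)
        terms = set(title.lower().split())
        self._terms[listing_id] = terms
        for term in terms:
            self._postings[term].add(listing_id)

    def remove(self, listing_id):
        """Drop a listing (sold, ended or withdrawn) from every posting list."""
        for term in self._terms.pop(listing_id, set()):
            ids = self._postings[term]
            ids.discard(listing_id)
            if not ids:
                del self._postings[term]

    def search(self, query):
        """Return the ids of listings that contain every term in the query."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self._postings.get(term, set())
            result = set(ids) if result is None else result & ids
        return result


INDEX = ListingIndex()
INDEX.add(1, "Snowboard Kids game")
INDEX.add(2, "Kids snowboard boots")
BEFORE_SALE = INDEX.search("snowboard kids")
INDEX.remove(2)  # listing 2 leaves the system
AFTER_SALE = INDEX.search("snowboard kids")
```

Keeping a per-listing record of its terms means a removal touches only the posting lists that listing appears in, rather than scanning the whole index.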

The front-end team works on presenting the search results. The interface is built using mostly Java with some HTML, CSS and JavaScript.

Williams says that Cassini is one of the most ambitious projects he has been involved with, but at least he has the Bing project in his back pocket. "It's technically very hard in the search world and it's an engineering challenge because it involves so many people," Williams says. ®