Tilera routs Intel, AMD in Facebook bakeoff
Social Memcached marketing
Facebook may not think of itself as a social marketing company, but for upstart server-chip maker Tilera, the social media giant's internal Memcached bakeoff pitting Xeon and Opteron machines against Tilera boxes is a marketing windfall, indeed.
Facebook's Memcached performance paper, being presented at the International Green Computing Conference in Orlando, Florida, details how Facebook tested the mettle of the current generation of TilePro64 many-core processors against off-the-shelf servers using Intel Xeon and AMD Opteron processors.
Tilera, SeaMicro, and Calxeda have been waving the microserver banner for Hadoop data munching, Memcached Web caching, and other hyperscale Internet workloads, where smart interconnects and clever core designs can matter more than a big, fat, powerful processor core when it comes to running these distributed jobs.
SeaMicro, which builds a dense box based on dual-core, 64-bit Atom processors that crams 768 cores into a 10U chassis, recently showed off some Hadoop unstructured data crunching done on its machines, and compared it to the same work running on plain-vanilla Xeon boxes. The SM1000 server it tested, running a real-world Hadoop workload at a customer site, could do the job for about 25 per cent less money than a cluster of Xeon servers, in one quarter of the rack space, while burning one quarter of the juice.
Memcached was created in 2003 by Danga Interactive as a distributed Web cache that stores data in main memory and makes it accessible to Web servers and applications. Memcached is what is called a key-value store, and it is now used by Facebook, Twitter, Zynga, YouTube, Reddit, Flickr, and a slew of hyperscale internet companies that need to serve up data to millions of users, and can't wait for disk drives to do the job.
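For those unfamiliar with the model, a key-value cache boils down to get and set operations on opaque keys, with entries that expire after a time-to-live; a miss means the caller falls back to the database and re-caches the result. Here is a minimal in-process Python sketch of that idea – illustrative only, not Memcached's actual implementation, and the class and key names are invented for the example:

```python
import time

class MiniCache:
    """Toy in-process key-value cache illustrating the Memcached model:
    opaque keys map to values, each with a time-to-live, and a miss
    means the caller must fetch from the database and re-cache."""

    def __init__(self):
        self._store = {}  # key -> (value, expiry timestamp)

    def set(self, key, value, ttl=60):
        self._store[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None            # miss: never cached
        value, expiry = entry
        if time.monotonic() > expiry:
            del self._store[key]   # miss: entry expired
            return None
        return value

cache = MiniCache()
cache.set("user:42:name", "zuck", ttl=30)
print(cache.get("user:42:name"))   # cache hit
print(cache.get("user:99:name"))   # miss -> caller would hit the database
```

The real Memcached does the same thing across the network, sharding keys over a fleet of servers so the whole cache lives in aggregate RAM rather than on disk.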
"Facebook is known as the king of Memcached, and they run the most Memcached servers in the world, as far as we know," Ihab Bishara, director of cloud computing applications at Tilera, tells El Reg. "This is a tier-one customer validating the claims that we have been making for the past year and a half."
Bishara is not authorized to talk about Facebook's server plans, or to say whether the company has already installed the Tilera servers made by Quanta Computer in its production infrastructure. Quanta is, of course, the Taiwanese PC and server manufacturer that has just teamed up with Facebook to help manufacture its homegrown, open source Open Compute servers. Those machines debuted in April with Facebook's Prineville, Oregon data center, and will be updated this summer when Intel and AMD announce their new Xeon E5 and Opteron 6200 processors.
Facebook did its Memcached tests on the Quanta QS2 rack server – also known as the QSSC-X5-2Q – which crams 512 cores across eight processors in a 2U rack-mounted chassis.
Each processor is implemented as a single node, so the Quanta server is really an eight-node microserver. Four of the cores on the 32-bit TilePro64 processor were allocated to run Linux, leaving the other 60 cores to run the Memcached workload. The cores, which are widely believed to be a derivative of the MIPS architecture, run at 866MHz and have several mesh interconnects gluing together memory and I/O across the chip. (See this story for details on the Tile family of chips.) The TilePro64 server node had 32GB of main memory.
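That split – a few cores reserved for the operating system, the rest handed to the workload – can be mimicked on any multicore Linux box with CPU affinity. A hedged sketch in Python, using the TilePro64's 4-plus-60 split purely as illustrative numbers (os.sched_setaffinity is Linux-only, so the code guards for it, and the function name is invented for the example):

```python
import os

def pin_to_worker_cores(pid=0, reserved=4, total=64):
    """Illustrative: pin a process to cores [reserved, total),
    leaving the first `reserved` cores free for the OS -- the same
    idea as the TilePro64's 4-for-Linux, 60-for-Memcached split."""
    worker_cores = set(range(reserved, total))
    if hasattr(os, "sched_setaffinity"):  # Linux only
        # Clamp to the cores this machine actually has.
        available = os.sched_getaffinity(0)
        os.sched_setaffinity(pid, (worker_cores & available) or available)
    return worker_cores

cores = pin_to_worker_cores()
print(len(cores))  # 60 cores left for the workload
```

In production the pinning would be done once at daemon startup (or via taskset/cgroups); the point is simply that the OS and the cache never fight over the same cores.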
Facebook lined up the Tilera-based Quanta servers against a number of different server configurations using Intel's four-core Xeon L5520 processor running at 2.27GHz and AMD's eight-core Opteron 6128 HE running at 2GHz. Both of these x64 chips are low-voltage, low-power variants. Facebook ran the tests on single-socket 1U rack servers with 32GB of memory and on dual-socket 1U rack servers with 64GB.
All three machines ran CentOS Linux with the 2.6.33 kernel and Memcached 1.2.3h.
There's a lot of very detailed Memcached performance information in the Facebook paper that describes how the TCP and UDP protocols affect performance on these various machines, but this graph is a good snapshot of how the machines stack up:
Memcached performance on Opteron, Xeon, and Tilepro64 servers
As you can see, throughput in transactions per second scales poorly on the x64 servers as Memcached is given more cores. On the Opteron machines, for example, going beyond four cores actually hurts performance, and adding a second CPU gets you precisely nowhere.
The Xeon chips do a little bit better, but adding the second processor also gets you nothing. It would be better to scale up multiple single-socket Opteron or Xeon nodes – as Quanta is doing with the Tilera chips.
But what is immediately obvious is that – at least compared to these low-power, low-core-count Opterons and Xeons – the TilePro64 with 30 cores can meet or beat what the x64 chips can do. And with 60 cores dedicated to Memcached, the TilePro64 crushes them.
Obviously both Intel and AMD have more modern processors than these, and soon will have even newer ones. Tilera has just started sampling its Tile-Gx 3000 series of 64-bit, 36-core chips, which will eventually scale up to 100 cores, too.
Performance is only one of the system-level issues a company like Facebook has to weigh, alongside electricity use and thermals (two sides of the same coin), physical size, and cost. Facebook shed some light on power use in its paper, too. Based on the performance tests, here is how the machines stack up in terms of electricity consumed:
Performance and power consumption of Tilera and x64 servers
Based on these measurements, Facebook then extrapolated how many nodes it would take to build a 256GB Memcached cluster, and then looked at its performance and power efficiency – and Tilera's chips stomped Intel's and AMD's.
An eight-node Quanta server using the TilePro64 chips could handle 2.68 million TPS and burned 462 watts, delivering 5,801 TPS/watt. A four-node Opteron cluster could deliver 660,000 TPS on the Memcached workload while burning 484 watts, delivering a mere 1,363 TPS per watt. And a four-node Xeon (with 256GB in aggregate as well) delivered more oomph than the Opteron machines at 752,000 TPS, and also consumed less power at 400 watts. But those four Xeon machines could only deliver 1,880 TPS/watt – less than a third of the bang per watt of the TilePro64-based machine.
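Those efficiency figures fall straight out of the raw throughput and power numbers. A quick back-of-the-envelope check in Python (the dict layout and cluster labels are just for illustration; the TPS and wattage values are the ones quoted above):

```python
# Raw cluster numbers from Facebook's paper, as quoted above:
# aggregate transactions per second and measured watts.
clusters = {
    "TilePro64, 8 nodes": (2_680_000, 462),
    "Opteron, 4 nodes": (660_000, 484),
    "Xeon, 4 nodes": (752_000, 400),
}

for name, (tps, watts) in clusters.items():
    # Efficiency is simply throughput divided by power draw.
    print(f"{name}: {tps / watts:,.0f} TPS/watt")
```

Divide the figures out and you land on roughly 5,800, 1,360, and 1,880 TPS/watt – matching the article's numbers to rounding, and putting the Tilera box more than three times ahead of its nearest x64 rival.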
And to top it all off, the TilePro64 machine only took up 2U of space, compared to 4U for the x64 boxes. ®