Original URL: http://www.theregister.co.uk/2011/06/28/facebook_open_compute_2_preview/

Facebook reveals next-gen Open Compute wares

Double your servers, double your fun

By Timothy Prickett Morgan

Posted in Servers, 28th June 2011 00:05 GMT

Facebook's Open Compute Project, founded to open source the social media mogul's server and data center designs, has hosted its first meeting, previewing its next-generation server and storage iron.

While a lot of companies give mere lip service to compute density and performance per watt, hyperscale web companies such as Facebook will profit or not based on how well their infrastructure scales, how little it costs to acquire and operate, and how densely it can be crammed into data centers.

Back in April, Facebook launched the Open Compute Project with a vanity-free 1.5U rack-mounted chassis that sports custom two-socket motherboards based on Intel Xeon 5600 and Advanced Micro Devices Opteron 6100 processors, fed by a 450 watt power supply. Every feature not required by Facebook's applications has been ripped out of the custom motherboards, which are supplied by Taiwanese mobo and server maker Quanta Computer.

With many more cores expected from the future "Sandy Bridge" Xeon E5s and the "Valencia" and "Interlagos" Opteron 4200s and 6200s, Facebook's hardware design manager Amir Michael tells The Register that the company first took a stab at using fatter four-socket boxes to run its mix of applications. The idea, says Michael, was simple: with a four-socket server, you can in theory use one fast network pipe and one very efficient power supply, and eliminate some of the components you would need in a pair of two-socket servers.

But, says Michael, for Facebook workloads the SMP/NUMA architecture of a four-socket x64 box creates as many problems as it solves. In prototype tests run by Facebook, its workloads don't scale well on four-socket machines once core counts climb – much like many other workloads out there in the real world. "You also need to take much tighter control of the memory inside of the machine," Michael explains, saying that if you are not careful, you end up having to do multiple hops inside of a machine to do processing after work has been dispatched to a node. "Performance degrades even more at that point."
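To make that locality issue concrete, here is a minimal sketch – not Facebook's code, and assuming a Linux box where socket 0 happens to own logical CPUs 0 through 7 – of the kind of NUMA hygiene Michael is describing: pin a worker to one socket so that, under Linux's default first-touch policy, its working set is allocated from that socket's local memory rather than fetched over the interconnect on every access.

    import os

    # Hypothetical topology for illustration: socket 0 owns logical CPUs 0-7.
    SOCKET0_CPUS = set(range(8))

    # Restrict this process (pid 0 means "self") to socket 0's cores.
    os.sched_setaffinity(0, SOCKET0_CPUS)

    # With the process pinned, pages faulted in from here on are allocated from
    # socket 0's local memory node under Linux's default first-touch policy,
    # so the worker avoids reaching across the socket interconnect.
    working_set = bytearray(256 * 1024 * 1024)  # 256 MB buffer
    for offset in range(0, len(working_set), 4096):
        working_set[offset] = 1  # touch each page to keep the allocation local

Let a worker sprawl across the sockets of a four-socket box without that kind of discipline and its memory ends up a hop or two away – which is the scaling penalty Facebook measured in its prototypes.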

And so with the next generation of Open Compute platforms, Facebook is doing what many hyperscale companies already do: putting two half-width, two-socket servers inside the chassis to double up the compute density. This arrangement is sometimes called a twin server, and you can see a mock-up of the future Open Compute server machine in a blog post – on Facebook, of course.

Future Facebook double-stuffed server

Given that Intel and AMD have not yet announced their Xeon E5 and Opteron 4200/6200 processors, Michael is not at liberty to provide the detailed feeds and speeds on these future half-width servers. Details of the machines will be published on the Open Compute site once the chips and chipsets are announced, but Michael could talk about the chassis and server in general.

The Open Compute chassis has disk-drive carriers in the back; pull these out, and a 6.5-inch by 20-inch half-width, two-socket motherboard slides right into the existing chassis, butting up against the fans.

Michael says that if there were a standard size for half-width boards, Facebook would have used it (as it does for the existing two-socket machines), but half-width boards range in width from 6.5 to 6.6 inches and anywhere from 18 to 20 inches in length.

The server design keeps the disk drives in the front of the chassis, as before: two mounted on each server mobo, stacked atop each other. From the mock up, it appears that a few more disks are packed onto the right of the machine.

The processors are behind the disks, and because of the tight packing of the components, airflow will be warmed as it passes over the disks, to the first processor, and then to the second processor. The earlier Open Compute designs tried to avoid "shadowing" – server-speak for having one component heating up the cooling air for another component – but in a twin design, this is very tough to avoid.

Given this, Facebook knew that it would need to crank up the fans a bit to keep it all cool, but fan power was only increased from about 2 per cent of overall server power draw to around 3 per cent, according to Michael.

The server node is based on 1.35 volt memory, not the standard 1.5 volt sticks used in servers last year, and Michael doesn't think it will be long before 1.25 volt memory becomes more common, helping to shave power consumption inside the box – a little.

Having two whole servers in the 1.5U chassis also means needing more juice, but even here Facebook is pushing up efficiency, moving from one 450 watt power supply feeding a single node to one 700 watt unit shared by the pair – 350 watts of capacity per node instead of 450.

Bumped performance, looser thermal margin

It's too early to tell what kind of performance boost to expect from the future servers, but Michael says that given the coming increases in core counts, clock speeds, memory capacity and speed, and other factors, Facebook expects a server node to deliver at least 50 per cent more oomph on its workloads. And this time around, there's a bit of thermal-envelope margin should Facebook want to goose a component or two.

The future motherboards that Facebook designed with Quanta have extra I/O lands to support 10 Gigabit Ethernet ports, and PCI Express mezzanine cards to add more I/O capability. The machines also do away with external baseboard management controllers on the Intel mobos, exploiting Intel's Management Engine BIOS Extension and a subset of functions in Intel's chipsets to handle all the remote management functions that the BMC service processor was doing. This functionality is not available in the AMD-based machines, so on those Facebook is going with a barebones BMC rather than some high-end – and relatively costly – option.
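As a rough illustration of what those remote management functions amount to – this is a generic sketch, not Facebook's tooling, and the host address, credentials, and use of the stock ipmitool utility are all assumptions – a barebones BMC can be asked over the network whether a node is powered up:

    import subprocess

    def chassis_power_status(host, user, password):
        """Query a node's BMC over IPMI (lanplus) for its chassis power state."""
        result = subprocess.run(
            ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password,
             "chassis", "power", "status"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    # Hypothetical node address and credentials, for illustration only.
    print(chassis_power_status("10.0.0.42", "admin", "secret"))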

Facebook goes through four phases as it rolls out servers: EVT, DVT, PVT, and mass production. EVT is short for engineering verification and test, when prototype boards come back from ODM partners and low-level signal checking is done on components.

The design verification and test phase – DVT – comes next, when a set of higher-level tests are done on prototype systems. In this phase, Facebook looks for system flaws and also performs early tests on its software stack.

The PVT phase – production verification and test – requires suppliers to simulate volume production of components and systems, and to deliver completed machines, preinstalled in racks, to Facebook data centers. Production workloads are run on the boxes in the PVT phase, and once they pass muster, Facebook places the big order and mass production begins.

In addition to engineering the new servers, Facebook also had to tweak its battery backups to handle the additional load. The battery cabinets that Facebook designed as companions to its rack servers can now take 85 kilowatts of load, up from 56 kilowatts in the first generation of machines – a roughly 50 per cent boost that tracks the jump from 450 watt to 700 watt power supplies in each chassis.


The Open Compute storage array

Michael also showed off a storage array that puts two disk controllers and two sets of 25 disk drives into a single chassis. The blog post says that the design provides flexibility by allowing you to vary the ratio of storage capacity to compute capacity to reflect the needs of different workloads.

This disk array is still in its testing phases, so Michael was a bit cagey about what is in the box, but it reminds me of Sun Microsystems' "Thumper" X4500 storage arrays, which were based on a two-socket Opteron motherboard with six eight-port SATA disk controllers on the board.

In both the Facebook and Sun arrays, the disk drives mount vertically into the chassis from above, rather than horizontally as they usually do in servers. It looks like the Open Compute storage array is doing five rows of five drives per block, and putting two blocks into the box – which matches those two sets of 25 drives, for 50 spindles in all.

Given the stinginess of hyperscale data center operators, those disks are almost certainly cheap 3.5-inch SATA drives. ®