Penguin Computing muscles into the ARM server fray
Aiming Cortex-A9 clusters at Big Data
Linux cluster supplier Penguin Computing is diving into the low-power ARM microserver racket and has tapped server chip upstart Calxeda – which has just rolled out its multiyear product roadmap for its EnergyCore processors – as its chip and interconnect supplier for its first boxes.
The new machine, called the Ultimate Data X1, is based on the twelve-slot SP12 backplane board created by Calxeda, just like the Viridis server from UK server-maker Boston. The experimental "Redstone" development server from Hewlett-Packard also uses the SP12 backplane board, putting the three of them on a full-depth SL6500 server tray and four trays in a 4U chassis for a total of 72 server nodes in a 4U space.
The Calxeda EnergyCard system board puts four quad-core EnergyCore ECX-1000 processors onto a board, plus memory slots and SATA ports with two PCI-Express connectors to link each pair of sockets into the backplane. Each processor is based on four Cortex-A9 cores, which support 32-bit memory addressing and therefore tap out at 4GB of main memory in the single DDR3 slot allocated to each processor socket. Each socket has four SATA ports as well for peripherals.
The interesting bit about the EnergyCore chip is that it includes a distributed L2 switch, which can be used to hook up to 4,096 sockets into a flat cluster using a variety of network configurations, including mesh, fat tree, butterfly tree, and 2D torus interconnections of system nodes. The first generation fabric switch, which has been rebranded the Fleet Services Fabric Switch as part of the expanded Calxeda roadmap, is an 8x8 crossbar with 80Gb/sec of bandwidth, and it links out to five 10Gb/sec XAUI ports and six 1Gb/sec SGMII ports that are multiplexed with the XAUI ports.
There are three 10Gb/sec channels that come out of each EnergyCore chip that are used to link to the three adjacent sockets on the system board, so they can share data very quickly. The five other ports are used to link sockets on other EnergyCard server boards to each other. Latencies between server nodes vary depending on the network configuration and the number of hops it takes to jump from socket to socket across the cards and backplanes, but working through the Fleet Services distributed network, you can do a node-to-node hop in about 200 nanoseconds, according to Calxeda.
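To get a feel for how hop count drives latency, here is a rough back-of-the-envelope sketch in Python. It assumes a 2D torus (one of the topologies named above), a hypothetical 64 x 64 layout of the 4,096 sockets, and a flat 200 nanoseconds per hop; the function names are made up for illustration and real latencies will depend on the actual fabric configuration.

```python
# Rough estimate of node-to-node latency on a 2D torus.
# Assumptions (not from Calxeda): a 64 x 64 grid of sockets and a
# constant per-hop cost equal to the ~200 ns figure quoted above.

HOP_LATENCY_NS = 200  # per-hop latency cited in the article


def torus_hops(a, b, width, height):
    """Minimum hop count between grid coordinates a and b on a 2D torus."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # On a torus each dimension wraps around, so take the shorter way.
    return min(dx, width - dx) + min(dy, height - dy)


def estimated_latency_ns(a, b, width, height, hop_ns=HOP_LATENCY_NS):
    """Hop count multiplied by the assumed constant per-hop latency."""
    return torus_hops(a, b, width, height) * hop_ns


if __name__ == "__main__":
    W = H = 64  # 64 x 64 = 4,096 sockets (hypothetical layout)
    print(estimated_latency_ns((0, 0), (63, 0), W, H))   # wraparound neighbour
    print(estimated_latency_ns((0, 0), (32, 32), W, H))  # farthest node
```

Under these assumptions the worst case on a full 4,096-socket torus is 64 hops, so the 200 nanosecond figure is best read as the cost of a single hop, not of an arbitrary node-to-node transfer.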
That's better than the performance of most low latency 10 Gigabit Ethernet top-of-rackers aimed at high freaky trading. If that interconnect could do cache coherency across all (or even a large portion) of those 1,024 EnergyCard nodes, we'd be calling this the God Box.
Boston's Viridis machine puts one SP12 backplane card, a dozen EnergyCards, and two dozen 2.5-inch drives into a 2U chassis. But Penguin Computing, seeking to use cheaper and fatter 3.5-inch SATA drives, has opted for a 4U chassis for the UDX1 that houses three dozen 2TB fatties as well as a single SP12 backplane board and a dozen EnergyCards. There are 24 drives across the front of the unit and another 12 drives buried inside the unit, giving a higher disk-to-core ratio than the Viridis box.
Penguin Computing's UDX1 ARM server
Loaded up, the UDX1 machine will run around $35,000, with variation depending on CPU speed, memory capacity, and disk options. Penguin Computing says that depending on how modern your x86 clustery is, this 4U chassis can replace anywhere from a quarter to a half rack of x86 iron and switches.
Penguin Computing will be showing off the UDX1 system at Strata Hadoop World next week in New York, apparently with much-awaited benchmarks for Hadoop Big Data munching. It is not clear how the balance of CPU cores, memory, and disk drive spindles will play out on ARM servers, but it could turn out that the number of spindles does not need to be as large per socket as on x86-based machines.
In general, the number of drives per socket has been going up with the number of cores per socket on x86-based machines, and generally speaking, Hadoop machinery likes to have one disk per processor core and you tend to use the fattest disk you can afford. The speed of the drive and the speed of the core take a back seat to capacity, given the volumes of data that most Hadoop clusters are wrestling with.
In general, Hadoop clusters are not network I/O or CPU bound, like many traditional supercomputer workloads, but rather disk-bound since you can only make a disk drive spin so fast or hold so much data. (The fatter drives move slower, so there are tradeoffs here, too.) But it will be interesting to see what a balanced Calxeda server might look like.
Based on the Penguin setup, which has 192 cores and 36 drives, it looks like the machine is a little bit light on disks. The wonder is why Penguin Computing didn't build a 5U chassis with 48 drives, one for each EnergyCore socket in the box, and I will ask the company about that when I see the demo next week. The answer might be that you only use eight EnergyCards in the box as Hadoop compute nodes and use the remaining four cards as NameNodes and other management nodes in the cluster, giving you a socket-to-disk balance.
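The balance arithmetic can be sketched in a few lines of Python. The card, socket, core, and drive counts come from the configuration described above; the eight-card compute split is the speculation from the previous paragraph, not a confirmed Penguin configuration, and the function name is purely illustrative.

```python
# Disk-to-socket and disk-to-core ratios for a UDX1-style chassis.
# Counts are taken from the article; the "compute_cards" split is a guess.

SOCKETS_PER_CARD = 4  # four EnergyCore sockets per EnergyCard
CORES_PER_SOCKET = 4  # quad-core Cortex-A9


def balance(drives, compute_cards):
    """Return socket/core totals and drive ratios for the compute cards."""
    sockets = compute_cards * SOCKETS_PER_CARD
    cores = sockets * CORES_PER_SOCKET
    return {
        "sockets": sockets,
        "cores": cores,
        "drives_per_socket": drives / sockets,
        "drives_per_core": drives / cores,
    }


if __name__ == "__main__":
    # All twelve cards doing Hadoop compute work.
    print(balance(drives=36, compute_cards=12))
    # Eight compute cards, four held back for management duties.
    print(balance(drives=36, compute_cards=8))
```

With all twelve cards counted, the box has 48 sockets and 0.75 drives per socket; with only eight cards counted as compute, that becomes 32 sockets and 1.125 drives per socket, which is closer to one disk per socket but still well short of one disk per core.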
The UDX1 machines might be configured to be suitable not just for Hadoop, but other Big Data workloads like risk analysis, genomics, and seismic processing where computing oomph is important but so are fast networking and flat networks.
It will be interesting to see what the power draw, performance, and cost are on the UDX1 running various workloads and how that compares to Xeon and Opteron machinery configured with 10GE ports and switches. ®
Cache coherent interconnect would only be useful if...
...the processors have enough physical address bits to allow direct addressing across all the memory attached to the interconnect. Cortex-A9 can only address 4GB in total, so to get anywhere near addressing the memory on 4096 sockets, you'd only be able to put 1MB on each socket, which seems a bit small for today's software... :-)
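For anyone who wants to check the sum, a quick Python sketch, assuming a flat 32-bit physical address space split evenly across every socket (the function name is just for illustration):

```python
# How much memory each socket could expose if one 32-bit address space
# had to cover every socket on the fabric (even split assumed).

def bytes_per_socket(address_bits, sockets):
    """Evenly divide a 2**address_bits byte address space across sockets."""
    return (2 ** address_bits) // sockets


if __name__ == "__main__":
    share = bytes_per_socket(address_bits=32, sockets=4096)
    print(share // (1024 * 1024), "MB per socket")
```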
Also, are you sure you even *want* it? I was somewhat involved with the 1536 processor Altix 3700 system that was installed at my workplace (nf.nci.org.au); it seemed that SGI were keen for us to run it as a few honking big SMP boxes, but the exposure to component failure that you get with a few huge SMPs means it only really makes sense for jobs which necessarily take the whole system. AFAIK that's how NASA ran their Columbia Altix cluster. We ran a big mix of workloads across that number of CPUs, and so a failure that crashed a 512-1024 CPU SMP would have killed a lot of jobs that had no dependency on the failed part.
Even when the Altixes were run as a cluster of 32-64 CPU SMPs, with the same interconnect serving to run MPI between SMP boxes, the cache coherency in the interconnect was still there, and could lead to cascading failures if you didn't shut down a failed SMP box in just the right way; memory shared between SMP nodes for MPI communication was actually cache coherent with the other nodes mapping the same memory, so a failure in one node could cause other nodes to fail if cache lines for other machines' memory got "stuck" on the failed node. Not pretty, and made worse by all the Clustered XFS storage fencing that happened as nodes died; if enough fence-outs happened in a short time, it could cause the Brocade director-class FC switches to hang, with further ensuing hilarity.
Moral is, be very sure that you want the very tightly coupled thing, because you'll pay for the complexity one way or another...
This sounds like the set up for an episode of campy 1960s Batman.
The question is why has The Penguin become so interested in ARM microservers. What is his fiendish plan?
And here I was gonna ask if it ran the new Win-ate, innately.