IBM recasts Power 775 Blue Waters super as big data muncher
GPFS storage servers don't need no stinkin' RAID controllers
IBM has yanked the chain on the petaflopping 'Blue Waters' supercomputer that it was going to install at the University of Illinois this fall because it was too expensive to make at the budgeted price.
But that doesn't mean the Power 775 server nodes that comprised IBM's Blue Waters machine are not useful. Ditto for the PERCS software stack that IBM created for the US Defense Advanced Research Projects Agency, which partially funded the development of the machine. And so IBM is gradually commercializing bits of the Blue Waters/PERCS machine, starting with two storage servers tuned up specifically for big data jobs.
The Productive, Easy-to-use, Reliable Computing System project, or PERCS for short, got its initial $53m development award from DARPA in July 2003 for creating a better programming environment, and it got a secondary $244m grant for full system development in November 2006, including the hardware that was eventually launched as the Power 775 in July 2011.
IBM is currently building a Power 775 cluster for DARPA to play with, and the National Center for Supercomputing Applications at the University of Illinois was supposed to buy a larger cluster in the range of 10 petaflops based on the same iron, but Big Blue backed out of the deal because it was going to lose money. (NEC and Hitachi did the same thing with the K supercomputer in Japan after most of the development was done, leaving Fujitsu alone to build the grunting 10.51 petaflops Sparc64-based machine.)
The Power 775 is based on the eight-core Power7 processor, and puts four of these running at 3.84GHz onto a single multichip module. This module dissipates 800 watts, and hence it needs to be water cooled, but it has as much oomph as an enterprise-class server.
The Power 775 node has eight of these units on it, for a total of 256 cores and delivering 9 teraflops of double-precision floating point oomph, on a single and massive motherboard. IBM was originally projecting around 8 teraflops when the machine was previewed back in November 2009.
Each MCM chip socket has sixteen memory slots, and you put 16GB sticks in there for 2TB of memory on that single node. Each compute module is linked into a 1,128GB/sec hub/switch module that fits into the same style of socket as the compute module; there's one of these PERCS hub/switch modules per node, which links the eight MCMs to each other through optical links and to the 16 PCI-Express peripheral slots at the back of the node.
The hub/switch also reaches out through electrical links to switches in other nodes. You can, in theory, scale this PERCS machine to 2,048 nodes, or 524,288 cores, for an aggregate of around 16 petaflops.
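For anyone who wants to check the node math, here is a minimal sketch that multiplies out the figures quoted above. The constant names are invented for illustration, and nothing here comes from IBM tooling.

```python
# Figures quoted in the article for one Power 775 node (names are illustrative).
CORES_PER_CHIP = 8        # eight-core Power7
CHIPS_PER_MCM = 4         # four chips per multichip module
MCMS_PER_NODE = 8         # eight MCMs per node
DIMM_SLOTS_PER_MCM = 16   # sixteen memory slots per MCM socket
DIMM_GB = 16              # 16GB sticks
MAX_NODES = 2048          # theoretical scaling limit

cores_per_node = CORES_PER_CHIP * CHIPS_PER_MCM * MCMS_PER_NODE
memory_gb_per_node = MCMS_PER_NODE * DIMM_SLOTS_PER_MCM * DIMM_GB
max_cores = cores_per_node * MAX_NODES

print(f"cores per node:  {cores_per_node}")
print(f"memory per node: {memory_gb_per_node} GB ({memory_gb_per_node // 1024} TB)")
print(f"cores at {MAX_NODES:,} nodes: {max_cores:,}")
```

The products come out to 256 cores and 2TB per node, and 524,288 cores at full scale, matching the numbers in the text.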
The Blue Waters/PERCS setup also has a companion 4U drawer that holds 384 disk drives, and the interesting bit about the PERCS project is that IBM extended its General Parallel File System (GPFS) for supercomputers to provide it with software-based RAID parity protection and data striping.
What this means is those internal storage arrays don't need to have hardware-based RAID but instead steal some processor cycles from the Power7 processors to implement RAID data protection. It wouldn't be surprising to see IBM put a RAID unit on future Power8 chips to speed this up and offload it from the CPUs proper.
IBM didn't just implement RAID 5 data protection in software with GPFS Native RAID, but tweaked the algorithms underlying it a little bit to make it more resilient and better performing in the event of disk failures. This is something you have to do in a multi-petaflops system that could have in excess of 100,000 disk drives in its related storage cluster.
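To make the idea of parity done by the CPU concrete, here is a minimal sketch of plain single-parity protection in the RAID 5 style, where the parity strip is the XOR of the data strips. This is the textbook technique, not GPFS Native RAID's own encoding, and the function name and sample data are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Four data strips (invented sample data) and the parity strip computed from them.
data = [b"blue", b"wate", b"rs!!", b"gpfs"]
parity = xor_blocks(data)

# Pretend the drive holding strip 2 has died; rebuild it from the survivors.
lost = 2
survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[lost])
```

Every byte of parity costs a pass over the data strips, which is the processor work that a hardware RAID controller would otherwise absorb.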
The tweaked RAID data protection in GPFS Native RAID
IBM is using Reed Solomon encoding on parity data, which is generated as files come into a RAID group and which is then used in reverse to recreate missing data after a drive failure. But rather than group drives in units of five, with the fifth drive holding the parity data, IBM is "declustering" the RAID sets in an array of 20 drives, with data and parity stripes uniformly partitioned and distributed across a 20-drive enclosure.
In fact, the RAID set can be even larger than this if you want. That means in the event of a failure of one drive, the remaining 19 in the set are all helping to serve up missing data while the failed drive is being rebuilt, rather than just hammering the one RAID group as you would with a five-drive set.
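The effect of declustering on rebuild traffic can be sketched with a toy model. The layout below is an assumption chosen for easy counting, with one five-wide stripe placed on every possible five-drive subset of the 20 drives; it is not IBM's actual placement algorithm, and all names are hypothetical.

```python
from collections import Counter
from itertools import combinations

DRIVES = 20
WIDTH = 5  # assumed stripe width: 4 data strips + 1 parity strip

def declustered_layout():
    """Toy layout: one stripe on every 5-drive subset of the 20 drives."""
    return list(combinations(range(DRIVES), WIDTH))

def conventional_layout(stripes_per_group):
    """Four fixed 5-drive groups, each holding its own stripes."""
    groups = [tuple(range(start, start + WIDTH)) for start in range(0, DRIVES, WIDTH)]
    return [group for group in groups for _ in range(stripes_per_group)]

def rebuild_reads(layout, failed):
    """Count the strips each surviving drive must read to rebuild `failed`."""
    reads = Counter()
    for stripe in layout:
        if failed in stripe:
            reads.update(d for d in stripe if d != failed)
    return reads

declustered = declustered_layout()
strips_per_drive = sum(1 for stripe in declustered if 0 in stripe)
# Give the conventional layout the same number of strips on every drive.
conventional = conventional_layout(strips_per_drive)

conv_reads = rebuild_reads(conventional, failed=0)
decl_reads = rebuild_reads(declustered, failed=0)

print(f"conventional: {len(conv_reads)} drives busy, {max(conv_reads.values())} reads each")
print(f"declustered:  {len(decl_reads)} drives busy, {max(decl_reads.values())} reads each")
```

In this model the conventional layout puts 3,876 reads on each of four drives, while the declustered one spreads the same total across 19 drives at 816 reads apiece, a 4.75-fold cut in the load on any single spindle.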
GPFS Native RAID also has end-to-end checksum and is able to detect and correct off-track and lost or dropped disk writes. The funky new software RAID that comes out of PERCS also has a bunch of error diagnoses that operate in asynchronous mode while a RAID set keeps taking in and pumping out data.
If the media on a drive fails, it tries to verify the bad block and recover it. If a compute node can't get to a disk, it tries an alternate path to the drive. And if a drive is unresponsive, the drive can be power cycled by the RAID software. (Yes, I have tried turning it off and on again.)
GPFS Native RAID also has solid state drives in the hierarchy of storage, and uses SSDs for temporarily storing small files and maintaining system logs.
Where GPFS fits in IBM's high-performance storage
IBM has also come up with variants on parity striping called RAID-D2 and RAID-D3 that allow you to create an eight-drive RAID set and have two or three drives storing copies of the parity data for extra redundancy. You can also do the normal two-way mirroring with GPFS Native RAID or you can do three-way or four-way mirroring if you are really paranoid.
That mirroring, by the way, might be very handy for MapReduce big data munching jobs such as those run atop Hadoop if it is faster than the triplicate data copying done in the Hadoop Distributed File System.
The GPFS Storage Server is based on GPFS 3.5 and includes the RAID underpinnings, which were developed by coders at IBM's Almaden research lab and which are coded in C++, like the file system itself.
The Power 775 GPFS storage
The GPFS Storage Server based on the Power 775 nodes is a beast. The base configuration comes with one drawer of processors and memory that does compute jobs and that runs GPFS Native RAID. The custom 30-inch rack can have up to five disk drawers, with each having 384 SAS drives and 64 quad-lane SAS ports each.
Using 900GB SAS drives, a single rack can hold 1.73PB of raw disk, which formats down to 1.1PB with GPFS Native RAID. You can put up to two compute drawers in the system, for a total of 18 teraflops of number-crunching capacity.
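The rack capacity figures can be checked with a few lines of arithmetic. This sketch assumes decimal units (1PB = 1,000,000GB) and simply divides the quoted formatted capacity by the raw total; the constant names are illustrative.

```python
DRAWERS_PER_RACK = 5
DRIVES_PER_DRAWER = 384
DRIVE_GB = 900
USABLE_PB = 1.1  # formatted capacity quoted in the article

drives = DRAWERS_PER_RACK * DRIVES_PER_DRAWER
raw_pb = drives * DRIVE_GB / 1_000_000  # decimal units assumed
usable_fraction = USABLE_PB / raw_pb

print(f"drives per rack: {drives:,}")
print(f"raw capacity:    {raw_pb:.3f} PB")
print(f"usable fraction: {usable_fraction:.1%}")
```

That works out to 1,920 drives and 1.728PB raw, with a bit under two thirds of it left after GPFS Native RAID takes its cut for parity and spare space.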
IBM's presentation says that a single rack of the GPFS Storage Server based on the Power 775 technology can perform a "1TB Hadoop TeraSort in Less Than 3 Minutes!" There is no such thing as a TeraSort test, but there is a TeraByte Sort test, which you can see here.
Back in 2008, Yahoo! ran a TeraByte sort on a cluster of two-socket, four-core Xeon servers with 910 nodes that could sort a terabyte of data in 3.48 minutes running Hadoop. That was around ten and a half racks of standard 19-inch servers to the single 30-inch rack of iron, which is a lot of compression to be sure. But it was also four years ago using older iron on the x86 side.
Modern x86 machines would do better and in much less space. Those new ProLiant SL4500 "big data servers" from Hewlett-Packard, for instance, might do quite well here. Ignoring improvements in the Hadoop software itself over the past four years, it would only take eleven enclosures (a little more than a rack) of these SL4500s to match the 512 cores in the Power 775 setup tested, but if you want the same 1,920 disk drives, you need a little more than four racks of the SL4500s. In theory, that is. In practice, it might take a lot fewer disks to do the TeraByte Sort test.
If there is one drawback, it is that the Power 775 machines are not cheap. At list price, the Power 775 compute drawer costs $560,097 and you need to pay another $332,736 to activate all of the memory on the node.
The custom rack, which is necessary because of the intense water-cooling on the processors, switch, and memory, costs $294,400, while the loaded up disk drawer would cost you on the order of $473,755. So a rack with two compute drawers and five disk drawers has a hardware list price of $3.78m, and that is before you put any software on it.
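Here is a quick tally of the list prices quoted above for a rack with two compute drawers and five disk drawers. The variable names are illustrative, and the reading that the $3.78m figure leaves out the memory activation fee is an inference from the arithmetic, not something IBM has stated.

```python
COMPUTE_DRAWER = 560_097     # list price per compute drawer
MEMORY_ACTIVATION = 332_736  # per node, to activate all of the memory
RACK = 294_400               # custom water-cooled rack
DISK_DRAWER = 473_755        # loaded disk drawer, approximate

COMPUTE_DRAWERS = 2
DISK_DRAWERS = 5

base = COMPUTE_DRAWERS * COMPUTE_DRAWER + RACK + DISK_DRAWERS * DISK_DRAWER
with_memory = base + COMPUTE_DRAWERS * MEMORY_ACTIVATION

print(f"without memory activation: ${base:,}")
print(f"with memory activation:    ${with_memory:,}")
```

Without the memory activation the total is $3,783,369, which lines up with the $3.78m figure; switching on all the memory in both drawers pushes it to $4,448,841.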
IBM is not talking about how much it is charging for GPFS Native RAID, but at those hardware prices, it ought to be bundled in. You do save money on RAID controllers, of course. . . . In any event, these are not necessarily the prices IBM is charging for the GPFS Storage Server, but rather the list prices it was going to try to charge for Power 775 machines a year ago.
This may be a little bit rich for your blood, and so IBM has cooked up some System x server racks running GPFS Native RAID as storage servers. In fact, the 4.14 petaflops JuQueen BlueGene/Q supercomputer at the Forschungszentrum Juelich in Germany will be using a variant based on System x iron, not the Power 775 nodes.
GPFS Storage Servers based on System x servers
The System x variant of the GPFS storage server is based on the System x3650 M4 server and the "twin tailed" JBOD disk enclosure from IBM, which packs 60 disks in the back and the front of the chassis. The chassis is a variant of the enclosure used in the DCS3700 disk arrays, Matt Drahzal, technical computing software architect at IBM, tells El Reg.
The base GPFS Storage Server Model 24 has two servers and four enclosures with 232 disks and six SSDs in a total of 20U of space. IBM is offering customers either 2TB or 3TB SAS drives.
The Model 26 storage server has two System x3650 M4 servers plus six enclosures for a total of 348 drives and six SSDs in 28U of space. And the high-end HPC edition has six System x servers with 18 disk enclosures for a total of 1,044 drives and 18 SSDs. You can link into the system by InfiniBand or 10 Gigabit Ethernet, depending on the features you put in the servers behind the arrays.
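For a rough sense of scale across the three System x configurations, this sketch works out disks per enclosure and raw capacity with the 2TB and 3TB drive options. The dictionary layout and names are invented for illustration, and raw capacity says nothing about what is left after GPFS Native RAID formatting.

```python
# name: (servers, enclosures, disks, ssds), as quoted in the article
MODELS = {
    "Model 24": (2, 4, 232, 6),
    "Model 26": (2, 6, 348, 6),
    "HPC edition": (6, 18, 1044, 18),
}
DRIVE_OPTIONS_TB = (2, 3)

summary = {}
for name, (servers, enclosures, disks, ssds) in MODELS.items():
    disks_per_enclosure = disks // enclosures
    raw_tb = {size: disks * size for size in DRIVE_OPTIONS_TB}
    summary[name] = (disks_per_enclosure, raw_tb)
    print(f"{name}: {disks_per_enclosure} disks per enclosure, "
          f"{raw_tb[2]:,}TB raw with 2TB drives, {raw_tb[3]:,}TB raw with 3TB drives")
```

All three configurations come out at 58 spinning disks per enclosure, with raw capacity running from 464TB on a Model 24 with 2TB drives to 3,132TB on the HPC edition with 3TB drives.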
Pricing for the GPFS Storage Servers based on either Power 775 or System x3650 M4 nodes was not announced. ®