SGI readies first Project Mojo supers
Sticking it to x64 racks and blades
Supercomputer maker Silicon Graphics is just about finished with the initial designs of the "Project Mojo" dense-packed HPC machines, sources tell El Reg.
If you have been pondering how SGI was going to be able to cram a petaflops of supercomputing oomph into a rack of servers, as the company promised back in June it would be able to do within a year's time or so, the answer is something SGI is calling a stick as well as a new class of racks for the consolidated SGI and Rackable Systems server product lines. And the answer also appears to be that it can't quite cram a petaflops into a rack after all, but has nonetheless come up with a more compact design than traditional rack and blade servers offer.
Months before the US military's Defense Advanced Research Projects Agency issued its Exascale Challenge informally in June 2009 and formally in March 2010, the engineers at SGI, the result of the then newly merged supercomputer maker Silicon Graphics and hyperscale server maker Rackable Systems, were kicking around the idea of how they might cram a petaflops into a server rack. Even though SGI was measuring those petaflops in single-precision mode, which made the task a bit less difficult than going double-precision, Project Mojo would still require a different approach from just plunking GPUs or other kinds of co-processors onto existing SGI rack and blade servers.
"We started the Project Mojo design with the GPU and PCI-Express form factor that they come in and wrapped the CPUs around them," explains Bill Mannel, vice president of product marketing at SGI, "rather than starting with an existing server first and then adding GPUs."
SGI was vague back in June about how it would package the CPUs and their co-processor accelerators, but it did say it planned to use FireStream GPUs from Advanced Micro Devices and Tesla GPUs from Nvidia for floating point jobs and massively multicored mesh processors from Tilera to accelerate integer processing for the main CPUs in the Project Mojo system.
As it turns out, the stick of the Project Mojo system is a computing element that is nearly as long as the rack is deep - three feet - with the width and a little more than the height of a double-wide PCI-Express peripheral card. Mannel wouldn't say what processor is implemented on the stick, but it is possible that SGI has variants with both Intel Xeon and AMD Opteron processors. Considering that Project Mojo is an experimental system with limited sales on the front end, it is reasonable to conjecture that SGI will start with Xeons and expand into Opterons if there is customer demand.
Each stick has room for two double-width fanless GPU co-processors and two processor sockets. Each socket gets its own GPU in the floating point models; it is unclear how many Tilera chips will be in the integer models.
The Project Mojo systems will come in two racks and with two different stick capacities. The high-end box will use a modified version of the 24-inch blade racks employed by the Altix UV 1000 supers, which are based on Intel's Xeon 7500 processors and SGI's NUMAlink 5 shared memory interconnect, while another will be based on a new 19-inch rack, code-named "Destination," that aims to replace the 20 different racks that SGI inherited from the merger of SGI and Rackable Systems. The modified 24-inch Altix UV rack will hold 80 sticks, each with two CPUs and two double-wide GPU co-processors. The 19-inch Destination rack will be able to hold 63 sticks.
Assuming SGI can employ the AMD FireStream GPUs announced in late June, which are based on the "Cypress" GPUs, in the Project Mojo boxes, the larger 24-inch rack machine using the double-wide FireStream 9370 should hit 422 teraflops of aggregate single-precision GPU performance and the smaller 19-inch rack should come in at 332.6 teraflops. The CPUs won't add much to the processing capacity.
With Nvidia's double-wide, fanless Tesla M2070 GPUs, the Mojo stick will be rated at 2.06 teraflops in single precision, which adds up to 164.8 teraflops for the 24-inch rack and 129.8 teraflops for the 19-inch rack. The AMD FireStream 9370 has a huge single-precision advantage over Nvidia, but the gap all but vanishes in double-precision math, where the 9370 card weighs in at 528 gigaflops compared to 515 gigaflops for the Tesla M2070. As for double precision, the biggest Project Mojo system will only deliver 82.4 teraflops with the Tesla M2070s and 84.5 teraflops with the FireStream 9370s.
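For those who want to check the sums, here is a minimal Python sketch of the rack arithmetic. The stick counts and the double-precision card ratings come from the figures above; the single-precision ratings of 2,640 gigaflops for the FireStream 9370 and 1,030 gigaflops for the Tesla M2070 are assumptions worked backwards from the stick and rack totals, and the names `CARD_GFLOPS` and `rack_teraflops` are made up for illustration.

```python
# Peak per-card ratings in gigaflops. The double-precision (dp) numbers are
# the ones quoted in the text. The single-precision (sp) numbers are
# assumptions back-derived from the quoted totals: roughly 422 TF across
# 160 cards for the 9370, and 2.06 TF per two-card stick for the M2070.
CARD_GFLOPS = {
    "FireStream 9370": {"sp": 2640.0, "dp": 528.0},
    "Tesla M2070": {"sp": 1030.0, "dp": 515.0},
}

# Stick counts per rack, as given in the text.
STICKS_PER_RACK = {"24-inch": 80, "19-inch": 63}

GPUS_PER_STICK = 2  # two double-wide cards per stick


def rack_teraflops(card: str, precision: str, rack: str) -> float:
    """Aggregate peak GPU teraflops for one fully populated rack."""
    cards_in_rack = GPUS_PER_STICK * STICKS_PER_RACK[rack]
    return CARD_GFLOPS[card][precision] * cards_in_rack / 1000.0


if __name__ == "__main__":
    for card in CARD_GFLOPS:
        for precision in ("sp", "dp"):
            for rack in STICKS_PER_RACK:
                tf = rack_teraflops(card, precision, rack)
                print(f"{card:16s} {precision} {rack:8s} {tf:7.1f} TF")
```

These are peak ratings multiplied out, nothing more. Sustained performance on real codes would be lower, and the CPUs are left out entirely.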
It would make far more sense to put two processor sockets and four single-width fanless GPUs on the Mojo stick. Doing so using the AMD FireStream 9350 fanless GPU co-processors would yield 8 teraflops of oomph per stick, or 640 teraflops of aggregate GPU floating point performance at single precision in the 80-stick rack. With four single-width Tesla M2050s, the same 80-stick rack could deliver 329.6 teraflops of SP number crunching.
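The four-card what-if can be sketched the same way, and it also shows how far such a rack would still be from a petaflops. The per-card single-precision ratings here (2,000 gigaflops for the FireStream 9350, 1,030 gigaflops for the Tesla M2050) are assumptions implied by the per-stick and per-rack totals above, and the helper names are hypothetical.

```python
PETAFLOPS_IN_TERAFLOPS = 1000.0


def stick_tf(card_gflops: float, cards_per_stick: int) -> float:
    """Peak single-precision teraflops for one stick."""
    return card_gflops * cards_per_stick / 1000.0


def rack_tf(card_gflops: float, cards_per_stick: int, sticks: int = 80) -> float:
    """Peak single-precision teraflops for a rack full of sticks."""
    return stick_tf(card_gflops, cards_per_stick) * sticks


def shortfall_factor(rack_teraflops: float) -> float:
    """How many times over the rack would have to scale to reach a petaflops."""
    return PETAFLOPS_IN_TERAFLOPS / rack_teraflops


# Assumed per-card SP ratings in gigaflops, inferred from the totals in the text.
SINGLE_WIDE_CARDS = {"FireStream 9350": 2000.0, "Tesla M2050": 1030.0}

if __name__ == "__main__":
    for name, gflops in SINGLE_WIDE_CARDS.items():
        total = rack_tf(gflops, cards_per_stick=4)
        print(f"{name}: {stick_tf(gflops, 4):.2f} TF/stick, "
              f"{total:.1f} TF/rack, {shortfall_factor(total):.2f}x short of 1 PF")
```

Under these assumptions the four-card FireStream rack would still need to grow by a bit more than half again to reach a single-precision petaflops, and the Tesla rack would have to roughly triple.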
For now, SGI isn't saying what all the possible GPU co-processor configurations will be.
By the way, SGI never promised to have a petaflops of oomph out the door on day one, but merely said that there should be multiple ways of getting to a petaflops within one year's time. These initial Mojo sticks are just the first pass. That said, to hit even a petaflops at single precision, the Project Mojo sticks are going to have to more than double up GPU oomph.
An increase in GPU performance could prove to be problematic if AMD can't get the "Northern Islands" kickers to the current Cypress chips out the door by the end of this year, as planned. The rumors suggest that the Northern Islands GPUs being fabbed by Taiwan Semiconductor Manufacturing Corp have indeed hit delays, and that there is some stopgap GPU, possibly to be fabbed by GlobalFoundries, called Southern Islands. Both AMD and Nvidia have been tight-lipped about their future GPU roadmaps, but they will probably start talking them up at the GPU Technology Conference in San Jose next week.
The Project Mojo sticks using Tilera co-processors could cram a lot of integer punch. Tilera says that the performance of the Tile-Gx series of chips maxes out at 750 billion operations per second, which works out to five operations per clock on each of 100 cores (using what everyone assumes is a modified MIPS RISC core) running at the top-end 1.5 GHz speed these future Tile-Gx100 chips will hit in 2011.
Assuming you could get eight Tilera 100-core chips on a Project Mojo stick and the same 80 sticks in a 24-inch rack, that works out to 480 trillion integer operations per second. You need a little more than twice this density to do integer math on the analog of a petaflops in floating point performance, which is a quadrillion (10^15) integer calculations per second. Luckily, Tilera is working on a 200-core chip, due around 2013, which should help SGI hit that goal.
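The Tilera numbers can be multiplied out in a few lines of Python. This is only a sketch of the back-of-envelope math: the eight-chips-per-stick figure is the guess made above, not an SGI spec, and the function names are invented for the example.

```python
CORES_PER_CHIP = 100             # Tile-Gx100
CLOCK_HZ = 1_500_000_000         # top-end 1.5 GHz
OPS_PER_CLOCK_PER_CORE = 5       # implied by 750 billion ops/sec per chip
CHIPS_PER_STICK = 8              # assumption from the text, not an SGI spec
STICKS_PER_RACK = 80             # 24-inch rack
TARGET_OPS_PER_SEC = 10**15      # the integer analog of a petaflops


def chip_ops_per_sec() -> int:
    """Peak integer operations per second for one chip."""
    return CORES_PER_CHIP * CLOCK_HZ * OPS_PER_CLOCK_PER_CORE


def rack_ops_per_sec() -> int:
    """Peak integer operations per second for a fully populated rack."""
    return chip_ops_per_sec() * CHIPS_PER_STICK * STICKS_PER_RACK


def density_gap() -> float:
    """Factor by which rack throughput must grow to reach the target."""
    return TARGET_OPS_PER_SEC / rack_ops_per_sec()


if __name__ == "__main__":
    print(f"per chip: {chip_ops_per_sec() / 1e9:.0f} billion ops/sec")
    print(f"per rack: {rack_ops_per_sec() / 1e12:.0f} trillion ops/sec")
    print(f"gap to 10^15: {density_gap():.2f}x")
```

A 200-core chip at the same clock and the same operations per clock would double the per-rack figure to 960 trillion, just shy of the target, which is why "help" is the right word.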
As El Reg previously reported, Tilera is working with Taiwanese PC maker and server wannabe Quanta to put eight half-width server nodes, each based on a single TilePro64 64-core processor running at 900 MHz, into an SQ2 rack server with a 2U form factor. That 2U machine with eight server nodes is rated at 1.3 trillion integer operations per second. You would need about 770 of these 2U servers, or more than 36 racks if you assume standard 42U racks holding 21 of them apiece, to hit that quadrillion integer operations performance level, using these TilePro64 chips and that tray server design.
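Here is a rough sketch of the server and rack count for the Quanta box. Dividing a quadrillion by 1.3 trillion gives about 769.2, so it takes 770 whole servers to clear the bar. The rack count depends on rack height, which is not given for this machine; the 42U figure below is an assumption, as are the function names.

```python
import math

SERVER_OPS_PER_SEC = 1.3e12   # one 2U SQ2 server with eight TilePro64 nodes
TARGET_OPS_PER_SEC = 1e15     # a quadrillion integer operations per second
SERVER_HEIGHT_U = 2
RACK_HEIGHT_U = 42            # assumption: a standard full-height rack


def servers_needed() -> int:
    """Whole 2U servers required to reach the target throughput."""
    return math.ceil(TARGET_OPS_PER_SEC / SERVER_OPS_PER_SEC)


def racks_needed() -> int:
    """Whole racks required, filling each rack with nothing but servers."""
    servers_per_rack = RACK_HEIGHT_U // SERVER_HEIGHT_U
    return math.ceil(servers_needed() / servers_per_rack)


if __name__ == "__main__":
    print(f"servers: {servers_needed()}, racks: {racks_needed()}")
```

Real racks give up some slots to switches and power gear, so the count would only go up from there.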
Mannel says that SGI has the design for the Project Mojo machines more or less done now, and the prototypes of the supers will be on display at the SC10 supercomputing conference in November down in New Orleans. The sticks and racks will begin their initial shipments in December. ®