Cray lands $188m Blue Waters NCSA contract

Breaking 10 petaflops with Opteron-GPU tag team

SC11 If there was a reason that Cray CEO Peter Ungaro, who formerly ran IBM's high performance computing biz, was a little extra perky when the company announced its third quarter financials two weeks ago, it was not just because the SC11 supercomputing trade show was coming to Cray's hometown of Seattle this week. Or that Advanced Micro Devices was once again late with an Opteron launch and had hurt its numbers.

It was because Ungaro knew that the National Science Foundation had ordered the budget for the "Blue Waters" petaflops-scale super to be rejigged and sent out for rebid – and Cray was in the hunt to win what turns out to be a $188m deal.

Bill Kramer, deputy project director of the Blue Waters project at the National Center for Supercomputing Applications at the University of Illinois, tells El Reg that Blue Waters was not a specific system, but rather a complete set of infrastructure, including a data center, plus computation, networking, and storage and, most importantly given the software goals of the NCSA, code that scales to real-world petaflops performance.

"We wanted to change the computational elements, and we got approval after a peer review," says Kramer.

So out goes IBM's super-dense Power 775 cluster version of Blue Waters, which Big Blue pulled the plug on back in August because the Power7-based machine was more expensive to manufacture than the company thought when it won the competitive bidding for the project back in 2007.

In comes the largest system that Cray has built in its history – at least until the 10 to 20 petaflops "Titan" ceepie-geepie hybrid that Cray is building for Oak Ridge National Laboratory is installed, and then only if Titan is fully extended.

Like Titan, the Blue Waters system that Cray is building for NCSA will be a mix of standard XE6 Opteron blade server nodes and XK6 mixed CPU-GPU nodes, all linked together in a single network using the "Gemini" XE interconnect created by Cray. The XE6 blades will be equipped with eight sockets of 16-core "Interlagos" Opteron 6200 processors and the XK6 blades will have four Opteron sockets, each paired with one Nvidia GPU coprocessor.

As with Titan, the Blue Waters system that Cray has pitched to NCSA will be based on Nvidia's next-generation "Kepler" GPUs, which were expected by the end of this year when Nvidia outed its roadmap unexpectedly in September 2010 but which are clearly coming sometime in 2012, not in time for Christmas shopping this year.

The Kepler GPUs will offer a significant performance bump over the current 512-core "Fermi" graphics engines that are used in Nvidia's GeForce and Quadro video cards and Tesla coprocessors. They will be fabbed by Taiwan Semiconductor Manufacturing Corp using its 28 nanometer processes, which should allow Nvidia to put a lot more cores into a GPU as well as boost the clock speeds a bit. Nvidia hasn't said much about the Kepler GPU design, but it looks like notebook makers are going to get first stab at them early next year and that performance will be more than double what the Fermi GPUs do today. The top-end Fermi GPUs with all 512 cores running at 1.3GHz can deliver 665 gigaflops of double-precision floating point performance.

The specifications of the rebooted Blue Waters system are still a bit in flux, but here's what NCSA is getting for that $188m. The machine will have at least 235 XE6 cabinets, more than 30 XK6 cabinets, and more than 30 storage and I/O server cabinets. The resulting machine will have more than 49,000 Opteron 6200 processors and more than 380,000 cores, with another 3,000 Nvidia GPU coprocessors packing a hell of a floating point wallop as well. (If the Keplers offer twice the double-precision floating point performance as the Fermis, then those 3,000 GPU coprocessors will account for about 4 petaflops of aggregate oomph.) The plan is to put 4GB of main CPU memory per core into the machine, for a total of more than 1.5PB.
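The back-of-envelope sums in that paragraph can be checked with a few lines of Python. This is only a rough sketch: it takes 665 gigaflops as the double-precision peak of a top-end Fermi, assumes (as the parenthetical does) that Kepler merely doubles it, uses the rounded counts of 3,000 GPUs and 380,000 cores, and works in decimal units. The function and constant names are made up for illustration.

```python
# Rough check of the Blue Waters GPU and memory figures.
# All inputs are rounded numbers from the article; the 2x Kepler
# speedup is an assumption, not an Nvidia specification.

FERMI_DP_GFLOPS = 665.0   # top-end Fermi, double precision, per GPU
KEPLER_SPEEDUP = 2.0      # assumed: Kepler doubles Fermi
GPU_COUNT = 3_000         # "another 3,000 Nvidia GPU coprocessors"
CORE_COUNT = 380_000      # "more than 380,000 cores"
MEM_PER_CORE_GB = 4       # planned main memory per core

GFLOPS_PER_PETAFLOPS = 1e6
GB_PER_PB = 1e6           # decimal units


def gpu_peak_petaflops(gpus: int, gflops_per_gpu: float, speedup: float) -> float:
    """Aggregate peak of the GPU partition, in petaflops."""
    return gpus * gflops_per_gpu * speedup / GFLOPS_PER_PETAFLOPS


def total_memory_pb(cores: int, gb_per_core: float) -> float:
    """Total CPU main memory, in petabytes."""
    return cores * gb_per_core / GB_PER_PB


gpu_pf = gpu_peak_petaflops(GPU_COUNT, FERMI_DP_GFLOPS, KEPLER_SPEEDUP)
mem_pb = total_memory_pb(CORE_COUNT, MEM_PER_CORE_GB)

print(f"GPU partition peak: {gpu_pf:.2f} petaflops")
print(f"CPU main memory:    {mem_pb:.2f} PB")
```

On those inputs the GPU partition comes out just under 4 petaflops and the memory just over 1.5PB, which squares with the figures quoted.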

The XE6 and XK6 nodes will all be linked to each other using the XE interconnect through a 3D torus, which will require over 9,000 wires to link it all together. (That's around 4,500 kilometers of wire if you strung it all out.) The Blue Waters machine is expected to have 11.5 petaflops of aggregate peak number-crunching performance.

Kramer says that the plan is to run the 16-core Opteron 6200 processors with half the cores sleeping a lot of the time, giving the remaining eight cores full access to the floating point units that each pair of integer cores shares in the Bulldozer modules. By doing this, NCSA will be able to run the cores in a Turbo Core mode that can add anywhere from 600MHz to 1.3GHz of extra clocks to the chip, depending on the Opteron 6200 processor that NCSA chooses for the machine.

"We studied this quite a bit, but for a lot of our applications, it makes sense to run it like an eight-core with dual 128-bit floating point performance," says Kramer.

Cray is also building a storage subsystem for the Blue Waters machine, which will run the Lustre file system and deliver 25PB of usable disk capacity. It is not clear how that local storage will be linked into the system, but that 25PB of capacity will be accessible through more than 1TB/sec of aggregate bandwidth from the cluster. There is an additional 500PB of nearline storage also being added to the machine, which will have 100GB/sec of bandwidth into the cluster. The external network coming into the Blue Waters machine will have 100Gb/sec of bandwidth at first, and will eventually scale up to 300Gb/sec.
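To put those storage numbers in perspective, here is a rough Python sketch of how long it would take to stream each tier end to end at the quoted rates. It assumes decimal units and that the full quoted bandwidth is sustained for the whole transfer, which real workloads are unlikely to manage; the helper name is hypothetical.

```python
# How long to read each storage tier once, at the quoted bandwidth?
# Assumes decimal units and perfectly sustained throughput.

GB_PER_PB = 1_000_000
GB_PER_TB = 1_000


def stream_time_seconds(capacity_pb: float, bandwidth_gb_per_sec: float) -> float:
    """Seconds needed to move capacity_pb at bandwidth_gb_per_sec."""
    return capacity_pb * GB_PER_PB / bandwidth_gb_per_sec


# 25PB of Lustre disk at 1TB/sec of aggregate bandwidth
disk_seconds = stream_time_seconds(25, 1 * GB_PER_TB)
# 500PB of nearline storage at 100GB/sec
nearline_seconds = stream_time_seconds(500, 100)

print(f"Disk tier:     {disk_seconds / 3600:.1f} hours")
print(f"Nearline tier: {nearline_seconds / 86400:.1f} days")
```

At those rates the disk tier could be read through in about seven hours, while the nearline tier would take nearly two months.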

The Blue Waters machine will run the Cray Linux Environment, the company's tweak on SUSE Linux 11 that also has an Ethernet network compatibility mode that allows applications compiled for Linux machines clustered using Ethernet to run unmodified and in emulation mode on top of the XE interconnect.

The Blue Waters machine will be delivered by Cray in phases over the next six to nine months, with the initial delivery early next year and early program deployment by the middle of 2012. The plan is to have the full machine operational by the end of 2012, which is more or less the same timing that NCSA expected with the IBM Power7 Blue Waters behemoth. ®
