Nvidia's Fermi hits flop-hungry challengers
HPC players tool up
Nvidia's Fermi graphics coprocessors have begun shipping through its OEM partner channel, with a slew of tier-two system makers hoping the flop-happy GPUs will give them a competitive edge against the established players in the HPC server racket.
The Fermi graphics cards and the GPU coprocessors built from the same chips were both previewed last November at the SC09 supercomputing conference. The Fermi chips previewed had 512 cores, but for reasons that Nvidia has not explained - and which probably involve chip yields and heating issues - the GeForce graphics cards and Tesla 20 coprocessors that have started shipping have only 448 working cores. And that means their floating-point performance is a little lower than expected.
The Tesla coprocessors come in three different form factors, which was not apparent at the launch last November. The C series GPU coprocessors have fans on them and plug into workstations and personal supercomputers (basically, an x64 workstation on steroids); the M series are fanless units intended to be used in hybrid CPU-GPU setups within the same chassis; and the S series are GPU appliances that plug into servers through external PCI Express links and pack up to four GPUs into a 1U chassis.
Back in November, Nvidia was saying that the C2050 and the C2070 - with initial ratings of 520 and 630 gigaflops doing double-precision math and price tags of $2,499 and $3,999, respectively - would use the 512-core Fermi chips. In early April, Nvidia started shipping the C2050, but with only 448 cores and a rating of 515 gigaflops at double precision, and the C2070 was pushed out to the third quarter. It's a fair guess that, with the number of cores dropping by 12.5 per cent in the C2050 but the aggregate performance of the GPU coprocessor dropping by only one per cent, Nvidia cranked up the clock speed to make up for the lower GPU core count.
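To put a number on that guess, assume - as a rough model, not an Nvidia-published formula - that peak double-precision throughput simply scales as core count times clock speed. The ratings quoted above then imply roughly a 13 per cent clock bump:

```python
# Back-of-the-envelope check: if peak DP gigaflops ~ cores x clock (GHz),
# what clock would each rating imply? (Rough model, not Nvidia's spec sheet.)

promised_cores, promised_gflops = 512, 520.0   # C2050 as pitched last November
shipped_cores, shipped_gflops = 448, 515.0     # C2050 as it shipped in April

implied_promised_clock = promised_gflops / promised_cores   # ~1.02 GHz
implied_shipped_clock = shipped_gflops / shipped_cores      # ~1.15 GHz

print(f"Implied clock at 512 cores: {implied_promised_clock:.2f} GHz")
print(f"Implied clock at 448 cores: {implied_shipped_clock:.2f} GHz")
print(f"Clock bump needed: {implied_shipped_clock / implied_promised_clock - 1:.0%}")
```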
There were to be two variations of the S series GPU appliances: the S2050, using the C2050 GPUs, rated at 2.08 teraflops and costing $12,995; and the S2070, using the faster C2070 GPUs, rated at 2.52 teraflops and costing $18,995. The S series boxes aren't shipping yet; when they do, they will be based on the 448-core C series GPUs and will likely provide a little less floppy oomph. Sources at Nvidia say that the S series GPU appliances are still on track for delivery this quarter.
Nvidia started peddling the Fermi GPUs in its GeForce graphics card lineup during the first quarter.
The news today is that the Tesla M2050 embedded GPU coprocessor, which is based on the C2050 card as the name suggests and which is rated at the same 515 gigaflops of double-precision and 1.03 teraflops single-precision floating point performance, has begun shipping through OEM server partners. Appro and Super Micro were the first to announce systems using the M series GPUs. (You have to hunt around the Nvidia site to find the M2050 spec sheet, so let me save you the trouble.)
Oak Ridge boys
Nvidia planned to host a big shindig in Washington DC to kick off the M series, with Oak Ridge National Laboratory talking about how hybrid CPU-GPU systems were the wave of the future, and with Georgia Tech - which has a project called Keeneland for creating applications that run on hybrid CPU-GPU machines - giving presentations.
Oak Ridge is, of course, one of the first big customers for the Fermi GPUs. Last October, before the Fermi GPU coprocessors were unveiled by Nvidia at SC09 but after the Fermi chips on which they are based were detailed, the Cray XT "Jaguar" massively parallel Opteron super at Oak Ridge weighed in at 1.06 petaflops using the Linpack Fortran benchmark test as a gauge. Shortly thereafter, the upgraded Jaguar machine was pushed to 1.76 petaflops by the addition of new Opteron cores.
The only reason this matters is that in early October last year, Oak Ridge said it would be building a hybrid CPU-GPU super based on Nvidia cards with at least ten times the oomph of Jaguar - most likely meaning a machine that breaks the 10 petaflops barrier but not the 20 petaflops barrier, since ten times Jaguar's 1.06 to 1.76 petaflops works out to somewhere between 10 and 20 petaflops. Oak Ridge was intentionally vague, perhaps because it was unsure of what the performance of such a hybrid machine might be.
There is also a rumor going around that Oak Ridge was unhappy about the performance of the Nvidia Tesla 20 GPUs and has canceled the project, but Nvidia says this is untrue. Oak Ridge has yet to say exactly what it is building.
AC: it's the same chip as the GTX480 games card that is claimed to burn a suspiciously similar 250W but actually gobbles 320W; with a similar clock and more memory, it's not going to eat less. Your comparison is made even trickier because the Xeons can do a lot more than a graphics processor; the abilities of these things are pretty narrow. Most proposed GPU-assisted high performance computers need a lot of standard CPUs to keep the GPUs busy. It's pretty nifty, but it's no miracle. The ATI equivalents get more raw flops per watt - but do even less, so they need more CPU support. Same tradeoff (if you ignore coding difficulties, which you can't). Ye cannae change the laws of physics, Captain!
nVidia will continue to rule GPGPU for a while because they have a (fairly) widely known and (kinda) not too difficult language, but neither that nor this card will help them break out of fairly narrow applications where the performance per watt advantage creeps above marginal.
Graham: The transputer was a brilliant piece of engineering at totally the wrong time. It's like inventing the Ferrari in 1800: you'd get far more improvement in travel speeds from improving the roads, and that would have to happen before the car was worth having. NOW we need it back, I agree... although I'm not sure I agree with your assessment of computer science; it's full of great ideas for parallelism in an ideal world but hasn't got much chance of meshing with the existing mess most people have to deal with...
It takes around 8.5 Xeon X5550 chips to equal the 630 Gflops of a C2070.
These Nehalem chips consume ~95 watts apiece, for a total of over 800 watts.
So the NVIDIA card has less than a quarter the power consumption of an equivalent Intel solution.
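For what it's worth, here is the comment's arithmetic as a minimal Python sketch. The 8.5-chips figure is the commenter's own claim, and the GPU board power is an assumption (the roughly 250W claim cited in the reply further up), not an official Nvidia number:

```python
# Reproducing the arithmetic in the comment above, using the commenter's own
# figures. The GPU board power is an assumption (the ~250W claim cited in the
# reply further up), not an official Nvidia number.

xeons_per_c2070 = 8.5        # commenter's claim: 8.5 Xeon X5550s ~ one 630 Gflops C2070
watts_per_xeon = 95.0        # X5550 power figure quoted in the comment
assumed_gpu_watts = 250.0    # assumed draw for the Tesla card

xeon_total_watts = xeons_per_c2070 * watts_per_xeon    # ~807 W
fraction = assumed_gpu_watts / xeon_total_watts        # ~0.31

print(f"Power for the equivalent Xeons: {xeon_total_watts:.0f} W")
print(f"GPU draw as a fraction of that: {fraction:.2f}")
```

Whether that comes out at a quarter or closer to a third depends entirely on what you assume the Tesla card actually draws, which is exactly the point the reply further up takes issue with.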
Never mind FPU accelerators - how's about the whole parallel thing? Anyone else remember the Transputer? The company who sponsored me at uni (GEC Alsthom Transmission and Distribution Power Electronic Systems Limited, to give it its full title) used them, and they were pretty damn cool.
The Occam2 language in particular was a neat idea, since it gave native support for parallelism. No messing around with the detail of threads and stuff - you just said "PAR", and the paths under that statement ran in parallel. If you happened to be at the top level then those paths got spread over the separate cores, or at lower levels it timesliced, but all that was done for you. And comms between processors was equally seamless - from a software PoV it just looked the same as running over separate cores.
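As a rough modern comparison, here is a minimal Python sketch (with made-up function names, purely for illustration) of the two-branches-in-parallel idea that occam's PAR expressed in a single keyword:

```python
# Very loose modern analogue of occam's PAR: run two branches concurrently and
# wait for both before carrying on. In occam this was one keyword; here it takes
# explicit thread plumbing. The function names are made up for illustration.
from concurrent.futures import ThreadPoolExecutor

def read_sensor():
    return 42            # stand-in for one branch under the PAR

def update_display():
    return "drawn"       # stand-in for the other branch

# occam:                      Python equivalent:
#   PAR
#     read.sensor ()
#     update.display ()
with ThreadPoolExecutor() as pool:
    branches = [pool.submit(read_sensor), pool.submit(update_display)]
    results = [b.result() for b in branches]   # both branches have finished here

print(results)
```

The occam version also made communication between the branches look the same whether they ran on one processor or several; the sketch above doesn't attempt that.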
For a little while it was the fastest thing around. Trouble is that like all British technology, no-one was prepared to put money into it. So it drifted backwards until a single 486SX25 could comfortably blow away a bunch of Transputer cores, and that was that.
Now of course we're back where we started, bcos single cores have basically run out of speed-up potential. And of course, ever since Win95, programs have used multiple threads to do stuff in parallel with time-slicing. So everyone has the joy of managing threads for themselves without any decent techniques for tracking deadlocks and livelocks, when back in the late 80s and early 90s anyone using Transputers had already solved this problem.
This is why I differentiate between computer science and software engineering. Engineering is about solving problems and keeping the solutions around, so you don't have to reinvent it every time. A civil engineer doesn't have to go back to first principles for a bridge, bcos after 1861 the principles for building a strong bridge have been pretty well ingrained in the profession. The trouble with computer science is that in constantly chasing the bleeding edge, they seem to have absolutely no idea of the history behind where they are, or following patterns in this. So single-core code gets hacked to explicitly support dual-core when dual-core processors come around; and then someone releases a quad-core processor and the CompSci boys need to hack their code again for "if(cores==4)". The idea that there's a pattern involved - number of cores increasing - seems to pass them by, as does the fact that back in the 80s there was a ton of work done on load-sharing across parallel processors.