Nvidia previews next-gen Fermi GPUs

SC09 Graphics chip maker and soon-to-be big-time HPC player Nvidia raised the curtain a little higher on its next generation of graphics co-processors at the SC09 supercomputing trade show in Portland, Oregon, this week, and the GPU co-processors, aimed at personal supers and massive clusters alike, were arguably the stars of the show.

The next-generation GPU co-processors were developed under the "Fermi" code-name; El Reg previewed the details last month, when the chips featured in a future hybrid supercomputer deal at Oak Ridge National Laboratory.

(Oak Ridge is, of course, home to the Jaguar massively parallel Opteron-Linux super built by Cray and currently the top of the Top 500 super charts. While Jaguar does not use GPU-co-processors, it very well could before too long; Oak Ridge has been vague about its GPU plans.)

The video card versions of the Fermi chips will be sold as the GeForce 300M line, as we reported at the end of October. And at the SC09 event in Portland, Nvidia announced that it is keeping the Tesla brand for its next-generation GPU co-processors for workstations and servers. The Fermi chips will be sold under the Tesla 20 brand, as it turns out.

Flavors - and then some

According to Andy Keane, general manager of Tesla supercomputing at Nvidia, the Tesla 20 cards will come in two flavors, and the company will sell co-processor systems that can plug right into HPC clusters and link to servers through PCI-Express 2.0 links - all at around 130 watts apiece. Keane bristles at anyone who claims that a fully burdened heat budget for a server - not just the microprocessor, but its memory controller (if it is not integrated), its chipset, and its memory - will be any lower.

With the Fermi family of GPUs, Nvidia is adding L1 and L2 caches to the co-processors and is putting ECC memory scrubbing on the GDDR5 video memory on the card as well as on accesses to external server memory. This ECC support, as it turns out, is as important as anything else in the chip if you want to sell GPUs to nuke labs.

The labs can't have memory errors crashing an application that may take weeks or months to run, and they have to trust the answers they get. (IBM's Cell co-processors, used in the number two "Roadrunner" Opteron-Linux supercomputer installed at Los Alamos National Laboratory, have error correction for their memory. But as far as I can ascertain, AMD's Radeon graphics cards and FireStream GPUs do not have ECC.)
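For the curious, the CUDA runtime can tell you whether a board is scrubbing its memory. A minimal sketch of ours, with the caveat that the ECCEnabled property field only turns up in the Fermi-era CUDA 3.x toolkits:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            // ECCEnabled is non-zero when the board is error-correcting
            // its memory; pre-Fermi Teslas will always report it off.
            printf("Device %d (%s): ECC %s\n",
                   dev, prop.name, prop.ECCEnabled ? "on" : "off");
        }
        return 0;
    }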

The Fermi chip has 512 cores, a little more than twice as many as the first Tesla GPUs. The Fermis bundle 32 cores together into a streaming multiprocessor that has 64 KB of shared L1 cache. All 512 cores have access to a shared 768 KB L2 cache, and they support the IEEE 754-2008 double precision floating point standard.
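To put that layout in concrete terms, here is a minimal sketch (ours, not Nvidia's) of the kind of double-precision kernel the design invites: each thread block lands on one streaming multiprocessor and stages its partial sums through that on-chip shared memory.

    #include <cuda_runtime.h>

    #define BLOCK 256  // illustrative block size, a multiple of the 32-core SM width

    // Sum-reduce a double-precision array; each block stages partial
    // sums in per-SM shared memory, then thread 0 writes the result.
    __global__ void reduce_sum(const double *in, double *out, int n) {
        __shared__ double tile[BLOCK];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        tile[tid] = (i < n) ? in[i] : 0.0;
        __syncthreads();
        // Tree reduction within the block, IEEE 754-2008 doubles throughout.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) tile[tid] += tile[tid + stride];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = tile[0];
    }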

The Fermi chip can, in theory, address up to 1 TB of memory, but the Tesla C2050 GPU co-processor has 3 GB of GDDR5 memory and double precision floating point performance of 520 gigaflops; it has a list price of $2,499. The Tesla C2070 GPU has 6 GB of GDDR5 memory and is rated at 630 gigaflops; it costs $3,999. The bang for the buck is best with the smaller unit, which weighs in at $4.81 per gigaflops compared to the $6.35 per gigaflops of the faster GPU.

The Nvidia Tesla 20 series appliances cram four GPUs into a 1U form factor, with four links out to server nodes. The S2050 uses the slower C2050 GPUs, is rated at 2.08 teraflops, and will cost $12,995 when it ships. That works out to $6.25 per gigaflops, so you are paying an extra $2,999 for the server that wraps around four Tesla 20 GPUs.

A 1U appliance with four of the faster C2070 GPUs delivers 2.52 teraflops of double-precision floating point performance and costs $18,995, or $7.54 per gigaflops. By comparison, a prior-generation C1060 GPU with 240 cores and delivering only 78 gigaflops at double precision cost $1,699 when it started shipping in June, or about $22 per gigaflops. (No one really cares about single precision and I am ignoring it.)
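If you want to run the bang-for-the-buck sums yourself, the arithmetic is nothing fancier than list price divided by rated double-precision gigaflops. A few lines of host-side C++ with the figures copied from above (the S2070 badge on the faster appliance is our guess, not an announced name):

    #include <cstdio>

    int main() {
        struct Box { const char *part; double price, gflops; };
        const Box lineup[] = {
            {"Tesla C2050",  2499.0,  520.0},
            {"Tesla C2070",  3999.0,  630.0},
            {"Tesla S2050", 12995.0, 2080.0},
            {"Tesla S2070", 18995.0, 2520.0},
            {"Tesla C1060",  1699.0,   78.0},  // prior generation
        };
        for (int i = 0; i < 5; ++i)
            printf("%-11s $%5.2f per DP gigaflops\n",
                   lineup[i].part, lineup[i].price / lineup[i].gflops);
        return 0;
    }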

The Tesla 20 GPU co-processors and the appliances based on them will be available in the second quarter of 2010, says Keane. The GeForce graphics cards based on the same GPU chips will start rolling out in the first quarter.

GPU for you, sir

There's some other secret sauce in the Fermi GPUs that is going to get HPC nerds thinking about using them.

For one thing, the Fermi GPUs will support Nvidia's C++ compiler, not just C. Keane dodged the question of exactly when C++ support would be ready, and laughed at the idea that Intel's own C++ or Fortran compilers would be ported to CUDA. The Portland Group said this week at SC09 that it has tweaked its popular Fortran compiler to work within the CUDA parallel programming environment that Nvidia created for its graphics cards and co-processors; this Fortran has been in beta testing for about three months.
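As a taste of what C++ on the device buys, consider a templated kernel; nvcc has tolerated templates for a while, but fuller C++ support is what makes generic device code like this sketch of ours a first-class way to work:

    #include <cuda_runtime.h>

    // One templated SAXPY-style kernel compiles for float or double,
    // instead of the duplicated per-type kernels plain C requires.
    template <typename T>
    __global__ void axpy(T a, const T *x, T *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // A host-side launch for doubles might look like:
    //   axpy<double><<<(n + 255) / 256, 256>>>(2.0, d_x, d_y, n);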

There are projects that have pulled the CUDA libraries into the popular Matlab tool as well as the R and Python programming languages, and Java applications have been able to bind into the CUDA environment for about a year. The CUDA toolkit already snaps into the open source Eclipse and Microsoft Visual Studio development tools.

Another bit of secret sauce, also revealed at SC09, is a set of new InfiniBand and Tesla drivers that InfiniBand chip maker Mellanox and Nvidia have cooked up to streamline the movement of data from the InfiniBand ports to the CPU's main memory and then down through the PCI-Express bus to the GPU card.

According to Keane, the way it works now, data comes in over InfiniBand and lands in main memory; before it can be moved down to the GPU, it is copied to a second buffer, and that copy is what actually gets moved. The driver changes allow the data in memory to be moved down to the GPU in one fell swoop, and in early tests on clusters that pair GPUs with InfiniBand, Nvidia and Mellanox have demonstrated a 30 per cent speedup for applications.
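The staging copy Keane describes is the same one CUDA programmers already dodge with page-locked host buffers; here is a minimal sketch of the idea using the stock runtime API (the new Mellanox plumbing would sit underneath something like this):

    #include <cuda_runtime.h>

    // Pinned (page-locked) host memory lets the GPU's DMA engine pull
    // data straight down the PCI-Express bus, with no staging copy.
    void transfer_sketch(size_t bytes) {
        void *host_buf; void *dev_buf;
        cudaHostAlloc(&host_buf, bytes, cudaHostAllocDefault);
        cudaMalloc(&dev_buf, bytes);

        // ... InfiniBand receive lands its payload in host_buf ...

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        // One asynchronous DMA; the CPU goes back to directing traffic.
        cudaMemcpyAsync(dev_buf, host_buf, bytes,
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);

        cudaStreamDestroy(stream);
        cudaFree(dev_buf);
        cudaFreeHost(host_buf);
    }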

Ideally, said Keane, you want the data to move directly from InfiniBand to the PCI-Express bus and on out to the GPU memory, where the data processing actually takes place. The CPU is relegated to traffic cop, and only gets data in its memory when the application requires it to do some processing. This capability is not available yet, and Keane didn't say when to expect it, either.

Finally, Nvidia this week released the beta of the CUDA Toolkit 3.0, which exploits the Fermi GPU's features. ®
