Nvidia previews next-gen Fermi GPUs

The supermodels of HPC: hot, and worth it

SC09 Graphics chip maker and soon-to-be big-time HPC player Nvidia raised the curtain a little higher on its next generation of graphics co-processors at the SC09 supercomputing trade show in Portland, Oregon, this week, and it is arguable that the GPU co-processors aimed at personal supers and massive clusters alike were the stars of the show.

The next-generation GPU co-processors were developed under the "Fermi" code-name. El Reg previewed the details last month, and the chips featured in a future hybrid supercomputer deal at Oak Ridge National Laboratory.

(Oak Ridge is, of course, home to the Jaguar massively parallel Opteron-Linux super built by Cray and currently the top of the Top 500 super charts. While Jaguar does not use GPU co-processors, it very well could before too long; Oak Ridge has been vague about its GPU plans.)

The video card versions of the Fermi chips will be called the GeForce 300 M line, as we reported at the end of October. And at the SC09 event in Portland, Nvidia announced that it is keeping the Tesla brand for its next-generation GPU co-processors for workstations and servers. The Fermi chips will be sold under the Tesla 20 brand, as it turns out.

Flavors - and then some

According to Andy Keane, general manager of Tesla supercomputing at Nvidia, the Tesla 20 cards will come in two flavors, and the company will also sell co-processor systems that can plug right into HPC clusters and link to servers through PCI-Express 2.0 links - and do so at around 130 watts. Keane bristles at anyone who claims that a fully burdened heat budget for a server - not just a microprocessor, but its memory controller (if it is not integrated), its chipset, and its memory - will be any lower.

With the Fermi family of GPUs, Nvidia is adding L1 and L2 caches to the co-processors and is putting ECC memory scrubbing on the internal GDDR5 video memory on the card as well as on accesses to external server memory. This ECC support, as it turns out, is as important as anything else in the chip if you want to sell GPUs to nuke labs.

They can't have memory errors crash an application that may take weeks or months to run and they have to trust the answers they get. (IBM's Cell co-processors, used in the number two "Roadrunner" Opteron-Linux supercomputer installed at Los Alamos National Laboratory, have error correction for their memory. But as far as I can ascertain AMD's Radeon graphics cards and Firestream GPUs do not have ECC.)

The Fermi chip has 512 cores, which is a little more than twice the cores of the first Tesla GPUs. The Fermis bundle 32 cores together into a streaming multiprocessor that has 64 KB of shared L1 cache. All 512 cores have access to a shared 768 KB L2 cache, and they support the IEEE 754-2008 double precision floating point standard.

The Fermi chip can, in theory, address up to 1 TB of memory, but the Tesla C2050 GPU co-processor has 3 GB of GDDR5 memory and double precision floating point performance of 520 gigaflops; it has a list price of $2,499. The Tesla C2070 GPU has 6 GB of GDDR5 memory and is rated at 630 gigaflops; it costs $3,999. The bang for the buck is best with the smaller unit, which weighs in at $4.81 per gigaflops compared to the $6.35 per gigaflops of the faster GPU.
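The bang-for-the-buck figures above are simply list price divided by rated double-precision gigaflops. A minimal sketch of that arithmetic, using only the prices and ratings quoted in this article (the dictionary layout and the function name `dollars_per_gigaflops` are illustrative choices, not anything from Nvidia):

```python
# List price (USD) and rated double-precision gigaflops, as quoted in the article.
cards = {
    "Tesla C2050": (2499, 520),
    "Tesla C2070": (3999, 630),
}


def dollars_per_gigaflops(price_usd, gigaflops):
    """Return list price per rated gigaflops, rounded to the cent."""
    return round(price_usd / gigaflops, 2)


for name, (price, gflops) in cards.items():
    print(f"{name}: ${dollars_per_gigaflops(price, gflops):.2f} per gigaflops")
```

These are peak ratings at list price, so the ratio says nothing about sustained performance on a real application.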

The Nvidia Tesla 20 series appliances cram four GPUs into a 1U form factor, with four links out to server nodes. The S2050 uses the slower C2050 GPUs, is rated at 2.08 teraflops, and will cost $12,995 when it ships. That works out to $6.25 per gigaflops, so you are paying an extra $2,999 over the $9,996 list price of four C2050 cards for the server that wraps around them.

A 1U appliance with four of the faster C2070 GPUs delivers 2.52 teraflops of double-precision floating point performance and costs $18,995, or $7.54 per gigaflops. By comparison, a prior-generation C1060 GPU with 240 cores and delivering only 78 gigaflops at double precision cost $1,699 when it started shipping in June, or about $22 per gigaflops. (No one really cares about single precision and I am ignoring it.)
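The appliance numbers can be checked the same way: multiply the per-card rating by four, divide the appliance price by the result, and subtract four card list prices to see what the 1U enclosure adds. A rough sketch, assuming the figures quoted in this article and keying the appliances by the card they contain (the names `appliance_summary` and `GPUS_PER_APPLIANCE` are made up for illustration):

```python
# Figures quoted in the article; appliances are keyed by the GPU card they contain.
CARD_PRICE_USD = {"C2050": 2499, "C2070": 3999}
CARD_GIGAFLOPS = {"C2050": 520, "C2070": 630}   # double precision
APPLIANCE_PRICE_USD = {"C2050": 12995, "C2070": 18995}
GPUS_PER_APPLIANCE = 4


def appliance_summary(card):
    """Aggregate rating, price per gigaflops, and premium over four bare cards."""
    gigaflops = GPUS_PER_APPLIANCE * CARD_GIGAFLOPS[card]
    price = APPLIANCE_PRICE_USD[card]
    return {
        "teraflops": gigaflops / 1000,
        "usd_per_gigaflops": round(price / gigaflops, 2),
        "premium_usd": price - GPUS_PER_APPLIANCE * CARD_PRICE_USD[card],
    }


for card in APPLIANCE_PRICE_USD:
    print(card, appliance_summary(card))
```

The premium term is the part worth watching, since it isolates what the chassis, power, and host links cost on top of the GPUs themselves.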

The Tesla 20 GPU co-processors and the appliances based on them will be available in the second quarter of 2010, says Keane. The GeForce graphics cards based on the same GPU chips will start rolling out in the first quarter.

GPU for you, sir

There's some other secret sauce in the Fermi GPUs that is going to get HPC nerds thinking about using GPUs.

For one thing, they will support Nvidia's C++ compiler, not just C. Keane dodged the question of exactly when C++ would be ready, and laughed at the idea that Intel's own C++ or Fortran compilers would be ported to CUDA. The Portland Group said this week at SC09 that it has tweaked its popular Fortran compiler to work within the CUDA parallel programming environment that Nvidia created for its graphics cards and co-processors; this Fortran has been in beta testing for about three months.

There are projects that have pulled the CUDA libraries into the popular Matlab tool as well as the R and Python programming languages, and Java applications have been able to bind to the CUDA environment for about a year. The CUDA tool already snaps into the open source Eclipse and Microsoft Visual Studio development tools.

Another bit of secret sauce revealed at SC09 is a set of new InfiniBand and Tesla drivers that InfiniBand chip maker Mellanox and Nvidia have cooked up to streamline the movement of data from the InfiniBand ports, to the CPU's main memory, and then down through the PCI-Express bus to the GPU card.

According to Keane, the way it works now, data comes in over InfiniBand, works its way into main memory and is copied; before it is moved down to the GPU, it is copied again and that copy is what is moved. The driver changes allow the data that lands in main memory to be moved down to the GPU in one fell swoop, without the intermediate copies. In early tests on clusters that use GPUs and InfiniBand together, Nvidia and Mellanox have been able to demonstrate a 30 per cent speedup for applications.

Ideally, said Keane, you want the data to move directly from InfiniBand to the PCI-Express bus and on out to the GPU memory, where the data processing actually takes place. The CPU is relegated to a traffic cop, and only gets data in its memory when the application requires it to do some processing. This capability is not available yet, and Keane didn't say when to expect it, either.

Finally, Nvidia this week released the beta of the CUDA Toolkit 3.0, which exploits the Fermi GPU's features. ®
