Nvidia previews next-gen Fermi GPUs

The supermodels of HPC: hot, and worth it


SC09 Graphics chip maker and soon-to-be big-time HPC player Nvidia raised the curtain a little higher on its next generation of graphics co-processors at the SC09 supercomputing trade show in Portland, Oregon, this week, and it is arguable that the GPU co-processors, aimed at personal supers and massive clusters alike, were the star of the show.

The next-generation GPU co-processors were developed under the "Fermi" code-name; El Reg previewed the details last month, and the chips have already featured in a future hybrid supercomputer deal at Oak Ridge National Laboratory.

(Oak Ridge is, of course, home to the Jaguar massively parallel Opteron-Linux super built by Cray and currently the top of the Top 500 super charts. While Jaguar does not use GPU-co-processors, it very well could before too long; Oak Ridge has been vague about its GPU plans.)

The video card versions of the Fermi chips will be sold as the GeForce 300M line, as we reported at the end of October. And at the SC09 event in Portland, Nvidia announced that it is keeping the Tesla brand for its next-generation GPU co-processors for workstations and servers. The Fermi chips will be sold under the Tesla 20 brand, as it turns out.

Flavors - and then some

According to Andy Keane, general manager of Tesla supercomputing at Nvidia, the Tesla 20 cards will come in two flavors, and the company will sell co-processor systems that can plug right into HPC clusters and link to servers through PCI-Express 2.0 links - and at around 130 watts. Keane bristles at anyone who claims that the fully burdened heat budget for a server - not just the microprocessor, but its memory controller (if it is not integrated), its chipset, and its memory - will be any lower.

With the Fermi family of GPUs, Nvidia is adding L1 and L2 caches to the co-processors and is putting ECC memory scrubbing on the internal GDDR5 video memory on the card as well as on accesses to external server memory. This ECC support, as it turns out, is as important as anything else in the chip if you want to sell GPUs to nuke labs.

They can't have memory errors crash an application that may take weeks or months to run, and they have to trust the answers they get. (IBM's Cell co-processors, used in the number two "Roadrunner" Opteron-Linux supercomputer installed at Los Alamos National Laboratory, have error correction for their memory. But as far as I can ascertain, AMD's Radeon graphics cards and FireStream GPUs do not have ECC.)

The Fermi chip has 512 cores, which is a little more than twice the cores of the first Tesla GPUs. The Fermis bundle 32 cores together into a streaming multiprocessor that has 64 KB of shared L1 cache. All 512 cores have access to a shared 768 KB L2 cache, and they support the IEEE 754-2008 double precision floating point standard.
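The arithmetic behind that layout is easy to check. A minimal sketch, using only the core and cache figures quoted above:

```python
# Fermi core layout, per the figures in the article
CORES_TOTAL = 512
CORES_PER_SM = 32      # cores bundled into one streaming multiprocessor
L1_PER_SM_KB = 64      # shared L1 cache per streaming multiprocessor
L2_SHARED_KB = 768     # L2 cache shared by all 512 cores

sms = CORES_TOTAL // CORES_PER_SM
print(f"{sms} streaming multiprocessors")          # 16
print(f"{sms * L1_PER_SM_KB} KB of L1 in total")   # 1024
print(f"{L2_SHARED_KB} KB of shared L2")           # 768
```

Sixteen streaming multiprocessors of 32 cores each, and just over twice the 240 cores of the first-generation Tesla parts.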

The Fermi chip can, in theory, address up to 1 TB of memory, but the Tesla C2050 GPU co-processor has 3 GB of GDDR5 memory and double precision floating point performance of 520 gigaflops; it has a list price of $2,499. The Tesla C2070 GPU has 6 GB of GDDR5 memory and is rated at 630 gigaflops; it costs $3,999. The bang for the buck is best with the smaller unit, which weighs in at $4.81 per gigaflops compared to the $6.35 per gigaflops of the faster GPU.

The Nvidia Tesla 20 series appliances cram four GPUs into a 1U form factor, with four links out to server nodes. The S2050 uses the slower C2050 GPUs, is rated at 2.08 teraflops, and will cost $12,995 when it ships. That works out to $6.25 per gigaflops, so you are paying an extra $2,999 for the server that wraps around the four Tesla 20 GPUs.

A 1U appliance with four of the faster C2070 GPUs delivers 2.52 teraflops of double-precision floating point performance and costs $18,995, or $7.54 per gigaflops. By comparison, a prior-generation C1060 GPU with 240 cores and delivering only 78 gigaflops at double precision cost $1,699 when it started shipping in June, or about $22 per gigaflops. (No one really cares about single precision and I am ignoring it.)
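Those bang-for-the-buck figures are straightforward to reproduce. A quick sketch using the list prices and double-precision ratings quoted above (the product names are labels for this sketch; Nvidia did not give the faster appliance a number in its SC09 materials):

```python
# List price (USD) and double-precision gigaflops, from the article
parts = {
    "Tesla C2050 card":     (2499, 520),
    "Tesla C2070 card":     (3999, 630),
    "S2050 1U appliance":   (12995, 2080),
    "faster 1U appliance":  (18995, 2520),  # four C2070s
    "prior-gen C1060 card": (1699, 78),
}

for name, (price, gflops) in parts.items():
    print(f"{name}: ${price / gflops:.2f} per gigaflops")
```

This prints $4.81, $6.35, $6.25, $7.54, and $21.78 per gigaflops respectively, matching the ratios in the story.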

The Tesla 20 GPU co-processors and the appliances based on them will be available in the second quarter of 2010, says Keane. The GeForce graphics cards based on the same GPU chips will start rolling out in the first quarter.

GPU for you, sir

There's some other secret sauce in the Fermi GPUs that is going to get HPC nerds thinking about using GPUs.

For one thing, they will support Nvidia's C++ compiler, not just C. Keane dodged the question of exactly when C++ support would be ready, and laughed at the idea that Intel's own C++ or Fortran compilers would be ported to CUDA. The Portland Group said this week at SC09 that it has tweaked its popular Fortran compiler to work within the CUDA parallel programming environment that Nvidia created for its graphics cards and co-processors; this CUDA Fortran compiler has been in beta testing for about three months.

There are projects that have pulled the CUDA libraries into the popular Matlab tool as well as into the R and Python programming languages, and Java applications have been able to bind into the CUDA environment for about a year. The CUDA toolkit already snaps into the open source Eclipse and Microsoft Visual Studio development tools.

Another bit of secret sauce, also revealed at SC09, is a set of new InfiniBand and Tesla drivers that InfiniBand chip maker Mellanox and Nvidia have cooked up to streamline the movement of data from the InfiniBand ports to the CPU's main memory and then down through the PCI-Express bus to the GPU card.

According to Keane, the way it works now, data comes in over InfiniBand and works its way into main memory; before it is moved down to the GPU, it is copied again, and that copy is what is moved. The driver changes allow the data sitting in memory to be moved down to the GPU in one fell swoop, and in early tests on clusters that use GPUs and InfiniBand together, Nvidia and Mellanox have been able to demonstrate a 30 per cent speedup for applications.

Ideally, said Keane, you want the data to move directly from InfiniBand to the PCI-Express bus and on out to the GPU memory, where the data processing actually takes place. The CPU would be relegated to the role of traffic cop, only getting data into its memory when the application requires it to do some processing. This capability is not available yet, and Keane didn't say when to expect it, either.

Finally, Nvidia this week released the beta of the CUDA Toolkit 3.0, which exploits the Fermi GPU's features. ®
