Original URL: http://www.theregister.co.uk/2008/06/16/nvidia_tesla_10/

Nvidia blows out Moore’s Law with fresh Tesla

Insane horsepower for the HPC geek on the go

By Ashlee Vance

Posted in Servers, 16th June 2008 18:11 GMT

Nvidia pitches its Tesla hardware as a magical solution for the world’s toughest computing problems. Just move your code that runs well across many processors over to the Tesla boards, and Shazam! You sometimes enjoy 400 per cent improvements in overall performance.

Despite such mind-blowing increases in horsepower, Tesla continues to occupy a space that one could characterize as ultra-niche. Only the brave few have navigated Nvidia’s CUDA programming apparatus to tweak their code for the general purpose graphics processors inside of the Tesla systems.

That ultra-niche, however, may grow into a niche over the coming year thanks to the introduction of more powerful Tesla systems.

Key to the release today of the Tesla-10 Series processor is the presence of 64-bit, double-precision floating point support. This upgrade lets Nvidia take better care of high performance computing customers – those who make heavy use of mathematical operations – who will likely drive Tesla’s early success.

The Tesla-10 Series chip ships with 240 processing cores – up from 128 cores in the previous product. These are not, however, the beefy cores associated with general purpose chips made by Intel, AMD and others. Instead, they’re little babies that have previously just handled graphics jobs.

Overall, the new chip boasts 1.4bn transistors and 1 Teraflop of computing muscle.

That 1 Teraflop figure is up from half a Teraflop with the older Tesla 8 chip. In addition, the new Tesla chip kicks memory support up to 4GB from 1.5GB, and that’s again a key leap forward for placating the HPC crowd.

The base unit inside of a Tesla chip has been dubbed a Thread Processor Array (TPA). The TPA consists of eight cores, which all have access to a shared memory bank. Nvidia then combines 30 of the TPAs to make a full Tesla 10 chip.
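For anyone keeping score, the core count falls straight out of that layout. A minimal sketch using only the figures quoted above; the constant and function names are made up for illustration and are not Nvidia terminology:

```python
# Figures from the article: eight cores per Thread Processor Array (TPA),
# 30 TPAs per Tesla 10 chip. Names here are illustrative only.
CORES_PER_TPA = 8
TPAS_PER_CHIP = 30


def cores_per_chip(cores_per_tpa=CORES_PER_TPA, tpas_per_chip=TPAS_PER_CHIP):
    """Total processing cores on one chip."""
    return cores_per_tpa * tpas_per_chip


print(cores_per_chip())  # 240, matching the quoted core count
```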

Those customers looking to get into the Tesla game have a couple of system options. Nvidia has rolled out the S1070 box, which is a 1U unit that contains 4 of the Tesla 10 chips. So, that’s 960 cores running at 1.5GHz, reaching 4 Teraflops of performance. The system also holds 16GB of memory, has peak memory bandwidth of 408GB/sec and consumes 700 watts.
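The S1070 numbers can be sanity-checked with a little arithmetic. This is a rough sketch built only from the figures quoted above; it assumes memory and bandwidth are split evenly across the four chips, which the article does not state, and all names are illustrative:

```python
# S1070 figures as quoted in the article.
CHIPS = 4
CORES_PER_CHIP = 240
CLOCK_GHZ = 1.5
PEAK_TFLOPS = 4.0
MEMORY_GB = 16
BANDWIDTH_GB_S = 408
POWER_W = 700


def s1070_summary():
    """Derive per-chip and efficiency figures from the quoted totals."""
    total_cores = CHIPS * CORES_PER_CHIP
    return {
        "total_cores": total_cores,
        # Assumption: resources are divided evenly between the chips.
        "memory_per_chip_gb": MEMORY_GB / CHIPS,
        "bandwidth_per_chip_gb_s": BANDWIDTH_GB_S / CHIPS,
        "gflops_per_watt": PEAK_TFLOPS * 1000 / POWER_W,
        # Peak flops implied per core per clock tick by the quoted numbers.
        "flops_per_core_per_clock": PEAK_TFLOPS * 1e12
        / (total_cores * CLOCK_GHZ * 1e9),
    }


for name, value in s1070_summary().items():
    print(f"{name}: {value:.2f}")
```

On these figures the box works out to 960 cores, 4GB of memory per chip, and a little under 6 gigaflops per watt at peak.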

[Image: comparison slide of Nvidia's old and new Tesla gear. Caption: Tale of the Tesla Tape]

You’ll need to connect the S1070 to a host server with a general purpose CPU via a pair of PCIe Gen2 cables.

If an entire box isn’t your thing, then Nvidia offers up the C1060, which is a cigarette carton-sized device that plugs into the PCIe slot on a motherboard. This puppy holds a single Tesla 10 chip clocked at 1.33GHz, has 4GB of memory and eats up 160 watts. It also has an on-board fan, which is a bit of a worry if you think about packing a cluster full of these systems. Damn those moving parts!

To fully appreciate the target market of Tesla systems, one need only look at the performance comparisons provided by Nvidia between the Tesla 8 and 10 systems. On an algorithm having to do with Dynamics of Black Holes, for example, the new unit runs four times faster than its predecessor and a whopping 84 times faster than a CPU. And, if you’re into Cholesky Factorization, then the Tesla 10 unit shows much more dramatic scaling than the Tesla 8 units while also coming close to quadrupling performance.
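Those two black hole figures also imply something about the older part. A quick sketch, assuming both speedups were measured on the same workload against the same CPU baseline (the article does not say so explicitly); the names are illustrative:

```python
# Speedups quoted for the Dynamics of Black Holes benchmark.
TESLA10_VS_CPU = 84.0     # Tesla 10 relative to a CPU
TESLA10_VS_TESLA8 = 4.0   # Tesla 10 relative to Tesla 8


def implied_tesla8_vs_cpu(new_vs_cpu=TESLA10_VS_CPU, new_vs_old=TESLA10_VS_TESLA8):
    """Speedup of the older part over the CPU, if the baselines match."""
    return new_vs_cpu / new_vs_old


print(implied_tesla8_vs_cpu())  # 21.0
```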

Now, Nvidia CEO Jen-Hsun Huang might have said that the processor is dead, but the company really seems to see regular CPUs living alongside these GPGPU systems. Nvidia talked to us an awful lot about heterogeneous computing where the CPUs handle some tasks, and GPUs take on those specialized, parallel tasks that can map well onto the weird silicon.

And it’s that mapping that is really the heart of the matter around GPGPUs.

The knock on all of the major accelerator options, including GPGPUs, FPGAs, Cell chips and specialized silicon from a company like ClearSpeed, is that they’re too funky for use by many applications or developers.

Nvidia does its best to counter the software skeptics with CUDA – a development environment based on C that helps push certain jobs onto GPGPUs.

During a recent all-day meeting at Nvidia’s headquarters in Santa Clara, the company rolled out a number of customers who have faced CUDA and won. These folks did stuff like plasma and radiation modeling and oil and gas exploration. The consensus seemed to be that it takes about a month to learn the CUDA nuances and tweak code for the GPGPUs.

Moving forward, Nvidia plans to invest in Fortran, C++, GPU cluster, profiler and debugger aspects around CUDA.

It must be said that the market for Tesla and other similar accelerators appears quite limited for the foreseeable future. No matter how easy the hardware guys make it sound, coding for these things requires some software savvy, and only parts of applications will lend themselves to the accelerators.
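The "only parts of applications" caveat matters more than it might sound. One standard way to reason about it is Amdahl's law, which is our illustration rather than anything Nvidia presented: if only a fraction of the runtime can be offloaded, the untouched remainder caps the overall gain. A minimal sketch, borrowing the 84x figure quoted earlier and some purely hypothetical fractions:

```python
def overall_speedup(accelerated_fraction, accelerator_speedup):
    """Amdahl's law: whole-application speedup when only part is accelerated.

    accelerated_fraction: share of the original runtime that can be offloaded (0 to 1)
    accelerator_speedup:  how much faster that share runs on the accelerator
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if accelerator_speedup <= 0:
        raise ValueError("accelerator_speedup must be positive")
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / accelerator_speedup)


# Hypothetical fractions; 84x is the black hole benchmark figure from the article.
for fraction in (0.5, 0.9, 0.99):
    print(f"{fraction:.0%} offloaded -> {overall_speedup(fraction, 84):.1f}x overall")
```

Under these assumptions, an application with half its runtime stuck on the CPU tops out at roughly 2x overall, however fast the GPU is on the other half.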

That’s why you hear Intel banging on about Larrabee – the x86-based, many-core, GPU-like product it’s meant to ship in 2010. Intel claims that its compiler will do a lot of the dirty work, pushing the right bits of code onto the accelerator. In addition, developers start out in familiar territory with x86 instructions.

But, er, Larrabee is just slideware for the moment, and it’s hard to win developers’ hearts and minds even if you give the best slide.

So, for the time being, it’s Nvidia marching on and telling a pretty decent story. The application boosts shown with Tesla blow out anything you could expect to receive from Moore’s Law and advancing CPUs. Rather tellingly, it’s not easy to buy Tesla systems from the usual server top dogs. HP has some kind of buddy relationship with Nvidia around the boxes, and start-ups like Acceleware will offer the gear with their software. Things get a bit trickier after that. ®