GPU-flingers' bash: Forget the Matrix, Neo needs his tensors
What's a tensor? Glad you asked...
Posted in HPC, 24th May 2017 08:01 GMT
HPC blog Last week, Nvidia held its biggest ever GPU Technology Conference (GTC). The big walk-away is that GPUs are rapidly becoming an expected and standard component of computing, table stakes in many cases, across the computing platform. Big deal right there and hence the frothiness of much of the coverage.
Before the conference, I shared the top-4 questions that I was hoping to get answered:
This also shows how popular I am on Twitter, with one retweet and three likes.
Here’s a quick analysis of how the questions fared.
Q1. What's next after P100?
The P100 was only announced last year and it was quite a feat by itself. People are still clamoring for it, if the company’s financial performance over the past year is any indication. So slideware and a few macho specs about the future would probably have carried the day at GTC17. Would we get more? Well, Nvidia did not disappoint. We saw actual silicon, detailed specs and real benchmarks, which is a very decisive answer to the “so, what comes next?” question.
The V100 (V for Volta) is coming by Q3 of this year and it’s a nice clear step above the P100 (P for Pascal). They follow Kepler and Maxwell, nicely ordered alphabetically. For general workloads, it’s about 1.5x faster than the P100. The exception is Deep Learning kernels, for which it is a whopping 12x faster on paper, thanks to the new Tensor cores that specifically target AI workloads. The headline figure that is projected/claimed is 5x faster than the P100, which is a much more practical target.
Weighing in at a robust 815 mm², oodles of flops of various kinds, and 300W, it’s big, it’s fast, and it’s hot, though it packs enough punch to be one of the most energy efficient chips out there for what it delivers, and it can be used in performance-first or energy-first modes to optimize for one or the other.
Here’s a quick comparison:
Remember that the taxonomy of computing speed goes from “on paper, but you’ll never see it”, to “guaranteed not to exceed, but if the stars are aligned you might see it”, to “possible, if you optimize things well”, to “typical, but it could be lower”. And in general, you’d better look at the minimum speed as much as you look at maximum speed.
For the workloads that are fast emerging, and the optimized frameworks and system software that are available, GPUs and other forms of what we call High Density Processing (HDP) are the way to go.
Our refrain these days is: digitization means lots of data, and making sense of lots of data increasingly looks like either an HPC problem or an AI problem. The chip targets that sweet spot.
So what’s a tensor?
A single number is a “scalar” (zero dimensions, or indices). A row of numbers is a “vector” (one dimension, or index). A 2-dimensional row-column grid of numbers is a “matrix” (two dimensions, or indices). A tensor is just a generalized definition of such mathematical objects, an n-dimensional object that follows specific transformation rules.
In Deep Neural Nets (DNN), you get to have layers and layers of “neurons” with coefficients that must be calculated and aggregations that must be tracked, and all of that can be nicely abstracted into tensors. Tensors are common language in physics, relativity, fluid mechanics and such, but their use in AI makes them fresh territory in the IT verbal landscape. Pretty sure most of the URLs are taken already!
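To make the scalar-vector-matrix-tensor ladder concrete, here is a minimal Python sketch. It only illustrates the "how many indices" part of the definition, not the transformation rules, and the helper name `shape` and the nested-list representation are assumptions made for illustration (real frameworks use their own array types).

```python
def shape(obj):
    """Return the shape of a nested-list 'tensor' as a tuple.

    Assumes the nesting is rectangular (every sub-list at a given
    depth has the same length); this is a hypothetical helper, not
    part of any library.
    """
    dims = []
    while isinstance(obj, list):
        dims.append(len(obj))
        if not obj:          # empty list: nothing further to descend into
            break
        obj = obj[0]
    return tuple(dims)


scalar = 3.0                                   # 0 indices
vector = [1.0, 2.0, 3.0]                       # 1 index
matrix = [[1.0, 2.0], [3.0, 4.0]]              # 2 indices
cube = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]  # 3 indices

for name, obj in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("rank-3 tensor", cube)]:
    s = shape(obj)
    print(f"{name}: {len(s)} indices, shape {s}")
```

The number of indices needed to pick out one element is what the paragraph above calls the dimension.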
And what about Moore’s Law?
Nvidia bills the chip as providing a 5x improvement over P100/Pascal in peak teraflops, and 15x over the M40/Maxwell, which was launched two years ago. Do the math and yep, it’s a better speed improvement than Moore’s Law, in fact more than 4x better.
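"Do the math" can be sketched in a few lines of Python. The 15x and two-year figures come from the text above. The doubling periods of 18 and 24 months are assumptions (two common readings of Moore's Law, which strictly speaking describes transistor counts rather than speed), and the function name is made up for this example.

```python
def moores_law_factor(years, doubling_months):
    """Growth factor expected if capability doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


claimed_speedup = 15.0   # V100 over M40, as billed by Nvidia
years = 2.0              # time since the M40 launch, per the article

for doubling_months in (18, 24):   # assumed doubling periods
    expected = moores_law_factor(years, doubling_months)
    ratio = claimed_speedup / expected
    print(f"doubling every {doubling_months} months: "
          f"expected {expected:.2f}x, claimed is {ratio:.1f}x better")
```

Under either assumption the claimed gain beats the Moore's Law expectation by more than 4x, which is consistent with the figure quoted above.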
Why/how is that, you might ask. And the answer is pretty much this: we used to gain speed by improving frequency, doing the same things at a faster clip and devoting more and more on-chip circuitry to helping one CPU be faster. Like building an ever more opulent palace. That all changed when multi-core and then many-core and now kilo-core chips came along. Instead of that palace, people started building condos and hotels. And as long as there are enough threads and tasks in your app to keep it all occupied, you get better throughput and faster turn-around.
With 5,120+2,560+640=8,320 cores of various types, the V100 is an 8 kilo-core chip. Bytes are way ahead but, yeah, cores too can be counted like Bytes can.
Q2. What's new on client side?
Nvidia rolled out a new deskside beast, the DGX Station, which packs 4x V100s. At 1.5 kW, you’d expect it to come with a big noisy fan, but the box is liquid cooled. It’s a closed loop, so you don’t have to call the plumber, and it makes the box nearly noiseless. At about $70k, it’s not quite your average “client” machine, it’s more of a “laptop of the gods”! Personal AI Supercomputer is how it was billed. But it looks like a workstation, so it counts! We didn’t notice any news on a follow-on to the GeForce GTX 1080 Ti, Nvidia's flagship gaming GPU, based on the Pascal architecture.
Q3. What's up with OpenPower?
We still think the real battle in server architecture is between Intel’s in-house coalition and what has come to be known as the Rebel Alliance: IBM’s OpenPower industry coalition. Intel has its all-star team: Xeon Phi, Altera, Omni-Path (plus Nervana/Movidius), while OpenPower counters with a dream team of its own: POWER, Nvidia, Xilinx, and Mellanox (plus TrueNorth). The all-in-house model promises seamless integration and consistent design, while the extended team offers a best-of-breed approach. Both camps are pretty formidable. Both have merits. And there is real differentiation in strategy, design, and implementation.
Last year, the Rebels held their event with GTC. Not this year. And despite our continuing enthusiasm for some solid competition, and market checks that seem to indicate the Rebels are doing quite fine, we haven’t seen as much of OpenPower this past year as we had expected.
So it was quite reassuring to see the V100 come with a way faster NVLink interconnect technology. The second generation NVLink moves data at 300 GB/s. That’s 6 links, each at 25 GB/s in each direction, which adds up to 150 GB/s one way and a 300 GB/s total out + in data rate.
What is also significant here is the improved scalability for multi-GPU/CPU configurations. NVLink supports CPU mastering and cache coherence with IBM POWER9 CPUs. That’s a pretty big deal and a nice boost for the Rebels.
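The NVLink arithmetic is easy to check. The sketch below uses only the figures quoted above (6 links, 25 GB/s per link per direction). The function name is hypothetical, and these are peak numbers, not measured throughput.

```python
def nvlink_bandwidth(links, gb_per_s_per_link_per_direction):
    """Return (one-way, out + in) aggregate peak bandwidth in GB/s."""
    one_way = links * gb_per_s_per_link_per_direction
    both_ways = 2 * one_way   # each link carries traffic in both directions
    return one_way, both_ways


one_way, total = nvlink_bandwidth(links=6, gb_per_s_per_link_per_direction=25)
print(f"one way: {one_way} GB/s, out + in: {total} GB/s")
```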
Q4. What's the plan to keep ahead of new AI chips?
The competition in AI chips is heating up and we expect several new AI chips and architectures to show up in the coming months. They’re really optimizing for AI workloads, which means lower-than-64-bit arithmetic and multiply-accumulate (MAC) instructions, multiple pipelines, separate integer and floating-point paths, related register/memory design, etc.
No doubt you noticed the new Tensor cores in the V100 and wondered what they are. Each Tensor core can do 64 multiply-add ops per cycle. It multiplies 16-bit numbers into 32-bit intermediates and adds them to 32-bit numbers, resulting in 32-bit numbers. Each multiply-add counts as 2 FP ops, in mixed precision. There are 640 of them (8 per streaming multiprocessor (SM), and there are 80 of those), all running at 1.455 GHz, so 64*2*640*1.455 GHz = 119+ TFLOPS, and that’s where the 120 Tensor TFLOPS comes from.
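Both halves of that paragraph can be sketched in Python: the peak-throughput arithmetic, and what "multiply 16-bit, accumulate 32-bit" means for a single multiply-add. All function names are hypothetical. The mixed-precision part is a scalar emulation that uses the standard `struct` module to round values to half and single precision, and it says nothing about how the hardware is actually wired.

```python
import struct


def peak_tensor_tflops(fma_per_cycle, flops_per_fma, num_cores, clock_ghz):
    """Peak rate in TFLOPS: ops per cycle per core x cores x clock."""
    gflops = fma_per_cycle * flops_per_fma * num_cores * clock_ghz
    return gflops / 1000.0


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mixed_precision_fma(a, b, c):
    """Emulate d = a * b + c with 16-bit inputs a, b and a 32-bit accumulator c."""
    # Python floats are 64-bit, so the product of two FP16 values is exact here.
    product = to_fp16(a) * to_fp16(b)
    return to_fp32(product + to_fp32(c))


tensor_cores = 8 * 80   # 8 per SM, 80 SMs
peak = peak_tensor_tflops(64, 2, tensor_cores, 1.455)
print(f"{tensor_cores} Tensor cores, peak of about {peak:.1f} TFLOPS")

# 0.1 cannot be stored exactly in 16 bits, so the inputs are already rounded.
print("0.1 as FP16:", to_fp16(0.1))
print("0.1 * 0.1 + 1.0 in mixed precision:", mixed_precision_fma(0.1, 0.1, 1.0))
```

The rounding of the 16-bit inputs is where the precision trade-off shows up: the accumulator keeps 32 bits, but the operands feeding it do not.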
Now, the P100 was pretty beastly for AI work and this just raises the game in a big way. Could you use even lower precision and make it go even faster? Yes, you could. But as you reduce precision, you’re going to need either well-behaved problems or more neurons and more layers to make up for it. And the V100 is still good for a lot of other workloads, which is a clear difference between it and the chips that go all the way with AI and, in the process, make themselves unsuitable for other workloads.
Nvidia also announced an immersive collaboration suite called Holodeck, showed off how deep learning can help out in complex ray tracing graphics to improve resolution, driverless car technology that can assist the driver, some cool work with the AI software stack and containers, an updated DGX server now with 8x V100s, and HGX-1, a server for cloud computing that can easily vary the ratio of CPUs and GPUs that can be provisioned.