Nvidia Tesla bigwig: Why you REALLY won't need x86 chips soon
Find out where Intel, AMD, ARM stand in GPU giant's roadmap
Interview Life is what happens when you are trying to do other things, as the old saying goes.
Jen-Hsun Huang, co-founder and CEO of Nvidia, has been perfectly honest about the fact that the graphics chip maker didn't intend to get into the supercomputing business. Rather, it was founded by a bunch of gamers who wanted better graphics cards to play 3D games. Fast forward two decades, though, and the Nvidia Tesla GPU coprocessor and the CUDA programming environment have taken the supercomputer world by storm.
Nearly two years ago, Steve Scott, the CTO at supercomputer maker Cray who designed several generations of supercomputers and interconnects, joined Nvidia to be CTO of its Tesla GPU coprocessor unit.
TPM (Vulture Central's interviewer): You know the kinds of things that El Reg cares about. We care about the GPU computing roadmap, we care about Project Denver ARM cores, and we care about servers and connectivity and the kind of role that you might play in both.
Steve Scott: Well, let me start with this: Jen-Hsun talked about it already, the fact that with "Logan" Tegra is going to become GPU-capable. What we are seeing is a convergence between what Tegra is doing and what Tesla is doing.
Tesla could never do what we do without GeForce. The HPC community is just far too small to support the kind of development it would take to build a competitive processor. This is why Cray got out of building processors, and recently Cray even got out of building interconnects, because the HPC market just isn't big enough. So Tesla has been completely enabled by GeForce. The cool thing that is going on right now is that we are going to see that extend from GeForce to Tesla to Tegra. All of the GPU computing, all of the software – CUDA, OpenACC, and anything else in the software stack – is now going to work on phones, tablets, laptops, whatever. On the other end of the spectrum, Tesla is becoming more Tegra-like because we are going to start integrating CPUs into Tesla.
TPM: Will they be literally the same processors used in Tegra and Tesla, or will there be variants of the Denver core aimed at mobile and server devices?
Scott: Our Denver project is really aimed at putting out a high-performance ARMv8 processor. Our Denver 64-bit ARM core will be higher performance than anything you can buy from ARM Holdings. That core is going to show up in Tegra, but it won't show up in all of the Tegra processors. We will still have Tegra processors that use stock ARM cores as well, like we use Cortex-A9 cores today, but Denver will show up in the high end.
As an architecture licensee, the thing to remember is that you can tweak an ARM core to change its performance, but you can't change the architecture one lick. You have to conform to the ISA, and they are quite disciplined about that.
TPM: Can a full licensee, like you are, add to the ISA?
Scott: No. You can add system-on-chip features, you can put on a video transcoder, for instance, or a different network interface, but the instruction set is the ARM-approved ISA and that is it. What that means is that you can compile any software and it will run on any ARM core, and it's a good thing.
TPM: So how do we think about Tesla and Tegra going forward? Will there always be a Tesla, or does Tegra just get fatter and faster?
Scott: As the CTO of Tesla, I sure hope so. (Laughs)
TPM: I would think you would know ... er ... (More laughs)
Scott: There are no current plans that I am aware of to do away with Tesla. But seriously ... the products don't have to converge, but we do end up with a converged architecture. This is really unprecedented. The right way to build a consumer processor has, historically, not been the right way to build a supercomputer processor. Go back in time and look at a Cray 1 and a Motorola 6800 or MOS 6502. They were just night-and-day different. Go back even five years and they were quite different.
But now everybody is constrained by power – the most important thing in your phone is power efficiency because you have got one watt, and the most important thing in a $100m supercomputer is power efficiency because it is getting expensive to plug the things in. So the right way to build supercomputers in the future is going to be with lots of little power-efficient cores. You are going to have to do that to get the efficiency. You don't really want to build a really complex processor for a supercomputer, even though that would give you faster single-thread performance, because it is going to become power inefficient.
So the question becomes, do you have lots of little cores, or lots and lots of little cores? Do you have 100 cores, or do you have 1,000 cores?
TPM: You already do a lot of differentiation with the GPUs today. You scale up and down the number of CUDA cores, the number of SMs, and the memory, and so forth – and decide what features to turn on and off in each product, whether it is visualization or dynamic parallelism or Hyper-Q, how much single-precision or double-precision math, or whatever. So there will be similar differentiation between Tegra and Tesla?
Scott: It's going to be stuff around the edges – what kind of network interface, how much memory bandwidth you have, do you put in ECC, do you put in a high-throughput double-precision floating point. The answer to all of those questions is yes for HPC, and no for the mobile space. But the architecture – which means taking some heavy-weight cores that are designed for single-thread performance and coupling those with lots of cores that are designed for power efficiency – is the same between the two.
So we can now develop the "Maxwell" family of GPUs, and that will go into the Tesla line and into the "Parker" family of Tegra processors. Just like today we make a GK104, a GK107, and a GK110 SKU of the same basic architectural family. The interesting bit about dynamic parallelism in particular is that we won't regress. Any future Tesla GPU will have it, even though it was first introduced only in the GK110 used in the Tesla K20.
(Scott did not know when dynamic parallelism might be added to future Tegra CPU-GPU hybrids.)
The point is, once you make Tegra compute-capable and once you integrate GPU cores and have the same basic architecture, we now actually have what the cute little phrase – "from cell phones to supercomputers" – promised. It means we have higher volume, which provides the foundation for Tesla.
TPM: At the moment, Tesla is basically a slightly modified GeForce card aimed at servers and workstations. But going forward, it is still going to be a little different in that with both Tesla and Tegra, you are going to have both CPUs and GPUs on a single die. . . .
Scott: They don't necessarily have to be on a single die. You could get the same effect by having a custom interface between the two.
TPM: As Intel does with Xeon E3s and the HD graphics in the same package, yup. Although in your case, you would be leaving out or not turning on the CPU, where Intel is enabling or disabling the graphics, depending on whether it is aimed at a workstation or a server.
Scott: And eventually, depending on how successful ARM is in attracting games, you could imagine game machines in the future – a gaming PC – with an ARM plus GPU instead of X86 plus GPU.
TPM: I have been thinking for some time that you would get into the console business, and then servers proper, and then maybe PCs – whatever that term might mean in the future – aside from a smartphone and a tablet.
Scott: The point is, you can serve all of our current businesses with the same architecture. But we will continue to interoperate with Intel and AMD processors – that's important to us and we will continue to do that – and there will be overlap. But you can imagine a future where you really don't need an X86 processor anymore because we have an integrated ARM processor.
What's the difference between Tesla and Tegra?
TPM: For the HPC community, what will be the practical difference between Tegra and Tesla? What is to keep supercomputer shops from trying to build Tegra supercomputers out of those future Parker chips, or even Logan?
Scott: Tegra will never have a good network interface because it doesn't need one, and Tegra will not have the same amount of memory or bandwidth into the memory subsystem that Tesla has. At some point, you might have stacked memory in Tegra, like we are planning to do with the "Volta" Tesla chips, but it will be much smaller.
Tesla will never have high enough volume to justify the bulk of the engineering work that it takes to do a full solution. But it will have enough volume to justify the incremental engineering work that is necessary to take the consumer parts and make supercomputers.
TPM: How beefy will you get those Denver cores used in the Tesla products? Will it be enough to get rid of the X86 entirely?
Scott: That's the goal.
In terms of what you can do with it, there is really no difference between an ARM ISA and an X86 ISA. The ARM ISA is a little cleaner in terms of RISC, but an X86 processor is really just a RISC processor with this wrapper around it that converts X86 into a RISC ISA. I am very happy that most of the world thinks that ARM provides a great efficiency advantage over X86. The truth is, it really doesn't. It is a small second-order effect. The reality is, you squint and there is no power advantage to ARM.
The advantages that are important are that ARM is open, ARM is much higher volume, many more people are using it, and there is a lot of opportunity and ability to innovate. Historically speaking, that tends to win. It is the classic Innovator's Dilemma. I expect ARM to do to Intel what Intel did to RISC and mainframes. You can't say for sure that it will happen, but again, volume (which means you can get by with lower margins) and openness and lots of people innovating should win.
TPM: Can you plunk a future InfiniBand port onto the Tesla package as well? Is that desirable? Can you put a ConnectX adapter or a whole switch, or a piece of a distributed switch like Calxeda is doing with its ARM server chips, down there?
Scott: You certainly could.
I think you get a lot of benefit from just doing the NICs. There are some pros and cons to trying to integrate the router as well, and it has to do with the ability to build different strengths of networks. It is easier to build fat or skinny networks this way if the ratio of processor to router silicon is not baked in. You also get pass-through traffic then, and you are using processor pins to route packets that are coming into the processor, and then out again, but are not terminating or sourcing at the processor. You are burning pins on your processor, which is fine if you always know what your configuration is going to be.
I am not going to tell you precisely what we are going to do, but it is a story of integration. We are looking out, later on in this decade, at a world where the third party HPC network ecosystem may go away.
If you look at the current Top 500, you have a bunch of Ethernet systems that are all at the bottom of the list and they have crappy efficiency, even on Linpack. Of the credible HPC systems, you have InfiniBand and custom networks from Cray, SGI, IBM, and a few others. Looking at that landscape going forward, QLogic is off the table and Cray's network is off the table because Intel has that and Cray is getting out of networks. BlueGene is off the table, from everything I understand, and there is no future BlueGene roadmap. The Tofu network on the big K machine from Fujitsu doesn't really have a commercial future. SGI is currently doing a custom network, but I don't know how long they will be able to continue to do that.
So what is left? It is Mellanox. And what does Mellanox do? Basically, it hooks up Xeon servers into clusters. And if Intel goes to a proprietary integrated network fabric, that doesn't mean that Mellanox's days are numbered – I don't want to give people the wrong impression here – but there is certainly a threat.
TPM: That's how I see it, too. And have said as much. Intel is not just buying up Fulcrum Microsystems, QLogic, and the Cray interconnects as defensive maneuvers to keep them out of enemy hands, but because it wants to build something and drive it down into the chips and into the switch ASICs that it clearly wants to sell to hit its 2016 target of $20bn in sales for the Data Center and Connected Systems group.
(I didn't think of it at the time during this interview, but maybe Nvidia should buy Mellanox and get it over with, keeping it in neutral territory?)
Scott: We are looking at a landscape where there may not be good third party networks available to interconnect GPUs. We are also looking at selling processors, not just being an accelerator to other processors, so we need a network story.
TPM: Something like the Echelon development project you did for DARPA? That looked an awful lot like the Cray "Aries" interconnect in there.
Scott: The Echelon plan was to have an integrated NIC and a lot of bandwidth coming off the processor supporting a global address space across the machine – native loads and stores everywhere, and all of the synchronization that works between cores on a chip works seamlessly between cores on different chips. So you have got this very tightly integrated network fabric. And yes, our vision is a dragonfly network, same as Aries. The details are different, but there is a dragonfly topology.
TPM: I keep coming back to this, but do you have to do this yourself or pray that someone else figures it out?
Scott: That's a great question. At this point, we are just starting to think hard about this. When I was working at Cray, I was working with Bill Dally of Stanford University on Aries, and we are both here at Nvidia. There were two architects of the Aries router, myself and Mike Parker, who is now a senior research scientist at Nvidia. We clearly have the ability.
The question is, what do we do about it? It is not something we have figured out completely. We have ideas, and it is not something we are talking about, but we will make sure that there are good network solutions for future processors and you can imagine that there will be tighter coupling to the processor. We are talking to potential partners, too, since we really don't have any aspirations to be a standalone systems company.
TPM: And yet, Nvidia just announced the Visual Computing Appliance, so I don't necessarily buy that. (laughs)
Scott: Well, that's a specific appliance. . . .
TPM: But seriously. Sometimes you don't have the choice. Look at the choice that Cisco Systems faced when all of the server vendors started snapping up networking companies to expand their TAM and because they saw convergence and network virtualization coming. Cisco could either leverage its networking business to come into servers or lose market share. So there may come a day, because the world is a tough place, when the world may say to Nvidia that it can be an HPC systems company or not be invited to the party.
The trick will be to make whatever you do up there applicable to enterprise computing and hyperscale cloud operators like Facebook and Google. Don't pull an IBM and create BlueGene/Q and then not realize that it could be a killer microserver for running Hadoop jobs with a big price cut and some re-engineering so it fits in standard racks – or, I guess, Open Compute racks. (Laughter)
Scott: I would be delighted if the stuff that we are creating for HPC will be useful for general purpose data centers. We definitely think about such things. And from a networking perspective, a lot of things that are good for HPC are also good for scale-out data centers.
The systems that Google, Amazon, Facebook, and others are fielding are bigger than HPC supercomputers. And as they get rid of disks and start trying to run everything in memory, all of a sudden the network latency is starting to matter in a way that it didn't. They care about global congestion and communication, with jobs like MapReduce, and the amount of bandwidth per node is small compared to HPC.
But if you build a network right, it is sliced anyway. Both networks will have good global congestion control, good separation for different types of jobs, and good global adaptive routing – all with low latency. ®