Nvidia welds together ARM-Kepler ceepie-geepie system for the impatient
Development board for CPU-GPU hybrid apps, or nodes in a parallel cluster perhaps
GTC 2013 Graphics chip maker Nvidia is also an ARM processor maker, and it wants hybrid ARM-GPU chips just as much as you want them. And in the meantime, if you just can't wait, the company is working with Italian electronics manufacturer Seco to kick out another ceepie-geepie card that can be used for software development or to build a parallel system if you want to play around.
The Kayla machine is not just interesting in its own right as a software development platform, but also as a basis for experimental clusters that set the stage for future machines built around Tegra processors with beefier ARM cores and beefier GPUs on the same piece of silicon.
Kayla definitely sets the stage for the future "Logan" Tegra 5 processors, due next year perhaps. The Logan chips will marry a Tegra ARM processor complex with a Kepler GPU, while the "Parker" Tegra 6 chip will sport a custom 64-bit ARMv8 processor and a "Maxwell" GPU all on the same chip.
The Kayla board is a kicker to the Karma board launched at the SC11 supercomputing conference in 2011. And obviously it could also be used to build a prototype ARM-GPU supercomputer, as the Barcelona Supercomputing Center in Spain has been doing for a few years now using a mix of different ARM processors.
Nvidia is taking hybrid computing another step forward with its Kayla card
Jen-Hsun Huang, co-founder and CEO at Nvidia, showed the Kayla board running real-time ray tracing on a set of wine glasses. "The entire modern software stack we know of - CUDA 5 with OpenGL 4.3 geometry shaders, with physics, running Linux - all on the same platform."
Huang added that Nvidia will be able to take what now is a pretty hefty baby system embodied in the Kayla hybrid system and shrink it down to the size of a dime with the Logan CPU-GPU chip. That will be very interesting indeed, particularly if the Tegra 5 has 40-bit memory addressing and a reasonable number of ARM cores and a Kepler GPU with a whole lot of CUDA cores.
But back to the Kayla ceepie-geepie system. You would think that Nvidia would be using the latest Tegra 4 chip in this board, but it is in fact based on a Tegra 3 processor card, just like the predecessor Karma system made by Seco. The other details of the system were not divulged.
But we can do a little guessing about Kayla based on Karma. With the Karma board, the Tegra 3 ran at 1.5GHz and delivered around 6 gigaflops of number-crunching power; it had 4GB of main memory on the board and a Gigabit Ethernet port.
The Karma system plugged in a GeForce GT 520MX mobile GPU coprocessor, which had 48 cores (only 9 percent of the total cores actually on the chip activated) spinning at 900MHz (instead of 1.3GHz); that GPU delivered around 142 gigaflops of floating point oomph.
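Those figures line up with a back-of-envelope peak-throughput calculation: cores, times clock, times floating-point operations per core per cycle. A quick sketch (the Tegra 3 number assumes one flop per core per cycle across four 1.5GHz Cortex-A9 cores, and the quoted 142 gigaflops implies an effective shader clock of roughly 1.48GHz on the 48 Fermi cores at 2 flops per cycle; both are our assumptions, not vendor specs):

```python
# Back-of-envelope peak throughput: cores * clock (GHz) * FLOPs per core per cycle.
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    return cores * clock_ghz * flops_per_cycle

# Tegra 3: four Cortex-A9 cores at 1.5GHz, assuming 1 flop/core/cycle
tegra3 = peak_gflops(4, 1.5, 1)       # ~6 GFLOPS, matching the figure above

# GT 520MX: 48 CUDA cores, 2 flops/core/cycle (fused multiply-add); the
# quoted 142 GFLOPS implies an effective shader clock near 1.48GHz
gt520mx = peak_gflops(48, 1.48, 2)    # ~142 GFLOPS

print(round(tegra3), round(gt520mx))
```

The gap between those two numbers is the whole argument for bolting a GPU onto an ARM chip.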
You can see now why you want GPUs doing the math for ARM processors.
It stands to reason that Nvidia will once again use a GeForce mobile discrete GPU adapter for the Kayla system, but for now the company is not saying which one it is plugging in.
The Karma system had a rumored street price of $500, and that sounds like the right price point two years later for the Kayla system. The Kayla hybrid machine will ship sometime in the spring. ®
The older system was CARMA, for CUDA on ARM Architecture.
The announced development systems are internally named Kayla, and the CARMA name is being dropped.
The two announcements are both "Kayla" devkits. The first is similar to the original CARMA, which had an MXM 3.0 GPU and a Q7 processor module on a carrier board powered by a single DC power rail. Now, with the GPU updated to a Kepler-class GPU, it's named the 'CUDA on ARM MXM devkit'.
The second system is a new mini-ITX carrier board that supports a Q7 processor module and has a PCIe slot. It uses an ATX power supply, and can run much more power-hungry GPUs. Strictly speaking, though, it's only "Kayla" when it uses the same new GPU as the MXM-module version.
The original devkit was developed around an existing Quadro 1000M MXM module with a GF108 Fermi-class GPU. The GPU has 2 SMs of 48 cores each, or 96 CUDA cores. The Quadro 1000M has 2GB of local memory; only a portion of that can be mapped into the ARM's address space at one time.
The new devkit uses a Kepler-class GPU with 2 SMX units (compute capability 3.5, hence "SM35") for a total of 384 CUDA cores. Right now it's configured with 1GB of GDDR5 memory.
For both, the CPU module remains based on the Tegra 3. Neither of the newly announced Tegra 4 products (Tegra 4 and Tegra 4i) has a PCIe interface. That's why this is a "close development model" rather than exactly the same as Logan.
BTW, the CPU module has 2GB of low power DDR2, and the GPU has 2GB of local memory. While the total is 4GB, only about 3GB is directly addressable. 2GB is pretty much the maximum main memory configuration of ARMv7, due to some sparse utilization of the memory map. Plus you have about 1GB of address space into which you can map PCI devices.
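The squeeze is easy to see if you sketch the 32-bit physical address budget. A rough model (the exact carve-up varies by SoC; the window sizes here are illustrative assumptions chosen to match the figures above):

```python
GiB = 1 << 30

address_space = 4 * GiB   # everything a 32-bit ARMv7 core can address
pci_window    = 1 * GiB   # space reserved for mapping PCI devices (assumed size)
peripherals   = 1 * GiB   # SoC registers, ROM, other memory-map holes (illustrative)

# Why 2GB is the practical ARMv7 main-memory ceiling:
usable_ram = address_space - pci_window - peripherals
print(usable_ram // GiB)            # 2

# Of the 2GB CPU DDR2 plus 2GB GPU memory, only the CPU RAM plus whatever
# slice of GPU memory fits through the PCI window is visible at one time:
directly_addressable = 2 * GiB + pci_window
print(directly_addressable // GiB)  # 3 -> the "about 3GB" mentioned above
```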
The Cortex-A15 has an LPAE feature to add a few physical address bits, but it's new, not really used, and doesn't help most ARM use cases. The real fix for the cramped address space is ARMv8.
If the intent is massive compute clusters....
If the intent is massive computing clusters, and not just Angry Birds 3D, then they need a way for those clusters to communicate all that data into and out of the chip. I hope they use something like sRIO (Serial RapidIO).