Original URL: http://www.theregister.co.uk/2013/03/22/arm_gpu/

Calm a CARMA drama chameleon: Barça super waves ARMs, GPUs

Cluster perks up with low-power CPUs bossing graphics chips

By Dan Olds, Gabriel Consulting

Posted in HPC, 22nd March 2013 13:26 GMT

GTC 2013 Over the last few years, we’ve seen a steadily growing buzz surrounding the use of ARM processors in PCs, servers and supercomputers.

Here, at this year's GPU Technology Conference in California, that buzz is even more pronounced. This is due to Nvidia's upcoming 64-bit "Project Denver" ARM cores, and advances in its graphics chips to make machines even less dependent on a fast and powerful (read: Intel Xeon) processor feeding data to number-crunching GPU beasts. El Reg's Rik Myslewski has also written up the ARM chatter at GTC 2013.

While everyone has been debating and speculating about what it would be like to combine Brit-designed ARM cores and GPU accelerators, one organisation has put together some hardware in order to separate the theoretical from the real. The Barcelona Supercomputing Center (that's Barcelona in Spain, not the other one) is building clusters to explore the potential advantages that might arise from combining nippy ARM-compatible chips with fast number-crunching GPUs.

The centre's first attempt, the Tibidabo, was a proof-of-concept system to determine whether it’s possible to build an all-ARM-based cluster. Could they really put together a cluster based on a low-power processor family that's ideally suited to mobile phones, hard drives and handheld games? And, if they could build it, could they find or adapt enough software for it to do useful work?

They were able to construct a two-rack cluster containing 32 blades, 256 nodes, and a total of 512 Tegra 2 ARM cores (two per node). They also ported 11 scientific apps over to ARM's architecture with little difficulty, although they did need to fiddle around with the memory hierarchy to optimise some of the apps.

The performance wasn’t all that great. The total system turned out 512 billion floating-point calculations a second (512GFLOPS) while consuming 3.4kW, yielding 0.15GFLOPS/watt. For context, the best systems on the most recent Green500 list - the top 500 supercomputers ranked by energy efficiency - come in around 2.4 or 2.5GFLOPS/watt; the systems at the end of the list are rated at 0.033GFLOPS/watt.
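For anyone who wants to check the sums, here is a minimal Python sketch of the efficiency arithmetic, using only the figures quoted above. The function and variable names are ours, purely for illustration, and 2.5GFLOPS/watt is taken as the upper end of the range quoted for the list leaders.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency: sustained GFLOPS divided by power draw in watts."""
    return gflops / watts


# Figures quoted in the article for the ARM-only cluster.
arm_cluster = gflops_per_watt(gflops=512, watts=3400)  # 3.4kW = 3,400W

# Green500 reference points quoted in the article.
green500_best = 2.5    # upper end of the "2.4 or 2.5" range
green500_last = 0.033

print(f"ARM-only cluster: {arm_cluster:.2f} GFLOPS/watt")
print(f"List leaders are about {green500_best / arm_cluster:.1f}x more efficient")
print(f"Cluster is about {arm_cluster / green500_last:.1f}x better than the tail")
```

In other words, the all-ARM machine sits comfortably ahead of the tail of the list but a long way behind its leaders.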

What's the world CARMA to?

So the Spanish brainiacs went back to the drawing board and clustered 16 Nvidia CARMA* development boxes as a learning experience they called Pedraforca v1. This system did much better than the ARM-only Tibidabo on energy efficiency, yielding 0.78GFLOPS/watt while running the DGEMM matrix-multiplication benchmark, so they were making progress.

Limitations in the platform (such as the max speed of 400MB/s over the PCIe bus plus an inability to overlap computation and data transfers) meant it couldn’t scale up very well. However, it did lead them to a new breakthrough in their thinking for their next system, which they’ve dubbed Pedraforca v2.
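To see why a slow bus and no overlap hurt, consider a rough Python model of a single DGEMM on square n-by-n double-precision matrices. Only the 400MB/s figure comes from the talk. The GPU rate of 100GFLOPS is a made-up placeholder rather than a CARMA spec, the model assumes the three matrices each cross the bus once, and all the names are hypothetical.

```python
LINK_BYTES_PER_S = 400e6      # 400MB/s PCIe figure quoted in the article
ASSUMED_GPU_FLOPS = 100e9     # ASSUMPTION: placeholder rate, not a CARMA spec


def dgemm_times(n: int, link: float, gpu_flops: float) -> tuple[float, float]:
    """Return (transfer_seconds, compute_seconds) for C = A * B, all n x n doubles."""
    bytes_moved = 3 * n * n * 8   # A and B to the GPU, C back; 8 bytes per double
    flops = 2 * n ** 3            # one multiply and one add per inner-product term
    return bytes_moved / link, flops / gpu_flops


def effective_gflops(n: int, link: float, gpu_flops: float, overlap: bool) -> float:
    """Delivered GFLOPS when transfers can or cannot hide behind computation."""
    t_transfer, t_compute = dgemm_times(n, link, gpu_flops)
    # With perfect overlap the slower phase sets the pace; without it they add up.
    total = max(t_transfer, t_compute) if overlap else t_transfer + t_compute
    return 2 * n ** 3 / total / 1e9


for n in (1000, 4000, 16000):
    serial = effective_gflops(n, LINK_BYTES_PER_S, ASSUMED_GPU_FLOPS, overlap=False)
    hidden = effective_gflops(n, LINK_BYTES_PER_S, ASSUMED_GPU_FLOPS, overlap=True)
    print(f"n={n}: {serial:.1f} GFLOPS serialised, {hidden:.1f} GFLOPS overlapped")
```

Under these assumptions the small problems are bound by the bus rather than the GPU, and being unable to overlap makes matters worse at every size. It is a toy model, not a measurement of Pedraforca v1.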

They’ve decided the key to building a highly efficient system isn’t to erect an accelerated cluster but to build a cluster of accelerators. While there isn’t much difference in the words, there’s a world of difference between the meanings. For Pedraforca v2, they will decouple the CPUs from the GPUs, meaning that the ratio of general-purpose cores to graphics processor cores can be changed to fit the workloads. They will also use direct GPU-GPU data transfers via Mellanox’s ConnectX-3 InfiniBand interconnects.

This will take a huge amount of latency out of the system and, accordingly, reduce the amount of work the CPU needs to do to orchestrate GPU communications. The prototype system will have 64 nodes, each sporting a quad-core Tegra 3 CPU at 1.3GHz that will slide into a 4x PCIe slot on a Mini-ITX carrier. In this configuration, the CPU will only be managing boot and MPI communications, plus minimal traffic cop duty for the GPUs. The point is that you don’t need a hugely fast and powerful processor to fulfill these requirements.

However, Pedraforca v2 will have some processing power in the form of Kepler-based Nvidia K20 GPUs that can deliver 1,170GFLOPS through a PCIe Gen 3 slot. The GPUs will be able to communicate with each other at 40Gbps via the aforementioned Mellanox-fuelled InfiniBand interconnect.
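For a sense of scale, this quick Python sketch converts the quoted 40Gbps link rate into bytes and sets it against the 400MB/s PCIe ceiling reported for Pedraforca v1. It treats 40Gbps as a raw line rate and ignores encoding and protocol overhead, so the real-world gap will be smaller; the helper name is ours.

```python
def gbps_to_mbytes_per_s(gbps: float) -> float:
    """Convert a raw line rate in gigabits/s to megabytes/s (decimal units)."""
    return gbps * 1000 / 8


v1_pcie_mbytes = 400.0                      # PCIe limit quoted for Pedraforca v1
v2_link_mbytes = gbps_to_mbytes_per_s(40)   # raw rate, before any overhead

print(f"40Gbps is at most {v2_link_mbytes:.0f}MB/s")
print(f"That is {v2_link_mbytes / v1_pcie_mbytes:.1f}x the v1 PCIe figure")
```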

The chaps presenting this tech at GTC 2013 pointed out that this isn’t a general-purpose HPC system – it is intended as a host for apps that are GPU-optimised. While they didn’t discuss any FLOPS/watt estimates or performance predictions, it’s safe to say that Pedraforca v2 should be an eye-opener when it comes to energy efficiency and even cost per FLOP. It’s definitely a project worth watching. ®

* CUDA on ARM architecture.