The Register® — Biting the hand that feeds IT

Calm a CARMA drama chameleon: Barça super waves ARMs, GPUs

Cluster perks up with low-power CPUs bossing graphics chips

GTC 2013 Over the last few years, we’ve seen a steadily growing buzz surrounding the use of ARM processors in PCs, servers and supercomputers.

Here, at this year's GPU Technology Conference in California, that buzz is even more pronounced. This is due to Nvidia's upcoming 64-bit "Project Denver" ARM cores, and to advances in its graphics chips that make machines even less dependent on a fast and powerful (read: Intel Xeon) processor feeding data to number-crunching GPU beasts. El Reg's Rik Myslewski has penned a great article on the GTC 2013 ARM chatter.

While everyone has been debating and speculating about what it would be like to combine Brit-designed ARM cores and GPU accelerators, one organisation has put together some hardware in order to separate the theoretical from the real. The Barcelona Supercomputing Center (that's Barcelona in Spain, not the other one) is building clusters to explore the potential advantages that might arise from combining nippy ARM-compatible chips with fast number-crunching GPUs.

The centre's first attempt, the Tibidabo, was a proof-of-concept system to determine whether it’s possible to build an all-ARM-based cluster. Could they really put together a cluster based on a low-power processor family that's ideally suited to mobile phones, hard drives and handheld games? And, if they could build it, could they find or adapt enough software for it to do useful work?

They built a two-rack cluster containing 32 blades, 256 nodes and a total of 512 Tegra 2 ARM cores. They also ported 11 scientific apps to the ARM architecture with little difficulty, although they did need to fiddle around with the memory hierarchy to optimise some of them.

The performance wasn’t all that great. The total system turned out 512 billion floating-point calculations a second (512GFLOPS) while consuming 3.4kW, yielding 0.15GFLOPS/watt. For context, the best systems on the most recent Green500 list, which ranks the top 500 supercomputers by energy efficiency, come in around 2.4 or 2.5GFLOPS/watt; the systems at the end of the list are rated at 0.033GFLOPS/watt.
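
For anyone who wants to check the sums, here is a minimal Python sketch of the efficiency arithmetic. The 512GFLOPS, 3.4kW, 2.5 and 0.033 figures are the ones quoted above; the function and variable names are our own and purely illustrative.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency: sustained GFLOPS divided by power draw in watts."""
    return gflops / watts


# Figures quoted above for the all-ARM cluster (3.4kW = 3,400 watts).
arm_cluster = gflops_per_watt(gflops=512.0, watts=3400.0)

# Reference points quoted above from the energy-efficiency list (GFLOPS/watt).
list_best = 2.5
list_worst = 0.033

print(f"All-ARM cluster: {arm_cluster:.2f} GFLOPS/watt")
print(f"Gap to the best listed systems: {list_best / arm_cluster:.1f}x")
print(f"Lead over the tail of the list: {arm_cluster / list_worst:.1f}x")
```

In other words, the cluster sat comfortably above the bottom of the list but more than an order of magnitude behind the leaders.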

What's the world CARMA to?

So the Spanish brainiacs went back to the drawing board and clustered 16 Nvidia CARMA* development boxes as a learning experience they called Pedraforca v1. This system did much better than the ARM-only Tibidabo on energy efficiency, yielding 0.78GFLOPS/watt while running the DGEMM matrix-multiplication benchmark, so they were making progress.

Limitations in the platform (such as the max speed of 400MB/s over the PCIe bus plus an inability to overlap computation and data transfers) meant it couldn’t scale up very well. However, it did lead them to a new breakthrough in their thinking for their next system, which they’ve dubbed Pedraforca v2.
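
To see why the lack of overlap hurts, consider a toy Python model of a GPU working through a job in chunks. Only the 400MB/s figure comes from the talk; the chunk count, chunk size and per-chunk compute time are made-up values for illustration, and the model only considers copies in one direction (host to GPU).

```python
def transfer_seconds(megabytes: float, bandwidth_mb_s: float = 400.0) -> float:
    """Time to push one chunk over the bus at the quoted 400MB/s."""
    return megabytes / bandwidth_mb_s


def serial_time(chunks: int, chunk_mb: float, compute_s: float) -> float:
    """No overlap: every chunk is copied first, then computed on."""
    return chunks * (transfer_seconds(chunk_mb) + compute_s)


def overlapped_time(chunks: int, chunk_mb: float, compute_s: float) -> float:
    """Idealised overlap: chunk i+1 is copied while chunk i is computed.

    The first copy and the last compute cannot be hidden; every step in
    between costs whichever of the two activities is slower.
    """
    copy_s = transfer_seconds(chunk_mb)
    return copy_s + (chunks - 1) * max(copy_s, compute_s) + compute_s


# Hypothetical workload: 10 chunks of 100MB, 0.25s of GPU work per chunk.
no_overlap = serial_time(chunks=10, chunk_mb=100.0, compute_s=0.25)
with_overlap = overlapped_time(chunks=10, chunk_mb=100.0, compute_s=0.25)

print(f"Without overlap: {no_overlap:.2f}s")
print(f"With overlap:    {with_overlap:.2f}s")
```

When copy time and compute time are roughly balanced, as in this invented case, hiding one behind the other nearly halves the run time. A platform that can't do that leaves the GPU idle for about half of it.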

They’ve decided the key to building a highly efficient system isn’t to erect an accelerated cluster but to build a cluster of accelerators. While there isn’t much difference in the words, there’s a world of difference between the meanings. For Pedraforca v2, they will decouple the CPUs from the GPUs, meaning that the ratio of general-purpose cores to graphics processor cores can be changed to fit the workloads. They will also use direct GPU-GPU data transfers via Mellanox’s ConnectX-3 InfiniBand interconnects.

This will take a huge amount of latency out of the system and, accordingly, reduce the amount of work the CPU needs to do to orchestrate GPU communications. The prototype system will have 64 nodes, each sporting a quad-core Tegra 3 CPU at 1.3GHz that will slide into a 4x PCIe slot on a Mini-ITX carrier. In this configuration, the CPU will only be managing boot and MPI communications, plus minimal traffic cop duty for the GPUs. The point is that you don’t need a hugely fast and powerful processor to fulfill these requirements.

However, Pedraforca v2 will have some processing power in the form of Kepler-based Nvidia K20 GPUs that can deliver 1,170GFLOPS through a PCIe Gen 3 slot. The GPUs will be able to communicate with each other at 40Gbps via the aforementioned Mellanox-fuelled InfiniBand interconnect.
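
A back-of-the-envelope Python sketch shows what those two numbers imply when put side by side. It assumes, generously, that the full 40Gbps is usable payload bandwidth (real links lose some to encoding and protocol overhead) and that the GPU runs at its quoted peak; the function names are ours.

```python
def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert a link rate in gigabits per second to gigabytes per second."""
    return gbps / 8.0


def flops_per_byte_to_stay_busy(peak_gflops: float, link_gbytes_s: float) -> float:
    """Floating-point operations the GPU can perform in the time it takes
    one byte to arrive over the link, if all data is streamed in."""
    return peak_gflops / link_gbytes_s


link = gbps_to_gbytes_per_s(40.0)                   # quoted interconnect rate
balance = flops_per_byte_to_stay_busy(1170.0, link)  # quoted K20 peak

print(f"Link bandwidth: {link:.1f} GB/s")
print(f"Work needed per byte received: {balance:.0f} FLOPs")
```

The higher that ratio, the more number-crunching an app must do on each byte it pulls from a neighbouring GPU before the interconnect becomes the bottleneck.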

The chaps presenting this tech at GTC 2013 pointed out that this isn’t a general-purpose HPC system – it is intended as a host for apps that are GPU-optimised. While they didn’t discuss any FLOPS/watt estimates or performance predictions, it’s safe to say that Pedraforca v2 should be an eye-opener when it comes to energy efficiency and even cost per FLOP. It’s definitely a project worth watching. ®

* CUDA on ARM architecture.
