Calm a CARMA drama chameleon: Barça super waves ARMs, GPUs

Cluster perks up with low-power CPUs bossing graphics chips

GTC 2013 Over the last few years, we’ve seen a steadily growing buzz surrounding the use of ARM processors in PCs, servers and supercomputers.

Here, at this year's GPU Technology Conference in California, that buzz is even more pronounced. This is due to Nvidia's upcoming 64-bit "Project Denver" ARM cores, and advances in its graphics chips to make machines even less dependent on a fast and powerful (read: Intel Xeon) processor feeding data to number-crunching GPU beasts. El Reg's Rik Myslewski penned a great article on the GTC 2013 ARM chatter.

While everyone has been debating and speculating about what it would be like to combine Brit-designed ARM cores and GPU accelerators, one organisation has put together some hardware in order to separate the theoretical from the real. The Barcelona Supercomputing Centre (that's Barcelona in Spain, not the other one) is building clusters to explore the potential advantages of combining nippy ARM-compatible chips with fast number-crunching GPUs.

The centre's first attempt, the Tibidabo, was a proof-of-concept system to determine whether it’s possible to build an all-ARM-based cluster. Could they really put together a cluster based on a low-power processor family that's ideally suited to mobile phones, hard drives and handheld games? And, if they could build it, could they find or adapt enough software for it to do useful work?

They did construct a two-rack cluster containing 32 blades, 256 nodes, and a total of 512 Tegra 2 ARM cores. They also ported 11 scientific apps over to ARM's architecture with little difficulty, although they did need to fiddle around with the memory hierarchy to optimise some of the apps.

The performance wasn’t all that great. The total system turned out 512 billion floating-point calculations a second (512GFLOPS) while consuming 3.4kW, yielding 0.15GFLOPS/watt. For context, the best systems on the most recent Green-500 list - the top 500 supercomputers ranked by energy efficiency - come in around 2.4 or 2.5GFLOPS/watt; the systems at the end of the list are rated at 0.033GFLOPS/watt.
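For readers who want to check the sums, here is a minimal Python sketch of the efficiency arithmetic quoted above. The function and variable names are ours, purely for illustration; the figures are the ones in the article, with 2.5GFLOPS/watt taken as the top of the quoted Green-500 range.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency: sustained GFLOPS divided by power draw in watts."""
    if watts <= 0:
        raise ValueError("power draw must be positive")
    return gflops / watts


# Figures quoted in the article for the all-ARM cluster.
arm_cluster = gflops_per_watt(gflops=512.0, watts=3400.0)

# Green-500 reference points quoted in the article (GFLOPS/watt).
GREEN500_BEST = 2.5
GREEN500_WORST = 0.033

print(f"All-ARM cluster: {arm_cluster:.2f} GFLOPS/watt")
print(f"Gap to the list leaders: {GREEN500_BEST / arm_cluster:.1f}x")
print(f"Margin over the list tail: {arm_cluster / GREEN500_WORST:.1f}x")
```

On these numbers the all-ARM box sits much closer to the bottom of the list than to the top, which is the article's point.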

What's the world CARMA to?

So the Spanish brainiacs went back to the drawing board and clustered 16 Nvidia CARMA* development boxes as a learning experience they called Pedraforca v1. This system did much better than the ARM-only Tibidabo on energy efficiency, yielding 0.78GFLOPS/watt while running the DGEMM matrix-multiplication benchmark, so they were making progress.

Limitations in the platform (such as the max speed of 400MB/s over the PCIe bus plus an inability to overlap computation and data transfers) meant it couldn’t scale up very well. However, it did lead them to a new breakthrough in their thinking for their next system, which they’ve dubbed Pedraforca v2.

They’ve decided the key to building a highly efficient system isn’t to erect an accelerated cluster but to build a cluster of accelerators. While there isn’t much difference in the words, there’s a world of difference between the meanings. For Pedraforca v2, they will decouple the CPUs from the GPUs, meaning that the ratio of general-purpose cores to graphics processor cores can be changed to fit the workloads. They will also use direct GPU-GPU data transfers via Mellanox’s ConnectX-3 InfiniBand interconnects.

This will take a huge amount of latency out of the system and, accordingly, reduce the amount of work the CPU needs to do to orchestrate GPU communications. The prototype system will have 64 nodes, each sporting a quad-core Tegra 3 CPU at 1.3GHz that will slide into a 4x PCIe slot on a Mini-ITX carrier. In this configuration, the CPU will only be managing boot and MPI communications, plus minimal traffic cop duty for the GPUs. The point is that you don’t need a hugely fast and powerful processor to fulfill these requirements.

However, Pedraforca v2 will have some processing power in the form of Kepler-based Nvidia K20 GPUs, each of which can deliver 1,170GFLOPS and plugs into a PCIe Gen 3 slot. The GPUs will be able to communicate with each other at 40Gbps via the aforementioned Mellanox-fuelled InfiniBand interconnect.
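To put the two link speeds quoted in this piece side by side, here is a rough Python sketch comparing how long a buffer takes to move at Pedraforca v1's 400MB/s PCIe ceiling versus the 40Gbps interconnect planned for v2. The 1GB payload is a hypothetical of ours, and the 40Gbps figure is treated as a raw line rate with no allowance for encoding or protocol overhead, so real-world transfers will be slower than the second number suggests.

```python
def transfer_seconds(num_bytes: float, bytes_per_second: float) -> float:
    """Idealised transfer time: payload size divided by link bandwidth."""
    if bytes_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / bytes_per_second


# Link speeds quoted in the article.
V1_PCIE_BYTES_PER_S = 400e6        # 400MB/s PCIe ceiling on Pedraforca v1
V2_LINK_BYTES_PER_S = 40e9 / 8     # 40Gbps raw line rate, converted to bytes/s

# Hypothetical payload for illustration only (not from the article).
PAYLOAD_BYTES = 1e9                # 1GB

slow = transfer_seconds(PAYLOAD_BYTES, V1_PCIE_BYTES_PER_S)
fast = transfer_seconds(PAYLOAD_BYTES, V2_LINK_BYTES_PER_S)

print(f"v1 PCIe link:    {slow:.2f} s per GB")
print(f"v2 interconnect: {fast:.2f} s per GB")
print(f"Idealised speed-up: {slow / fast:.1f}x")
```

Even as a back-of-the-envelope figure, the gap shows why the v1 bus was a scaling bottleneck once GPUs had to wait on data.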

The chaps presenting this tech at GTC 2013 pointed out that this isn’t a general-purpose HPC system – it is intended as a host for apps that are GPU-optimised. While they didn’t discuss any FLOPS/watt estimates or performance predictions, it’s safe to say that Pedraforca v2 should be an eye opener when it comes to energy efficiency and even cost per FLOP. It’s definitely a project worth watching. ®

* CUDA on ARM architecture.
