Calm a CARMA drama chameleon: Barça super waves ARMs, GPUs

Cluster perks up with low-power CPUs bossing graphics chips

GTC 2013 Over the last few years, we’ve seen a steadily growing buzz surrounding the use of ARM processors in PCs, servers and supercomputers.

Here, at this year's GPU Technology Conference in California, that buzz is even more pronounced. This is due to Nvidia's upcoming 64-bit "Project Denver" ARM cores, and to advances in its graphics chips that make machines even less dependent on a fast and powerful (read: Intel Xeon) processor feeding data to number-crunching GPU beasts. El Reg's Rik Myslewski penned a great article on the GTC 2013 ARM chatter.

While everyone has been debating and speculating about what it would be like to combine Brit-designed ARM cores and GPU accelerators, one organisation has put together some hardware in order to separate the theoretical from the real. The Barcelona Supercomputer Centre (that's Barcelona in Spain, not the other one) is building clusters to explore the potential advantages that might arise from combining nippy ARM-compatible chips with fast number-crunching GPUs.

The centre's first attempt, the Tibadabo, was a proof-of-concept system to determine whether it’s possible to build an all-ARM-based cluster. Could they really put together a cluster based on a low-power processor family that's ideally suited to mobile phones, hard drives and handheld games? And, if they could build it, could they find or adapt enough software for it to do useful work?

They built a two-rack cluster containing 32 blades, 256 nodes, and a total of 512 Tegra 2 ARM cores, and ported 11 scientific apps over to ARM's architecture with little difficulty, although they did need to fiddle around with the memory hierarchy to optimise some of the apps.

The performance wasn’t all that great. The total system turned out 512 billion floating-point calculations a second (512GFLOPS) while consuming 3.4kW, yielding 0.15GFLOPS/watt. For context, the best systems on the most recent Green500 list - the top 500 supercomputers ranked by energy efficiency - come in around 2.4 or 2.5GFLOPS/watt; the systems at the end of the list are rated at 0.033GFLOPS/watt.
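For anyone who wants to check the sums, here is a minimal Python sketch of the efficiency arithmetic. It uses only the figures quoted above; the function and variable names are our own inventions for illustration, not anything the Barcelona team published.

```python
# Energy efficiency = delivered throughput / power draw.
# All figures come from the article; the names are illustrative.

def gflops_per_watt(gflops: float, watts: float) -> float:
    """Return GFLOPS delivered per watt consumed."""
    if watts <= 0:
        raise ValueError("power draw must be positive")
    return gflops / watts

tibadabo = gflops_per_watt(512.0, 3400.0)  # 512GFLOPS at 3.4kW
green500_best = 2.5    # GFLOPS/watt, top of the list as quoted
green500_last = 0.033  # GFLOPS/watt, bottom of the list as quoted

gap_to_best = green500_best / tibadabo
lead_over_last = tibadabo / green500_last

print(f"Tibadabo: {tibadabo:.2f} GFLOPS/watt")
print(f"Best Green500 systems are about {gap_to_best:.1f}x more efficient")
print(f"Tibadabo is about {lead_over_last:.1f}x more efficient than the last entry")
```

On these numbers the all-ARM cluster sits far closer to the bottom of the list than to the top, which is the gap the next two systems set out to close.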

What's the world CARMA to?

So the Spanish brainiacs went back to the drawing board and clustered 16 Nvidia CARMA* development boxes as a learning experience they called Pedraforca v1. This system did much better than the ARM-only Tibadabo on energy efficiency, yielding 0.78GFLOPS/watt while running the DGEMM matrix-multiplication benchmark, so they were making progress.

Limitations in the platform (such as the max speed of 400MB/s over the PCIe bus plus an inability to overlap computation and data transfers) meant it couldn’t scale up very well. However, it did lead them to a new breakthrough in their thinking for their next system, which they’ve dubbed Pedraforca v2.
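To see why the lack of overlap hurts, consider a rough back-of-the-envelope model in Python. Only the 400MB/s link speed comes from the article. The workload (200MB of data and half a second of GPU work per step) is a made-up assumption, and treating perfect overlap as simply the slower of the two phases is a simplification.

```python
# Rough model of one work step: move data over the bus, then compute on it.
# Without overlap the two phases run back to back; with ideal overlap the
# step takes only as long as the slower phase.

def step_time(bytes_moved: float, link_bytes_per_s: float,
              compute_s: float, overlap: bool) -> float:
    """Return the seconds needed for one transfer-plus-compute step."""
    transfer_s = bytes_moved / link_bytes_per_s
    if overlap:
        return max(transfer_s, compute_s)
    return transfer_s + compute_s

LINK_BYTES_PER_S = 400e6   # 400MB/s PCIe ceiling quoted for Pedraforca v1
DATA_BYTES = 200e6         # hypothetical: 200MB moved per step
COMPUTE_S = 0.5            # hypothetical: 0.5s of GPU work per step

serial = step_time(DATA_BYTES, LINK_BYTES_PER_S, COMPUTE_S, overlap=False)
overlapped = step_time(DATA_BYTES, LINK_BYTES_PER_S, COMPUTE_S, overlap=True)

print(f"No overlap:   {serial:.2f}s per step")
print(f"With overlap: {overlapped:.2f}s per step")
```

With these assumed numbers the GPU spends half of every step waiting for the bus, and adding more nodes does nothing to shrink that wait.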

They’ve decided the key to building a highly efficient system isn’t to erect an accelerated cluster but to build a cluster of accelerators. While there isn’t much difference in the words, there’s a world of difference between the meanings. For Pedraforca v2, they will decouple the CPUs from the GPUs, meaning that the ratio of general-purpose cores to graphics processor cores can be changed to fit the workloads. They will also use direct GPU-GPU data transfers via Mellanox’s ConnectX-3 InfiniBand interconnects.

This will take a huge amount of latency out of the system and, accordingly, reduce the amount of work the CPU needs to do to orchestrate GPU communications. The prototype system will have 64 nodes, each sporting a quad-core Tegra 3 CPU at 1.3GHz that will slide into an x4 PCIe slot on a Mini-ITX carrier. In this configuration, the CPU will only be managing boot and MPI communications, plus minimal traffic cop duty for the GPUs. The point is that you don’t need a hugely fast and powerful processor to fulfil these requirements.

However, Pedraforca v2 will have some processing power in the form of Kepler-based Nvidia K20 GPUs that can deliver 1,170GFLOPS through a PCIe Gen 3 slot. The GPUs will be able to communicate with each other at 40Gbps via the aforementioned Mellanox-fuelled InfiniBand interconnect.
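For a sense of scale, the short Python sketch below converts the quoted 40Gbps into bytes per second and sets it beside the 400MB/s PCIe ceiling of Pedraforca v1. Two caveats apply. The 40Gbps is taken as a raw line rate with no allowance for encoding or protocol overhead, so real throughput will be lower. And the two figures describe different hops in different machines, so the ratio is indicative only.

```python
# Convert the quoted InfiniBand rate to bytes per second and compare it
# with the v1 PCIe figure. Raw line rates only; overheads are ignored.

BITS_PER_BYTE = 8

def gbps_to_bytes_per_s(gbps: float) -> float:
    """Convert gigabits per second (decimal) to bytes per second."""
    return gbps * 1e9 / BITS_PER_BYTE

v2_link = gbps_to_bytes_per_s(40.0)  # 40Gbps quoted for Pedraforca v2
v1_link = 400e6                      # 400MB/s quoted for Pedraforca v1

ratio = v2_link / v1_link

print(f"v2 interconnect: {v2_link / 1e9:.1f}GB/s raw")
print(f"v1 PCIe ceiling: {v1_link / 1e6:.0f}MB/s")
print(f"Raw ratio: {ratio:.1f}x")
```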

The chaps presenting this tech at GTC 2013 pointed out that this isn’t a general-purpose HPC system – it is intended as a host for apps that are GPU-optimised. While they didn’t discuss any FLOPS/watt estimates or performance predictions, it’s safe to say that Pedraforca v2 should be an eye opener when it comes to energy efficiency and even cost per FLOP. It’s definitely a project worth watching. ®

* CUDA on ARM architecture.
