Can't wait for Nvidia? Try these Italian baby ARM clusters with GPU options

E4 Computer peddles a chip off of Mont Blanc

Nvidia has made no secret of wanting to be a player in the supercomputer racket on both the GPU and CPU sides of a hybrid system. The company launched "Project Denver" nearly two years ago to create an Nvidia-branded ARM processor, and the Denver chips are timed to come to market with the future "Maxwell" GPUs two years from now. But maybe you can't wait that long to get started on using a ceepie-geepie hybrid, and maybe you don't want to build your own machine from expensive CPU and GPU cards. That's where E4 Computer Engineering comes in.

Ten-year-old Italian cluster-maker E4 Computer - a big supplier of standard x86 clusters to the CERN lab in Switzerland among others - has partnered with SECO, the Italian firm that makes embedded x86 and ARM boards for various uses, to bring out baby ARM clusters with GPU options.

The Carma Microcluster and full-on Carma Cluster machines are based on Qseven embedded ARM processor boards that SECO creates for embedded customers, and you will recognize the Qseven ARM board if you follow the HPC market or build embedded systems. The ARM boards that are in the E4 Computer servers are similar to the ones that were originally used in the experimental "Mont Blanc" machine, which paired quad-core Tegra3 ARM-based chips with Nvidia mobile GPUs, at the Barcelona Supercomputing Center in Spain.

The Carma Microcluster rack and tower machines, so named because they run the CUDA parallel application development environment on ARM processors, were being shown off at the SC12 supercomputing conference in Salt Lake City last week, as were their Carma Cluster microserver variants.

The Carma cluster can be a 5U racker or a tower box

As E4 Computer correctly puts it, the current generation of quad-core Tegra3 processors from Nvidia and their 32-bit peers from other licensees of the ARM designs are somewhat challenged in the floating point department. But pairing an ARM processor with a GPU (essentially a modern-style, outboard math coprocessor like the ones Intel used to offer in a special socket for x86 CPUs before they were brought on-chip with the 80486DX and Pentium chips) does the trick nicely, as the Mont Blanc experimental machine demonstrates.

The second prototype system that BSC built was supposed to pair the Qseven card from SECO with a single Tegra3 processor, which has four Cortex-A9 cores running at 1.5GHz plus a fifth baby core for management. (This is the so-called "big.LITTLE" architecture that ARM is espousing to support different-sized workloads with a single chip.) That Tegra3 card has 4GB of memory and a Gigabit Ethernet port. An Nvidia GeForce 520MX GPU for laptops was implemented on a side board and linked to the Tegra3 board. That GPU is of the "Fermi" generation and has 48 cores. A Mont Blanc-2 1U server has eight of these Tegra3-GeForce 520MX combos in the box, and a rack holds 32 blades and 10 Gigabit Ethernet switches to deliver 38 teraflops of floating point oomph in a 5 kilowatt power envelope, or about 7.6 gigaflops per watt. That's almost three times better performance per watt than big CPU or CPU-GPU machines delivered on the latest Top500 supercomputer rankings.

The Mont Blanc project has subsequently chosen an integrated CPU-GPU device for its supercomputer. And that is the Exynos 5 processor from Samsung Electronics, which implements two Cortex-A15 cores on a die using 32 nanometer processes and running at 1.7GHz; the chip also has an ARM Mali-T604 GPU.

The Tegra3-Quadro 1000M ceepie-geepie hybrid board, made by SECO

The Carma machines from E4 Computer use the newer Quadro 1000M mobile graphics cards (PDF) from Nvidia, code-named "Huron River," which have 96 CUDA cores and which burn 45 watts. The Quadro 1000M delivers 270 gigaflops of single-precision (32-bit) floating point performance, which doesn't sound like much when you see what the Nvidia Tesla K10 GPU coprocessor can deliver in terms of single-precision oomph, but it is about 80 per cent more SP floating point performance than the GeForce 520MX had - and that is a nice jump.

The Tesla K10 is the single-precision monster at Nvidia, delivering 4.58 teraflops, or 20.3 gigaflops per watt peak on a 225 watt card, compared to 6 gigaflops per watt for the Quadro 1000M. But the Carma Microcluster is a development machine, not a performance beast, so this is about putting a baby cluster in a box that software can be created and tested on, and for a much lower cost than a Cadillac Xeon-Tesla setup might have. The Carma machines also let companies get ahead of the curve on ARM-based iron.
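The per-watt figures quoted here are simply peak throughput divided by board power. A minimal Python sketch of that arithmetic, using only the numbers above, might look like this (the names `GPUS` and `gflops_per_watt` are made up for illustration and do not come from Nvidia or E4 Computer):

```python
# Peak single-precision throughput (gigaflops) and board power (watts),
# as quoted in the article. These are peak figures, not measured results.
GPUS = {
    "Tesla K10": (4580.0, 225.0),
    "Quadro 1000M": (270.0, 45.0),
}


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Return peak gigaflops delivered per watt of board power."""
    return gflops / watts


# Efficiency of each card, keyed by the same names as GPUS.
efficiency = {name: gflops_per_watt(*spec) for name, spec in GPUS.items()}

for name, value in efficiency.items():
    print(f"{name}: {value:.1f} gigaflops per watt")
```

Run as-is, this prints roughly 20.4 for the Tesla K10 (the 20.3 quoted above is the same quotient truncated rather than rounded) and 6.0 for the Quadro 1000M.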

The E4 Carma Microcluster development cluster based on SECO ARM boards

The Carma Microcluster has one SECO Qseven board per blade in a chassis that is 5U in size; that chassis can hold up to eight blades. The chassis can be mounted in a rack or tipped on its side and used as a tower server, perhaps tucked in a closet or beside your desk. The Microcluster machine, ironically enough, has an internal x86 processor that is used for managing the blades in the chassis as well as for cross-compiling for both ARM and x86 processors from the same machine.

The Microcluster box has an aggregate of 2,160 gigaflops across those eight blades. With the x86 management node included, the whole machine draws 600 watts of power at system level, which works out to 3.6 gigaflops per watt at the system (rather than at the GPU) level.
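The gap between the 6 gigaflops per watt of the bare GPU and the 3.6 gigaflops per watt of the whole box can be worked through in a few lines. This is only a sketch built from the figures quoted in this story; the constant names are illustrative, and the 45 watt figure is the Quadro 1000M board power cited earlier:

```python
# Figures quoted in the article for the Carma Microcluster.
BLADES_PER_CHASSIS = 8      # one Tegra3-Quadro 1000M board per blade
GFLOPS_PER_GPU = 270.0      # peak single precision per Quadro 1000M
GPU_WATTS = 45.0            # Quadro 1000M board power
SYSTEM_WATTS = 600.0        # whole chassis, x86 management node included

# Aggregate peak throughput across all eight blades.
aggregate_gflops = BLADES_PER_CHASSIS * GFLOPS_PER_GPU

# Efficiency at the system level versus at the GPU level.
system_gflops_per_watt = aggregate_gflops / SYSTEM_WATTS
gpu_gflops_per_watt = GFLOPS_PER_GPU / GPU_WATTS

# Share of the chassis power budget that goes to the GPUs themselves.
gpu_power_share = (BLADES_PER_CHASSIS * GPU_WATTS) / SYSTEM_WATTS

print(f"aggregate: {aggregate_gflops:.0f} gigaflops")
print(f"system: {system_gflops_per_watt:.1f} gigaflops per watt")
print(f"GPU only: {gpu_gflops_per_watt:.1f} gigaflops per watt")
print(f"GPU share of system power: {gpu_power_share:.0%}")
```

On these numbers the eight GPUs account for 360 of the 600 watts, which is why the system-level figure comes in at 60 per cent of the GPU-level one.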

By the way, SECO does not play favorites. It has embedded boards with Freescale ARM processors and is ready to launch one with the latest OMAP ARM variant from Texas Instruments. And the company has been selling Qseven boards based on AMD's Fusion G Series APUs, which have on-die Radeon HD6200 series GPUs, as well as on Intel Atom E600 series processors.

Production Carma boxinis

If you're thinking of putting the Tegra-Quadro combo into production, E4 Computer has another machine that it thinks is more appropriate, and one that can be deployed either as an ARM-only setup for integer work or as a hybrid for ceepie-geepie jobs. The Carma Cluster is a microserver design that puts a dozen blades and two power supplies into a 3U rack enclosure. Each Carma2 blade has two Tegra3-Quadro 1000M boards on it, for a total of 24 Tegra3 processors and 24 Quadro GPUs per enclosure, yielding 6,480 gigaflops in that 3U enclosure. This machine is estimated to draw around 1,500 watts according to preliminary data from E4 Computer.

The Carma Cluster microserver chassis can do just ARMs, or ARMs plus GPUs

If you want to go CPU-only, to run web servers or do Hadoop Big Data munching, then there are blades known as Darma – possibly short for dual ARM servers – that put 48 Tegra3 chips (192 usable ARM cores) into a single chassis with four SECO cards on each blade. This Darma setup is estimated to draw 400 watts. You can mix and match the Carma2 and Darma microservers inside of the Carma Cluster chassis.
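Since the two blade types can be mixed in one chassis, it may help to tally what a given mix adds up to. The sketch below assumes only what this story states: twelve blade slots, two Tegra3-Quadro 1000M boards per Carma2 blade, four Tegra3 cards per Darma blade, four usable cores per Tegra3, and 270 gigaflops per Quadro 1000M. The function and class names are hypothetical, and power for mixed configurations is left out because E4 Computer has only given estimates for the all-Carma2 and all-Darma cases:

```python
from dataclasses import dataclass

SLOTS = 12                  # blades per 3U Carma Cluster enclosure
CORES_PER_TEGRA3 = 4        # usable Cortex-A9 cores per chip
GFLOPS_PER_QUADRO = 270.0   # peak single precision per Quadro 1000M
BOARDS_PER_CARMA2 = 2       # each board: one Tegra3 plus one Quadro 1000M
CARDS_PER_DARMA = 4         # each card: one Tegra3, no GPU


@dataclass(frozen=True)
class ChassisTotals:
    tegra3_chips: int
    arm_cores: int
    gpus: int
    gpu_gflops: float


def chassis_totals(carma2_blades: int, darma_blades: int) -> ChassisTotals:
    """Tally chips, cores, GPUs and peak GPU gigaflops for a blade mix."""
    if carma2_blades < 0 or darma_blades < 0:
        raise ValueError("blade counts cannot be negative")
    if carma2_blades + darma_blades > SLOTS:
        raise ValueError(f"a chassis holds at most {SLOTS} blades")
    gpus = carma2_blades * BOARDS_PER_CARMA2
    chips = gpus + darma_blades * CARDS_PER_DARMA
    return ChassisTotals(
        tegra3_chips=chips,
        arm_cores=chips * CORES_PER_TEGRA3,
        gpus=gpus,
        gpu_gflops=gpus * GFLOPS_PER_QUADRO,
    )


print(chassis_totals(12, 0))   # all Carma2 blades
print(chassis_totals(0, 12))   # all Darma blades
print(chassis_totals(6, 6))    # an even split
```

The first two calls reproduce the 24-GPU, 6,480 gigaflops Carma2 configuration and the 48-chip, 192-core Darma configuration described above; the even split is just one example of a mixed chassis.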

Clearly, getting a CPU and GPU on a single die, as Nvidia is planning to do with Project Denver, would be a much better option in terms of thermals and performance. But that is many years away, and machines such as those made by E4 Computer let you get started on the programming now, so that you will be ready then.

Pricing information was not available at press time for the Carma and Darma machines. ®
