Little ARMs pump 2,048-bit muscles in training for Fujitsu's Post-K exascale mega-brain

NEON is so 80s, er, 90s, no, 2000s, wait, 2010s

Hot Chips ARM is bolting an extra data-crunching engine onto its 64-bit processor architecture to get it ready for Fujitsu's Post-K exascale supercomputer.

Specifically, ARM is adding a Scalable Vector Extension (SVE) to its ARMv8-A core architecture. SVE can handle vectors from 128 to 2,048 bits in length. This technology is not an extension to NEON, which is ARM's stock SIMD vector processing unit; we're told SVE will be separate.

Processors featuring 64-bit ARMv8-A cores with SVE will power Fujitsu's Post-K machine, which is due to go live in 2020 and crunch roughly 1,000 peta-FLOPS or a quintillion floating-point math calculations a second. It is set to be the world's fastest known supercomputer by the time it's fully switched on. Surprisingly, it will be powered by the ARM architecture, which is the brains in nearly all smartphones, tablets, portable gadgets, embedded systems and so on.

SVE is a SIMD feature: it allows the CPU to run the same calculation on multiple elements of data at a time. Its sister NEON [chapter 7, PDF] operates on 64 and 128-bit-wide vectors that can each contain 8, 16, 32 or 64-bit elements. For example, the NEON instruction...

vadd.i32 q1, q2, q3

...adds four 32-bit integer elements in the 128-bit register q2 to the corresponding four elements in 128-bit register q3 and stores the resulting array in q1. It's the equivalent of doing in C...

for(i = 0; i < 4; i++) a[i] = b[i] + c[i];

...where q2 stores the integer array b[] and q3 stores c[]. This can be used to, say, increase the brightness of an image, by running through the pixels in 128-bit blocks and increasing their value. NEON works with integer and floating-point values, and has all sorts of other tricks up its sleeve, such as rearranging the elements in a vector to, say, split the left and right audio channels from a data stream. It's designed for rapid multimedia processing, which you'd expect in phones, displays and other gizmos.
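To make the brightness example concrete, here is a minimal sketch in plain C. The function name brighten and the 8-bit greyscale pixel format are assumptions made for illustration, not anything from ARM's documentation. The clamp at 255 stops bright pixels wrapping around to black. With 8-bit pixels, a 128-bit NEON register has room for 16 of them, so a vectorizing compiler may turn this loop into 16-pixels-at-a-time code, though whether it does depends on the compiler and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Brighten 8-bit greyscale pixels by `delta`, clamping at 255.
 * Hypothetical helper for illustration; not from any library. */
void brighten(uint8_t *pixels, size_t n, uint8_t delta)
{
    assert(pixels != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* Widen before adding so the sum cannot wrap around. */
        unsigned int v = (unsigned int)pixels[i] + delta;
        pixels[i] = (uint8_t)(v > 255u ? 255u : v);  /* saturating add */
    }
}
```

Calling brighten(image, width * height, 40) would lift every pixel by 40 levels, with anything already above 215 pinned at pure white.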

SVE works in a similar fashion in that it operates on large arrays of data at once, processing up to 2,048 bits per vector per instruction – 16 times more information per vector per instruction than NEON.

To get ARMv8-A ready for high-performance computing, ARM added SVE to its CPU core blueprints so they can handle supercomputer workloads where you definitely don't want to be shuttling through data just 128 bits at a time – you want processing done in the largest possible blocks. SVE allows ARM's high-end cores to cope better with demanding simulation and modeling applications that boffins want to run on their big iron.

ARM engineers are a few weeks away from submitting patches to the GCC and LLVM teams to support auto-vectorization with SVE, which means that software built by these open-source compilers can automatically generate instructions that take advantage of long vectors without developers having to customize their apps.

And once a program has been built for SVE, it will run comfortably on any SVE-capable processor without recompilation, whether the CPU has support for 512, 1,024 or the full 2,048 bits. The SVE unit can automatically break a 2,048-bit vector into, say, four 512-bit vectors if its silicon implementation doesn't support the full length.
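One rough way to picture code that doesn't care about the hardware's vector length is the C sketch below. It is a mental model only, not ARM's implementation: the function vla_add and its vl_bits parameter are hypothetical, with vl_bits standing in for whatever width a particular chip provides. The outer loop takes one vector-sized bite per pass, and the inner if plays the part of switching off lanes that would run past the end of the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add b[] and c[] into a[], one "hardware vector" at a time.
 * vl_bits models the vector length a given chip implements
 * (128 to 2,048 bits). It is a parameter here purely for
 * illustration; real SVE code would not pass it around like this. */
void vla_add(uint32_t *a, const uint32_t *b, const uint32_t *c,
             size_t n, unsigned vl_bits)
{
    assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);

    size_t lanes = vl_bits / 32;  /* 32-bit elements per vector */

    for (size_t i = 0; i < n; i += lanes) {
        /* Each pass of this inner loop models ONE vector instruction.
         * Lanes beyond the end of the array are simply skipped. */
        for (size_t lane = 0; lane < lanes; lane++) {
            if (i + lane < n)
                a[i + lane] = b[i + lane] + c[i + lane];
        }
    }
}
```

The same source gives the same answers whether vl_bits is 128, 512 or 2,048; only the number of trips around the outer loop changes. For a 10-element array, a 128-bit machine needs three passes and a 2,048-bit machine needs one.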

Auto-vectorization is present in many compilers, including proprietary packages from HPC software vendors, ARM and Intel as well as GCC and LLVM. It works by automagically identifying loops that work through arrays and only break when they reach a known limit. Rather than unrolling these loops or doing lots of loads and stores, the compiler emits SIMD instructions that suck up, process and commit data in blocks, which streamlines the operation.
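To illustrate the kind of loop a vectorizer likes and the kinds it tends to balk at, here are three small C functions. The names are made up for this sketch, and how any particular compiler version treats each one is an assumption: check your compiler's vectorization report rather than taking this as gospel.

```c
#include <assert.h>
#include <stddef.h>

/* Good candidate: the trip count n is known before the loop starts and
 * every iteration is independent. `restrict` promises the compiler that
 * the two arrays do not overlap. */
void scale(float *restrict out, const float *restrict in, float k, size_t n)
{
    assert(n == 0 || (out != NULL && in != NULL));

    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * k;
}

/* Harder: each iteration needs the result of the previous one, so the
 * elements cannot simply be processed side by side. */
void running_sum(float *x, size_t n)
{
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];
}

/* Harder: the loop bails out on a data-dependent condition, so the
 * trip count is not known up front. */
size_t find_first_zero(const int *x, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (x[i] == 0)
            break;
    return i;  /* equals n if no zero was found */
}
```

The first function matches the pattern described above: a walk through arrays that stops at a known limit. The other two break it in different ways, one with a dependency between iterations and one with an early exit.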

On ARM at least, SIMD instructions run alongside normal instructions, increasing parallelization.

As an example of how auto-vectorization can be used without having to change any source code, take a C function like this...

/* SIZE is assumed to be a compile-time constant defined elsewhere */
void vectorize_this(unsigned int *a, unsigned int *b, unsigned int *c)
{
  unsigned int i;
  for(i = 0; i < SIZE; i++)
    a[i] = b[i] + c[i];
}

...and gcc -o loop -O3 loop.c will compile it into something like this for a 32-bit ARM device:

104cc: ldr.w   r3, [r4, #4]!
104d0: ldr.w   r1, [r2, #4]!
104d4: cmp     r4, r5
104d6: add     r3, r1
104d8: str.w   r3, [r0, #4]!
104dc: bne.n   104cc <vectorize_this+0xc>

That's lots of individual 32-bit integer loads and stores, with the arrays b[] and c[] pointed to by r4 and r2, and the result stored at the address pointed to by r0. Compile it with gcc -o loop -O3 -mfpu=neon -ftree-vectorize loop.c and auto-vectorization will kick in, so you'll get something like:

10780: vld1.64   {d18-d19}, [r5 :64]
10784: adds      r6, #1
10786: cmp       r6, r7
10788: add.w     r5, r5, #16
1078c: vld1.32   {d16-d17}, [r4]
10790: vadd.i32  q8, q8, q9
10794: add.w     r4, r4, #16
10798: vst1.32   {d16-d17}, [r3]
1079c: add.w     r3, r3, #16
107a0: bcc.n     10780 <vectorize_this+0x70>

This uses the ARM NEON vld1 instruction to load the q8 and q9 SIMD registers with data from arrays pointed to by r5 and r4, vadd to add them together, four 32-bit integers at a time, and vst1 to store the resulting array in memory at the address in r3. It's slightly more code but, crucially, more work done and thus more data crunched per loop iteration.
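Written back out in C, the vectorized loop behaves roughly like the sketch below. The function add_by_fours is hypothetical, and the clean-up loop for leftover elements is an assumption about what a compiler would do when the array length isn't a multiple of four, since the listing above shows only the main loop. It also assumes unsigned int is 32 bits wide, as it is on the ARM target discussed here.

```c
#include <assert.h>
#include <stddef.h>

/* Roughly what the NEON loop does, spelled out in C: the main loop
 * handles four integers per pass (the job of one vadd.i32), and a
 * scalar loop mops up any leftover elements. */
void add_by_fours(unsigned int *a, const unsigned int *b,
                  const unsigned int *c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));

    size_t i = 0;

    /* Main loop: one pass stands in for one load-add-store of a
     * 128-bit vector holding four 32-bit lanes. */
    for (; i + 4 <= n; i += 4) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }

    /* Tail: fewer than four elements remain. */
    for (; i < n; i++)
        a[i] = b[i] + c[i];
}
```

For a seven-element array, the main loop runs once and the tail handles the last three elements one at a time.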

Now imagine this with 2,048-bit vectors and you're looking at SVE, which has its own instruction set separate from NEON. On SVE, software can control the length of the vectors, up to the maximum, and both floating-point and integer values are supported.

Fujitsu's Post-K beast will replace Japan's K Computer, a 10.5-PFLOPS 12MW goliath that's built out of 705,000 Sparc64 VIIIfx cores and is the world's fifth fastest known supercomputer today. The Sparc64 VIIIfx in the K machine provides SIMD operations as part of HPC-ACE [PDF], which the ARM cores in the Post-K will have to at least match. Basically, ARM had to bolt on some extra oomph to its CPU cores' vector processing to satisfy Fujitsu's requirements for the Post-K.

The dream for ARM, Fujitsu and Japan's boffins is being able to recompile scientific applications written for the K and other supers so they can run on the ARMv8-A-with-SVE Post-K, and let the auto-vectorization harness the benefits of long vectors without having to rework chunks of code and do lots of manual optimization.

ARM staff are due to reveal more technical and performance details of SVE at the Hot Chips 2016 conference in Cupertino, California, today. We'll be there to get the latest info and fill you in. ®
