Arm emits Cortex-A76 – its first 64-bit-only CPU core (in kernel mode)

Apps, 32 or 64-bit, will continue to run just fine as design biz looks to ditch baggage

Arm Cortex-A76

Arm's latest top-end processor core design – the Cortex-A76 unveiled today – is its first CPU to only run kernel-level code that is 64-bit, according to its inventors.

The processor will still support applications that use 32-bit A32 and T32 instruction sets, as well as 64-bit A64, meaning your apps and other programs will run as normal on this CPU. Code in a non-privileged level (specifically, EL0) will work as expected on this processor design.

But go down into the privileged tiers – the device's operating system kernel, drivers, hypervisor, and so on, running in EL1 and EL2 – and the Cortex-A76 will only support 64-bit code, we're told.

Linux, Android, Windows, and other operating systems built for this latest Cortex-A family member are being positioned, or are already positioned, to work within this 64-bit-only zone. The Cortex-As are licensed from Arm to chipmakers to power most of the world's smartphones and tablets. The Cortex-A76 is due to land in high-end phones and battery-friendly laptops by this time next year.

This move is part of Arm's unhurried stroll away from its years-old 32-bit instruction set. Supporting only 64-bit A64 in privileged modes simplifies the design of the CPU, freeing up space on the chip die for other features or potentially reducing its power draw.

Going fully 64-bit-only – for both apps and operating systems – is "the obvious next step," Peter Greenhalgh, Arm's veep of technology, told The Register, but he stressed this will "take a little longer."

"We will do it at some point," he added. "That will mean moving a lot of people."

There is so much software out there that still runs in 32-bit mode on Arm-compatible chips, particularly programs controlling embedded electronics for which product engineers may no longer have the source code, that Arm will keep its 32-bit A32 instruction set around for a long while yet.

Smaller Cortex-A cores, such as the A53, that pair up with beefy performance cores, such as the Cortex-A76 and A75, in smartphones and other devices will likely continue to support A32, Greenhalgh said.

Your apps are safe – but developers, particularly kernel-level coders, should be aware of the direction Arm is heading. Google, for one, has nudged Android app makers toward providing 64-bit builds of their software. Companies with Arm architectural licenses, such as Cavium with its ThunderX2, have also embraced a 64-bit-only A64 approach.

Culture change

Removing 32-bit code support, truth be told, doesn't just strip complexity from the CPU, leaving just the 64-bit Armv8-A aka A64 instruction set, a classic streamlined RISC ISA. It also liberates today's Arm staffers from the company's past – from the days of the late 1980s and early 1990s when Arm was part of Acorn, Britain's one-time answer to Apple.

Acorn decided to design its own processor from scratch, using RISC principles but with its own twist, and came up with the ARM1 and then the ARM2. The CPU team was later spun out to form what became the public company Arm, which went on to great things – from the ARM6 in the Apple Newton to the StrongARM with Digital to today's family of cores. Acorn didn't survive for much longer after that, fading away in 1998, but at least part of it lives on in the form of now-SoftBank-owned Arm.

The legacy 32-bit ISA has all sorts of features that are a headache for CPU engineers to implement, but are a joy for programmers. For one, the architecture applies conditional execution to a lot of instructions – each such instruction checks the status flags and is skipped if its condition isn't met. For example, the following A32 Arm code performs Euclid's Greatest Common Divisor algorithm in four instructions:

gcd     CMP      r0, r1
        SUBGT    r0, r0, r1
        SUBLT    r1, r1, r0
        BNE      gcd

The first subtraction (SUBGT) is performed if the value in r0 is greater than r1, or otherwise skipped. The next subtraction (SUBLT) is performed if r0 is less than r1, or similarly ignored. A64, on the other hand, prefers to do what other architectures do and largely confines condition checks to branches, plus a handful of conditional select and compare instructions.
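To see why both subtractions can share a single comparison, here is a minimal Python sketch of that loop. It is an illustration only, not Arm's code: the function name gcd_conditional and the flag variables are made up for this example, and it assumes positive integer inputs. The point it tries to show is that CMP sets the flags once, and SUBGT and SUBLT, which don't update the flags themselves, both test that same result.

```python
def gcd_conditional(r0: int, r1: int) -> int:
    """Model of the four-instruction A32 loop (hypothetical helper, positive inputs only)."""
    if r0 <= 0 or r1 <= 0:
        raise ValueError("inputs must be positive integers")
    while True:
        # CMP r0, r1: the flags are set once per pass through the loop
        greater = r0 > r1
        less = r0 < r1
        not_equal = r0 != r1
        # SUBGT r0, r0, r1: runs only if r0 was greater than r1
        if greater:
            r0 = r0 - r1
        # SUBLT r1, r1, r0: still tests the flags from CMP, not the new r0
        if less:
            r1 = r1 - r0
        # BNE gcd: loop again unless the two values compared equal
        if not not_equal:
            return r0


print(gcd_conditional(48, 18))  # prints 6
```

At most one of the two subtractions runs per pass, and no branch is needed to choose between them, which is what makes the A32 version so compact.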

Then there's loading and storing multiple registers at a time with one instruction, such as...

        LDMIA    r13, {r3-r7, pc}

...which restores r3 to r7 from the stack, and loads a return address into the program counter so execution jumps back to the caller. Nice and easy for the programmer, but a pain to unroll in hardware. These features are super useful for those crafting highly tuned assembly code, or writing compilers – and the thing is, not many people are left doing that these days. And the features get in the way of modern techniques, such as speculative execution, branch prediction, advanced prefetching, and so on.
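As a rough idea of what that one instruction asks the hardware to do, here is a small Python model of a load-multiple. It is a sketch under assumptions, not a description of any real core: the ldmia function, the list-of-registers representation, and the dictionary standing in for memory are all invented for illustration. It does follow the A32 rule that the lowest-numbered register is loaded from the lowest address.

```python
SP, PC = 13, 15  # register numbers for the stack pointer (r13) and program counter (r15)


def ldmia(regs: list, memory: dict, base: int, reglist: list, writeback: bool = False) -> None:
    """Load each listed register from consecutive words, starting at the base address."""
    addr = regs[base]
    # Lowest-numbered register comes from the lowest address, one word (4 bytes) each
    for reg in sorted(set(reglist)):
        regs[reg] = memory[addr]
        addr += 4
    # With writeback (written r13! in assembly) the base register is advanced too
    if writeback:
        regs[base] = addr


# Hypothetical stack contents: five saved values followed by a return address
regs = [0] * 16
regs[SP] = 0x8000
memory = {0x8000 + 4 * i: 100 + i for i in range(5)}
memory[0x8014] = 0x1234  # made-up return address

ldmia(regs, memory, SP, [3, 4, 5, 6, 7, PC])
print(regs[3], regs[7], hex(regs[PC]))  # prints 100 104 0x1234
```

That is six memory reads plus a change of control flow from a single instruction, which gives a feel for why it has to be broken into several internal steps.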

Speaking to senior Arm staff, your humble hack got the impression that 2010s Arm would like to rid itself of this baggage: to have a very clean slate, and to shed its quirky 32-bit mode much like 1990s Arm unceremoniously dumped its 26-bit mode that was a hangover from its early days. Yes, it stored its status flags in the upper bits of the program counter register, and the processor mode in the bottom two, leaving room for a 26-bit address for instruction fetches. Those days are long gone, too.
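For the curious, that 26-bit arrangement can be sketched in Python. The bit positions below come from the old 26-bit Arm architecture rather than from this article (flags N, Z, C, V, I, F in bits 31 to 26, a word-aligned address in bits 25 to 2, processor mode in bits 1 to 0), and the function names pack_r15 and unpack_r15 are invented for this example.

```python
FLAG_BITS = {"N": 31, "Z": 30, "C": 29, "V": 28, "I": 27, "F": 26}
ADDRESS_MASK = 0x03FFFFFC  # bits 25..2: word-aligned address within a 64MB space
MODE_MASK = 0x3            # bits 1..0: processor mode


def pack_r15(address: int, flags: set, mode: int) -> int:
    """Combine address, status flags, and mode into one 32-bit register value."""
    if address & ~ADDRESS_MASK:
        raise ValueError("address must be word-aligned and fit in 26 bits")
    if mode & ~MODE_MASK:
        raise ValueError("mode must fit in two bits")
    value = address | mode
    for name in flags:
        value |= 1 << FLAG_BITS[name]
    return value


def unpack_r15(value: int) -> tuple:
    """Split a packed register value back into (address, flags, mode)."""
    flags = {name for name, bit in FLAG_BITS.items() if value & (1 << bit)}
    return value & ADDRESS_MASK, flags, value & MODE_MASK


packed = pack_r15(0x8000, {"Z", "C"}, 3)
print(hex(packed))  # prints 0x60008003
```

One register did the work of two, at the cost of capping the address space at 64MB.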

It would also stick it to some Arm purists, and a few former engineers, who believe A64 isn't "real" Arm.

New features

So, apart from this small change, what else is under the hood of the Armv8.2-A Cortex-A76? It will, ideally, be married to multiple Cortex-A55 cores to form system-on-chips built using 12nm, 7nm, and 5nm processes, where and when available. Ten nanometer is skipped. One A76 core should be, Arm staff said, about a quarter the size of a comparable Intel Skylake core.

As usual, Arm claimed the A76 will be faster and more power efficient than its previous Cortex-As, but how much faster honestly depends on the workload and how it is implemented and interconnected in an SoC, so it would be best to wait for people to test actual working silicon.

For what it's worth, Arm reckoned a single 3GHz 7nm A76 core is up to 35 per cent faster than a 2.8GHz 10nm Cortex-A75, as found in Qualcomm's Snapdragon 845, when running mixed integer and floating-point math benchmarks, albeit in a simulator.
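Part of that 35 per cent comes from the higher clock speed alone. As a back-of-the-envelope check, and assuming (a big assumption) that performance scales linearly with frequency, the gain per clock cycle can be estimated like this. The function name per_clock_gain is made up for this sketch.

```python
def per_clock_gain(total_speedup: float, new_ghz: float, old_ghz: float) -> float:
    """Speedup left over once the clock-frequency ratio is divided out."""
    return total_speedup / (new_ghz / old_ghz)


# Arm's claimed figures: up to 1.35x overall, 3GHz A76 versus 2.8GHz A75
gain = per_clock_gain(1.35, 3.0, 2.8)
print(f"{(gain - 1) * 100:.0f} per cent per clock")  # prints 26 per cent per clock
```

By that rough measure, about 26 points of the claimed uplift would be down to the design and process rather than the extra 200MHz.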

Each A76 has 128KB of L1 cache (a 64KB four-way instruction cache, and a 64KB data cache with four-cycle load-use latency), 256KB or 512KB of L2 cache alongside a 1,280-entry five-way L2 TLB, and shares up to 4MB of L3. Here's its insides...

Internals of an Arm Cortex-A76 core (Source: Arm)

The front-end has a new and interesting mechanism: the branch predictor fetches 32 bytes from memory per cycle, running ahead of the instruction fetcher that consumes 16 bytes per cycle. The two operate separately. This allows the predictor to prime the core's caches way ahead of actual execution time with code that it reckons will be executed, and minimize bubbles in the pipeline during which the core can't do anything useful. Overall, this split method gives the A76 a lift over its predecessors.
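A toy model gives a feel for how far ahead the predictor can get. This is a sketch under heavy assumptions: the 32 and 16 bytes-per-cycle rates are from Arm's description, but the function predictor_lead and the queue_capacity limit are hypothetical, since the article doesn't say how much predicted-but-unfetched code the real core can track, and the model ignores mispredictions and cache misses entirely.

```python
def predictor_lead(cycles: int, predict_bytes: int = 32, fetch_bytes: int = 16,
                   queue_capacity=None) -> int:
    """Bytes of predicted code not yet consumed by the fetcher after some cycles."""
    lead = 0
    for _ in range(cycles):
        # The predictor runs ahead by its per-cycle rate...
        lead += predict_bytes
        # ...but can't get further ahead than the (hypothetical) queue allows
        if queue_capacity is not None:
            lead = min(lead, queue_capacity)
        # The fetcher then consumes what it can from that backlog
        lead -= min(lead, fetch_bytes)
    return lead


print(predictor_lead(4))                      # prints 64
print(predictor_lead(10, queue_capacity=64))  # prints 48
```

With no limit the lead grows by 16 bytes every cycle. With a limit it settles at a steady backlog, which is the slack available to hide a bubble.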

Arm is all in on speculative execution. It has bet the ranch on it. It is not going away. Its processors would be hobbled without it, even though the approach has burdened Arm – like Intel and AMD and others – with Spectre vulnerabilities in its products, flaws that malware can exploit to obtain sensitive stuff from memory via side-channel leaks.

The A76's designers were reluctant to reveal exactly how they are mitigating these security oversights in the A76 and future CPU blueprints. Essentially, we're told, the core will pay much more attention to the context in which code is running to prevent software from snooping on other programs via these side-channel holes.

Address space identifiers, assigned by the operating system to each application or process, will be compared during speculative accesses so that one app can't leak what it's up to from the way it touches the caches. Arm has an easier job of defending against Spectre because its cores are not multithreaded – there isn't a second hardware thread in a core to snoop on the first.

As for blueprints for CPUs already out in the wild: the Cortex-A72, A73, and A75 designs will be tweaked to help thwart Spectre variant two attacks, and A75 will be rejigged to stop Meltdown-style prying, we're told. Plus, of course, there are software-level mitigations that can be applied.

Here's how the decode-to-execute stages of the pipeline break down:

Internals of an Arm Cortex-A76 core


Each core has a 6,000-entry branch target buffer in its branch predictor, a 48-entry fully associative L1 instruction translation lookaside buffer, and can handle up to 68 load and 72 store instructions in flight at any one time, with up to 20 outstanding non-prefetch misses.

Overall, there have been hundreds of tweaks to the architecture to squeeze an extra tiny percent here, a tiny percent there, out of the hardware, all adding up to a collective boost. Engineers have stepped through instructions, some as many as 100 pipeline stages long, to ensure there are no wasted cycles, to speed up privilege level changes and thus speed up context switches, to reduce branch latencies and increase bandwidth, and so on.

Chips and devices – think premium smartphones and netbooks – using the A76 should be in production by the end of the year, and on sale by the first quarter of 2019. System-on-chip makers already have access to the designs, with test silicon in hand, we're told.

And finally

Meanwhile, Arm also today unveiled its Mali G76 premium graphics processor designs, and the V76 video wrangler, to go alongside the A76 compute cousin. Again, the G76 is said to be faster and less power hungry than its predecessors: it has double the number of execution lanes of the G72, performing eight 32-bit floating-point vector math operations every three clock cycles. It also gains 8-bit integer dot-product inference, which is useful for machine learning.

The G76 therefore has eight lanes per processing engine, three engines per core, and up to 20 cores per GPU, for a total of 480 lanes versus the G72's 384, which means it gives you more bang for your buck. The texture mapper also has twice the throughput of the G72's, Arm said.
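The lane arithmetic is easy to check. A quick sketch follows, with one labeled assumption: the G72 is taken to have three engines per core like the G76, and four lanes per engine (half the G76's, per Arm's "double" claim). On those numbers the G72's 384 lanes imply a 32-core configuration, so the G76 gets more lanes out of fewer cores. The function name total_lanes is invented for this example.

```python
def total_lanes(lanes_per_engine: int, engines_per_core: int, cores: int) -> int:
    """Total execution lanes across the whole GPU."""
    return lanes_per_engine * engines_per_core * cores


g76_lanes = total_lanes(8, 3, 20)
# Assumption: G72 has 4 lanes per engine and 3 engines per core
g72_cores_implied = 384 // (4 * 3)

print(g76_lanes)            # prints 480
print(g72_cores_implied)    # prints 32
print(g76_lanes / 384)      # prints 1.25
```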

The V76 can decode 8K video at 60 frames per second, we're told, or if necessary drive a video wall of four-by-four screens at 1080p, or two-by-two at 2160p, at 60 FPS. Think of this stuff for smart TVs as well as tablets and other handheld gadgets. It can decode HEVC, VP9, VP8, H.264, AVS+, and other formats. ®


Biting the hand that feeds IT © 1998–2018