ARM's new Cortex-A35: How to fine-tune a CPU for web browsing on bargain smartphones

Proof that Brits are experts on forming queues

ARM, designer of smartphone brains, will today reveal the Cortex-A35: a processor core subtly tweaked to run mobile web browsers and similar apps faster.

How can a CPU be tuned for something seemingly so specific? The answer lies in the A35's instruction prefetch queue. ARM has halved the length of this queue in an attempt to balance memory access bandwidth against faster processing. It's not a major breakthrough in computer science, but an interesting change, nonetheless.

Modern processors have an instruction pipeline that is constantly primed with program code for the CPU to execute. The longer the pipeline, the more instructions the core can have in flight, each at a different stage of completion, at the same time.

There's one little problem with this otherwise neat design: the processor can burn through its pipeline faster than your RAM can give it instructions. In other words, there needs to be a stockpile of code to satisfy the CPU's hunger.

The answer to this headache is an instruction prefetch queue: a buffer of program code that's automatically filled by the core's prefetch unit during idle moments, providing instructions for the execution pipeline to gobble up.

The Cortex CPU is either accessing data in memory or filling its prefetch buffer: it can't do both at the same time. So for the new A35, it's understood ARM has shortened this queue relative to its Cortex cousins, so the processor can spend more time accessing stuff in memory. We understand the queue is half the length of those in the Cortex-A7 and A5.
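To make that trade-off concrete, here is a toy model in Python. It is emphatically not ARM's fetch unit: the function name `simulate`, the two-instructions-per-fetch width, the taken branch every eight instructions, and the rule that a taken branch throws away whatever is left in the queue are all assumptions invented for illustration. It simply shares one memory port between prefetch and data access, and counts who gets it.

```python
# Toy model only: NOT ARM's actual design. Every name and number below is an
# illustrative assumption, not something taken from ARM documentation.

def simulate(queue_len, cycles=8000, fetch_width=2, branch_every=8):
    """Share a single memory port between instruction prefetch and data access."""
    queue = 0          # instructions currently waiting in the prefetch queue
    since_branch = 0   # instructions executed since the last taken branch
    stats = {"executed": 0, "fetch_cycles": 0, "data_cycles": 0,
             "stall_cycles": 0, "discarded": 0}

    for _ in range(cycles):
        # Memory port: prefetch wins whenever a whole fetch still fits in the
        # queue; otherwise the port is free for a data access this cycle.
        if queue + fetch_width <= queue_len:
            queue += fetch_width
            stats["fetch_cycles"] += 1
        else:
            stats["data_cycles"] += 1

        # Core: execute one instruction per cycle if the queue has any.
        if queue > 0:
            queue -= 1
            stats["executed"] += 1
            since_branch += 1
            if since_branch == branch_every:
                # Assumed taken branch: the rest of the queue is the wrong
                # code path, so it is thrown away and refetched.
                stats["discarded"] += queue
                queue = 0
                since_branch = 0
        else:
            stats["stall_cycles"] += 1

    return stats


if __name__ == "__main__":
    for length in (8, 4):
        print(f"queue length {length}: {simulate(length)}")
```

With these made-up parameters, halving the queue from eight entries to four leaves three in every eight port cycles free for data instead of one in eight, while the core executes the same number of instructions. The model ignores RAM latency, which is the whole reason the queue exists, so it only shows one side of the balance ARM is striking.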

Back to the drawing board ... ARM's redesigned instruction fetch unit in the Cortex-A35

This balancing of bandwidth apparently suits the software in web browsers and similar apps, which touch lots of data – images, HTML, etc – stored in RAM all the time. The branch prediction machinery – the mechanism that tries to keep the processor fed with upcoming instructions as it runs through the twisting and turning maze of program code – has also been improved to reduce stalls within the execution pipeline; that's another bonus for apps, it's claimed.

Faster surfing ... ARM claims the changes to its instruction queue design make the A35 nearly twice as fast as the A7 when running web browser apps

The instruction fetch unit's redesign was requested by smartphone processor chip makers, who license ARM's blueprints to build the brains of hundreds of millions of phones and tablets each year.

The A35, to be unveiled today at the ARM TechCon conference in California, is a 64-bit ARMv8-A follow-up to the 32-bit ARMv7-A Cortex-A7, which we first saw in 2011. The modest but capable A7 is used in the Qualcomm Snapdragon 200 and 400 family of chips for smartphones, and the Raspberry Pi 2, among other things. Like the A7, the A35 has an in-order eight-stage pipeline with limited dual-issue capability.

It's essentially pitched at entry-level to mid-range handhelds, and embedded gear that needs some extra oomph: an A7 with a 64-bit mode, and extra kick, if you will. If you want real raw power, and you're a smartphone chip maker, you'll plump for a set of beefy Cortex-A53, A57 or A72 cores.

The A35 borrows some bits and pieces from the A50-series: things like multi-stream data prefetching, and load-acquire and store-release instructions. It also has power control improvements over the A7; ARM claims the A35 is smaller and more energy efficient than its predecessor.

A single A35 core with an 8KB L1 cache, the most barebones configuration, takes up less than 0.4mm² on a silicon die using a 28nm process. A single core clocked at 1GHz with all features enabled will consume about 90mW, we're told; the smallest possible core running at 100MHz will consume roughly 6mW.
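For a rough feel of what those power figures mean, dividing power by clock speed gives the energy spent per clock cycle. The snippet below is back-of-the-envelope arithmetic on the two numbers quoted above, nothing more; the helper name `energy_per_cycle_pj` is made up for this sketch, and the two figures describe different core configurations, so they are not a like-for-like comparison.

```python
def energy_per_cycle_pj(power_mw, clock_mhz):
    """Energy per clock cycle in picojoules.

    mW divided by MHz gives nanojoules per cycle; multiply by 1,000 for pJ.
    """
    return power_mw / clock_mhz * 1000.0


if __name__ == "__main__":
    # Figures as quoted in the article; the configurations differ.
    full = energy_per_cycle_pj(power_mw=90, clock_mhz=1000)     # all features, 1GHz
    smallest = energy_per_cycle_pj(power_mw=6, clock_mhz=100)   # smallest core, 100MHz
    print(f"full-featured core at 1GHz: about {full:.0f} pJ per cycle")
    print(f"smallest core at 100MHz:    about {smallest:.0f} pJ per cycle")
```

That works out to roughly 90 picojoules per cycle for the fully featured core and roughly 60 for the smallest one, assuming the quoted figures are averages over the whole clock period.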

We understand ARM's licensees poked the Brit chip architects for an A7 successor with 64-bit mode and other A50-grade features, plus an instruction queue that's more friendly for browser-like apps, for the next generation of cheap handhelds.

"This is a combination of us working with silicon partners and manufacturers to get a balance between performance and efficiency and the die area of the cores," James Bruce, ARM's director of mobile solutions, told The Register.

"Typically, mobile workloads, browsing and gaming are memory intensive. The processor has to fetch a lot of data from RAM. The A7 hit a very sweet spot of performance and balancing power, and for the A35, we made a decision to look at improving mobile workloads.

"As always, there is a gate budget: you have to deliver the biggest bang for the gates you’ve got."

More improvements ... The A35 borrows L1 and L2 cache features from its Cortex cousins to get a leg up from the A7

This is the fascinating thing about ARM's business model: the Cambridgeshire-based company is constantly in talks with its customers, tweaking and refining cores to suit them, and then licensing the result. It is no surprise to hear of rumors that Google is tapping up people with chip design experience for mobile hardware: it is a trivial expense for the web giant to get out its checkbook, license a core design (say, one of ARM's), customize it, and get someone to fabricate a chip meeting the goliath's requirements.

Mode operandi ... Features of the A35 when in 32-bit AArch32 or 64-bit AArch64 mode

Browse to arm.com later today for more details. ®


