Original URL: https://www.theregister.com/2014/06/04/amd_releases_new_kaveri_mobile_apus/

AMD tops processor evolution with new mobile Kaveri chippery

'Big A' architecture change occurs 'once every 15, 20 years,' says AMD CTO

By Rik Myslewski

Posted in Personal Tech, 4th June 2014 04:01 GMT

Deep Tech AMD has unveiled the mobile version of its "Kaveri" desktop processor, topping off its series of processors in which CPUs and GPUs not only reside on the same die, but also work together in shared-memory harmony – or heterogeneity, to be more precise.

AMD Kaveri for Mobile: overview

The top end of AMD's new mobile Kaveri line provides an impressive list of specs

"This a culmination of five years of work to build out this APU roadmap," AMD senior director of mobility solutions Kevin Lensing told a gaggle of hacks at a briefing last month.

We're not entirely clear where Lensing gets his "five-year" benchmark – perhaps some internal AMD starting gun – but the concept of the APU, or accelerated processing unit, what AMD calls its CPU/GPU/whatever mashup, was first introduced at an AMD financial analysts' event in 2006, first demoed at Computex in June 2010, and first shipped in November of that year in the 18-watt Zacate and 9-watt Ontario parts.

AMD has also been working on the CPU/GPU shared-memory heterogeneous system architecture (HSA) for some time, as well, tossing an OpenCL net over their CPUs and GPUs as early as 2009, and talking heterogeneity up big at their 2011 Fusion Summit. AMD was also one of the founding members of the HSA Foundation in June 2012, along with ARM, Imagination, MediaTek, and TI.

Now, Wednesday's release of the new HSA-enabled Kaveri chip mobilizes AMD's heterogeneous product stable, following this January's release of the desktop Kaveri, April's demo of the HSA-enabled "Berlin" Opteron server chip due later this year, and last month's introduction of the HSA-enabled "Bald Eagle" APU for high-end embedded applications.

At each of those announcements, AMD asked if we in the computing world would be so kind as to toss the terms CPU and GPU into the lexical Osterizer and blend them into one. "CPU and GPU cores are now sort of equal citizens in terms of their ability to do compute work," Lensing explained at the briefing, "so we'll combine them together into a very simple concept that we'll call 'compute cores'."

But CPUs and GPUs aren't, of course, equal – CPUs excel in linear processing and GPUs show their muscle when handling highly parallel tasks. Lensing is not trying to fool anybody, however – he's merely a marketing guy who needs to simplify concepts when talking to, for example, the retail channel. So let's play along, and think of the equality of CPU and GPU compute cores in a Marxian fashion: "From each according to his ability, to each according to his need."

What allows Kaveri's CPU and GPU compute cores to play well together are one lower-case and upper-case acronym, hUMA (heterogeneous Uniform Memory Access), and one lower-case and upper-case initialism, hQ (heterogeneous Queuing).

These two chunks o' tech exist to enable CPUs and GPUs to work together more efficiently. In the bad ol' pre-HSA days, if a CPU recognized that its associated GPU was better-suited to a task, it had to copy the relevant data from its own stash into the GPU's memory, then copy it back after the GPU had finished with it. The GPU was too stupid to manage scheduling, queuing, and the like.

Enter hUMA and hQ. In a nutshell, the hUMA architecture allows for one shared pool of memory for all CPU and GPU compute cores, and hQ assures that any compute core, whatever its specialty, can monitor task queues to accept and schedule tasks, for both itself and its brethren. [Enough of this "compute cores" crapola – just call 'em CPUs and GPUs.—Ed.]
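To make the difference concrete, here is a toy Python model. It is not AMD code: the function names and the byte counting are invented for illustration. It contrasts the pre-HSA pattern of copying data into a separate GPU buffer and back with a hUMA-style shared pool in which both sides work on the same buffer.

```python
# Toy model only: plain Python standing in for CPU and GPU memory.
# All names here are hypothetical and do not correspond to any AMD API.

def gpu_kernel(buf):
    """Stand-in for a data-parallel GPU task: double every element in place."""
    for i in range(len(buf)):
        buf[i] *= 2


def offload_with_copies(cpu_data):
    """Pre-HSA pattern: copy to GPU memory, compute, copy the result back."""
    bytes_copied = 0
    gpu_memory = bytearray(cpu_data)      # copy CPU -> GPU
    bytes_copied += len(gpu_memory)
    gpu_kernel(gpu_memory)
    cpu_data[:] = gpu_memory              # copy GPU -> CPU
    bytes_copied += len(gpu_memory)
    return bytes_copied


def offload_shared(shared_data):
    """hUMA-style pattern: the GPU works on the same buffer, nothing is copied."""
    gpu_kernel(shared_data)
    return 0


a = bytearray([1, 2, 3, 4])
b = bytearray([1, 2, 3, 4])
copied_old = offload_with_copies(a)
copied_new = offload_shared(b)
print(list(a), copied_old)
print(list(b), copied_new)
```

Both paths produce the same result; the copy-based one moves every byte twice, which is the overhead a shared memory pool is meant to remove.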

So exactly what cores are in the mobile Kaveri?

Enough about CPU and GPU cores as abstract concepts – how many of the li'l buggers does the mobile Kaveri APU have and how fast do they run? The answer, of course, is "it depends" – cores and clock speeds vary by model.

More on that later, but for now let's just talk about the top-end Kaveri part – prosaically named the AMD A-Series FX-7600P with Radeon R7 Graphics – which has four "Steamroller" CPU cores and eight "Graphics Core Next" (GCN) GPU cores – as was also true with January's top-end desktop Kaveri. The FX-7600P's CPU cores run at a base clock of 2.7GHz and boost to 3.6GHz, and its GPU cores max out at 686MHz.

AMD Kaveri for Mobile: cores overview

The new APU, AMD says, offers a big boost over its predecessors, 'Trinity' and 'Richland'

Steamroller is the third generation of AMD's Tonka toy–named CPU cores – from Bulldozer to Piledriver to Steamroller.

AMD CTO Joe Macri, also at the briefing, veered a bit off Lensing's "equal citizens" messaging regarding CPU and GPU cores. "I always tell folks that the most important compute device is the CPU core," he said. Macri noted that AMD has over 500 designers on its CPU team alone.

He also noted that Steamroller is an evolutionary design. "CPU design is about a lot of evolution," he said. "There's not a whole lot of revolution left in CPUs, but there's a lot of evolution left."

One area in which Kaveri's CPU cores are evolving is in the basic metric of instructions per cycle (or clock), usually expressed as IPC. "There is no glass ceiling in IPC," Macri said. "We used to think there was one, that a little bit above 1 was as much as we could ever get, but [AMD chief cores architect Jim] Keller really believes that we can keep pumping that IPC higher and higher as we go forward."

With the mobile Kaveri's release, Macri, Keller, and their 500 compatriots have evolved the CPU cores to raise IPC as much as 20 per cent over its predecessors, "Trinity" and "Richland", with an average IPC boost of around 10 per cent. Accomplishing this, of course, involved a boatload of tweaks. "It's complicated," Macri said. "There's no two ways about it."
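For a rough sense of what those percentages buy, per-core throughput is simply IPC multiplied by clock speed. The sketch below assumes a baseline IPC of 1.0, a made-up figure chosen for easy reading rather than anything AMD quoted, and uses the FX-7600P's 2.7GHz base clock together with the 10 and 20 per cent gains mentioned above.

```python
def instructions_per_second(ipc, clock_hz):
    """Throughput of one core: instructions per cycle times cycles per second."""
    return ipc * clock_hz


BASE_CLOCK_HZ = 2.7e9     # FX-7600P base clock quoted in the article
baseline_ipc = 1.0        # hypothetical baseline, not an AMD figure

for uplift in (0.10, 0.20):          # average and best-case IPC gains
    new_ipc = baseline_ipc * (1 + uplift)
    # Clock a baseline core would need to match the new core's throughput
    equivalent_clock = instructions_per_second(new_ipc, BASE_CLOCK_HZ) / baseline_ipc
    print(f"{uplift:.0%} IPC gain ~ baseline core at {equivalent_clock / 1e9:.2f} GHz")
```

In other words, a 10 per cent IPC gain at 2.7GHz does the work of an older core clocked at about 2.97GHz, without the power cost of actually running faster.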

AMD Kaveri for Mobile: CPU cores block diagram

The new Steamroller cores employ a variety of enhancements to boost IPC

As examples of that complicated effort, Macri cited the team's work on the front end of the processor, which resulted in a 30 per cent reduction in instruction-cache misses by "basically" adding 50 per cent more instruction cache. The team also knocked mispredicted branches down by about 20 per cent by both improving the branch-prediction algorithm and doubling the size of the branch target buffer to "about 10K entries," he said.

Scheduling efficiency is up by 5 to 10 per cent. "Scheduling is basically the window of instructions you're looking at," Macri said, "so we increased the window from 40 to 48 entries so we can look at more instructions and find things that don't conflict." One other front-end improvement involved removing conflicts by dumping the old shared integer decoder and simply having two independent ones.

Macri, Keller, et al. worked on the back end of Kaveri as well. For example, its predecessors' memory subsystems could only issue one store per cycle; Kaveri can issue two. They also increased the size of the queues through which loads and stores pass.

"All of this improves performance dramatically from an IPC point of view," Macri said. "These are all IPC tricks – the amount of work you get done in a single cycle. That's the most efficient way to improve CPU performance."

GPU cores get some love, as well

As mentioned above, Kaveri's GPU cores are based on AMD's GCN architecture, first unveiled in July 2011 and now extending throughout the full range of AMD's offerings. "Now every product in the AMD portfolio," Lensing said, "from the 2-watt tablet to the multiple-hundreds-of-watts discrete graphics to all the game consoles are now unified on the GCN architecture" – which, by the way, supports AMD's low-level Mantle API for juicing gaming performance, as well as DirectX 11.2.

AMD Kaveri for Mobile: GPU cores block diagram

AMD has devoted a hefty chunk of on-die real estate to Kaveri's GPU cores

Macri said that the mobile Kaveri's designers allocated nearly half of its entire 245mm², 2.41 billion–transistor, 28nm die to graphics and other accelerators – the other 53 per cent are filled by CPU cores, caches, I/O, power management, and other housekeeping stuff that chipsters often call the "uncore" – for one simple reason. "It's not about a spreadsheet or text or simple things like that anymore," he said. "Now it's all about visualization."

Data analysis has moved beyond numbers into shapes, graphs, 3D, and other visualization methods, he said, and a beefy graphics and multimedia subsystem is needed to keep up with both the parallelized crunching of that data and the presentation of the resulting analysis. "We all work with our eyes," Macri said. "One picture is worth more than a thousand words."

Well, there's that – but there's also gaming. According to Macri, stats from Steam show that 35 per cent of their gamers have rigs that are less powerful than the GPU in the mobile Kaveri.

Kaveri's GPU is essentially a version of AMD's "Hawaii" GCN cores – part of their "Volcanic Islands" series, successor to "Sea Islands" and precursor to "Pirate Islands", which should begin appearing next year. There are, however, a few changes.

The two big differences, Macri said, are coherency and context-switching – both key elements of HSA, for which Hawaii, being strictly a graphics architecture, had no need.

There are eight graphics compute units in the top-of-the-line Kaveri part, with a total of 512 IEEE 754-2008–compliant floating point–capable shaders among them, as well as a flat address space, which Macri characterized as "absolutely key." Some precision improvements have been added, as well.

AMD Kaveri for Mobile: single GPU core block diagram

The new Kaveri line includes parts with as few as three or as many as eight GPU cores

A 64KB local data share minimizes the off-die needs of the GPU, improving power efficiency. "This is a big performance-per-watt improvement of the graphics back end," said Macri.

One elegant feature of the HSA-capability of the mobile Kaveri's GPU is the fact that the eight "compute units" that comprise it are all asynchronous. They're all able to go off and do whatever the hell they want to do – or are told to do – whenever the hell they want without needing to consult with their brethren.

"They can each run their own set of tasks," Macri said. "They work off a set of dispatch queues – each one can manage up to eight queues – so they can basically be working with different pieces of different threads."

The addition of fast context-switching – one of the GPU cores' upgrades from Hawaii – is only employed when the GPU is performing a compute task, Macri emphasized. "We haven't applied context-switching to 3D graphics yet," he said, noting that the state of a 3D process currently occupies most if not all of a graphics engine. "It's very big," Macri said. "We're working on how to make that work, but we're not here to talk about that today."

So how fast is this puppy?

So how many FLOPS do all these compute cores – CPUs and GPUs working together – add up to? In his role as marketeer, Lensing was understandably effusive. "The total amount of compute FLOPS is unprecedented," he said. "With the mobile variants, with the clock speeds we're running, we're over 800 gigaflops ... which is pretty impressive in a mobile chip running in a thin and light notebook."

Lensing was quoting performance of the top-of-the-line FX-7600P – which tops out at 818 GFLOPS, if you want to be exact – but AMD announced nine members of the mobile Kaveri family on Wednesday, three in a standard voltage group with a 35-watt TDP, three in an "ultra-low voltage" group at 19 watts, and a third group of three in a business-focused Pro series – more on that series in a moment.
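The 818 GFLOPS figure can be roughly reproduced from the clocks given earlier for the FX-7600P (3.6GHz CPU boost, 686MHz GPU). The per-cycle numbers in this back-of-the-envelope sketch are assumptions, not figures from the briefing: 64 shaders per GCN compute unit, two single-precision operations per shader per cycle (one fused multiply-add), and eight single-precision operations per CPU core per cycle.

```python
# Back-of-envelope peak single-precision FLOPS estimate.
# Clocks and core counts come from the article; the per-cycle figures
# below are assumptions, not numbers AMD gave at the briefing.

CPU_CORES = 4
CPU_BOOST_GHZ = 3.6
CPU_FLOPS_PER_CORE_PER_CYCLE = 8      # assumption

GPU_COMPUTE_UNITS = 8
SHADERS_PER_COMPUTE_UNIT = 64         # assumption (512 shaders in total)
GPU_CLOCK_GHZ = 0.686
FLOPS_PER_SHADER_PER_CYCLE = 2        # assumption: one fused multiply-add

cpu_gflops = CPU_CORES * CPU_FLOPS_PER_CORE_PER_CYCLE * CPU_BOOST_GHZ
gpu_gflops = (GPU_COMPUTE_UNITS * SHADERS_PER_COMPUTE_UNIT
              * FLOPS_PER_SHADER_PER_CYCLE * GPU_CLOCK_GHZ)
total_gflops = cpu_gflops + gpu_gflops

print(f"CPU {cpu_gflops:.1f} + GPU {gpu_gflops:.1f} = {total_gflops:.1f} GFLOPS")
```

Under those assumptions the total lands at just under 818 GFLOPS, with the GPU contributing about six times as much as the CPU cores. That is a peak figure, not a measure of what real workloads will see.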

AMD Kaveri for Mobile: product SKUs

As you'll read in a bit, 'TDP is probably the worst way to describe anything anymore'

Lensing also rolled out slides that compared a few of these new parts with allegedly comparable Intel offerings: an AMD FX-7500 with four CPU and six GPU cores versus an Intel "Haswell" Core i7-4500U; an AMD A10-7300 with four CPU and six GPU cores versus a Haswell Core i5-4200U; and an AMD A8-7100 with four CPU and four GPU cores versus a Haswell Core i3-4010U. Each of the Intel parts had two compute cores running a total of four threads, along with Intel HD Graphics 4400.

Lensing's competitive analysis was based on three performance benchmarks: PCMark measuring system performance, 3DMark measuring graphics, and Basemark CL measuring compute. The Reg would like to emphasize that the results are numbers provided by AMD, and not our own testing – but according to the benchmark results Lensing provided, the AMD parts acquitted themselves quite well:

AMD Kaveri for Mobile: performance comparison with Intel

As might have been guessed, AMD easily outshines Intel in graphics performance

But these weren't the only performance numbers that Lensing provided. As we mentioned above, with the launch of the new mobile Kaveri line, AMD is also inaugurating a new AMD Pro A-Series, which he said is part of the company's goal of "reinvigorating our approach to commercial."

According to Lensing, the Pro Series will have a "better commitment to longevity," which he said is "really important to our business users – that we commit to them, that we'll continue to manufacture this product for them for a longer-duration life cycle."

The Pro Series will also provide a steady-as-she-goes bit of changelessness important to enterprise users. "We give you a software stack, an image," he said. "We'll commit to you that we'll maintain stability there, so you don't have to go and do driver and BIOS updates that in an IT environment is very much frowned upon."

The reason that AMD is inaugurating a Pro line, Lensing said, echoed much of what Macri had said about graphics becoming more important, and he believes that the increase of visualization in business plays into AMD's strengths.

"If you look at business usage 10 years ago, it was extremely 2D," he said, referring to spreadsheets, word processing, and data-driven types of work. Nowadays, however, graphics and multimedia-based workloads have come front and center in the enterprise, he argued, citing not only 3D data analysis and rendering, but also videoconferencing.

"So as workloads migrate from the old 2D world to the new 3D world," Lansing said, "what we can do with graphics and compute resonates much more with our business partners."

AMD's commitment to longevity and stability, along with "the new 3D world" – and also, most likely, AMD's history of lower prices than its main rival – apparently means something to HP, which announced on Wednesday that it is using Pro A-Series chips in its Elite business PCs, including the EliteBook 725, 745, and 755 G2 notebooks, EliteOne all-in-one, and EliteDesk desktop.

AMD is offering three APUs in its Pro A-Series line: the Pro A10-7350B and Pro A8-7150B, both with four CPU and six GPU cores, and the Pro A6-7050B, with two CPU and three GPU cores. Here's how they stack up, according to Lensing, against the same Intel parts in his previous comparison:

AMD Kaveri for Mobile: performance comparison with Intel – Pro Series

AMD's Lensing: the Pro line offers increased 'performance, longevity, and stability'

Oh, and he also threw in a two-core, two-thread Haswell Pentium 3556U, presumably just for shits 'n' giggles.

There's more to performance than just 'performance'

With all the talk about HSA, CPU tweaks, GPU upgrades, performance comparisons, and the like, we'd be remiss if we didn't mention some of the other goodies resident on the new mobile Kaveri's die – namely, media processing for video and audio.

"When I talk about media playback, it's not about decoding a video stream and just presenting it," Macri said. "That's what everybody does. What we like to do at AMD is post-process it." What he was referring to are such post-processing niceties as removal of jitter, improvement of color gamut, edge sharpening, and the like.

Speaking of the 4K video–resolution capabilities of the new mobile Kaveri, Macri noted that although 4K is all well and good, there's very little 4K content available at present. "The best way to experience 4K today," he said, "is to take 1080p content and upscale it. Well, if you upscale it, you better have some great post-processing. Otherwise you're going to end up with a pretty ugly image on your beautiful monitor."

And guess what he says that his new APU has? Yup: great post-processing, enhanced by its HSA capabilities. "The ability for us to work on the right parts of the problem with the right hardware without moving the data improves performance, lowers power, gives us more accuracy," he said.

The new Kaveri mobile APUs also include three accelerators – coprocessors – that aid in media processing; one exists to enable AMD's TrueAudio technology.

AMD Kaveri for Mobile: TrueAudio block diagram

TrueAudio can provide Kaveri's CPU cores with some welcome vacation time

The TrueAudio coprocessor consists of multiple digital signal processors (DSPs) with onboard data stores totaling 384KB of shared memory, and with its own direct memory access (DMA) engine for streaming.

"Basically, if you're watching a movie," Macri said, "the only thing the CPU needs to do in this case is take the audio/visual stream, crack it, ship the video to our decoder, ship the audio to our [audio coprocessor], and that's all the CPU is doing."

This frees the CPU from having to waste its time and energy on audio matters, and since the TrueAudio coprocessor is a dedicated piece of hardware, the quality of service (QoS) can be enhanced – no skips or glitches that might occur when the CPU is called away for other tasks, Macri said.

The TrueAudio coprocessor is also programmable, so developers can take advantage of capabilities such as directional audio, noise cancelation, beam forming, and the like. "We already have some games that are starting to take advantage of this now," Macri said.

The new Kaveri mobile APUs also have a video coding engine (VCE) and a unified video decoder (UVD). The biggest change in the VCE from Kaveri's predecessors Trinity and Richland, Macri said, is the addition of the YUV444 color encoding, which he said "allows you to create perfect text."

AMD Kaveri for Mobile: Video Coding Engine (VCE) block diagram

Kaveri offers one important upgrade from its predecessors: crisp text over wireless

This is important because of the emerging 60GHz WiGig tech from Wilocity and others, which enables wireless docking to displays. "If you want to have a wireless dock, you want a wireless monitor with great text, you need to have the right encoding for it," Macri said – and YUV444 is that encoding.
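Why full-resolution color information matters for text can be shown with a toy example. The article does not say which format the older VCE used; the sketch below assumes, purely as a point of comparison, 4:2:0-style subsampling in which each 2x2 block of pixels shares one averaged color sample. The pixel values and function names are made up.

```python
# Toy chroma plane (one color-difference channel), 4x4 pixels.
# 128 is neutral; 240 marks a one-pixel-wide colored stroke, as in thin text.
chroma = [
    [128, 240, 128, 128],
    [128, 240, 128, 128],
    [128, 240, 128, 128],
    [128, 240, 128, 128],
]


def subsample_420(plane):
    """Average each 2x2 block into one sample (4:2:0-style, assumed here)."""
    return [
        [(plane[y][x] + plane[y][x + 1] + plane[y + 1][x] + plane[y + 1][x + 1]) // 4
         for x in range(0, len(plane[0]), 2)]
        for y in range(0, len(plane), 2)
    ]


def upsample_420(small):
    """Rebuild a full-size plane by repeating each sample over its 2x2 block."""
    full = []
    for row in small:
        wide = [v for v in row for _ in range(2)]
        full.extend([wide, list(wide)])
    return full


roundtrip_444 = [row[:] for row in chroma]            # 4:4:4 keeps every sample
roundtrip_420 = upsample_420(subsample_420(chroma))

print(roundtrip_444[0])   # [128, 240, 128, 128]
print(roundtrip_420[0])   # [184, 184, 128, 128]
```

With every sample kept, the stroke survives intact. With shared samples it smears across two pixels at reduced intensity, which on screen shows up as colored fringing around small text.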

Speaking of wireless connections, he claimed that "We have the lowest wireless latency out there: sub-40 milliseconds" – an obvious boon to gamers. And why is wireless technology important? Macri was clear on that point. "Wires are evil," he said. "I think they're just the worst thing in the universe."

There were also improvements made to the UVD between Trinity/Richland and Kaveri, the biggest being error resiliency. "When you're decoding," he said, "if you get an error, many times it can affect multiple frames. And what we've done with this design is limit it to just one frame."

AMD Kaveri for Mobile: Unified Video Decoder (UVD) block diagram

Playing a video that was encoded crappily? Kaveri's UVD can help

Errors are essentially inevitable, and can come from multiple sources: memory glitches or the quality of the encoding itself, for example. "You're really at the mercy of the encoder," Macri said – but if you can limit their effect to just one frame, they may be barely noticeable.

He also touted the many power optimizations in all areas of the new chips, saying that there are "thousands" of monitors scattered around the Kaveri die, some keeping track of temperature, and many more tracking activity. "What we're trying to do is understand what's happening out in the silicon in all the different blocks," he said – CPU and GPU cores and their subsystems, I/O, memory subsystems, data paths, caches, coprocessors and their subsystems, whatever – and use all that activity data as a proxy for actual temperature readings.

All that information is brought back to what he described as "basically a central processor," which is programmed differently for different members of the Kaveri family. That processor keeps track of what's happening throughout the die, then boosts, throttles, shuts down, and maintains processes, cores, accelerators, or whatever to run the whole chip at maximum efficiency.

Briefly and selectively boosting a clock makes a mockery of stated TDPs, Macri said. "TDP is probably the worst way to describe anything anymore – we do it because it's easier," he argued.

"When you describe things to people, one number sticks, right? If I give you a transfer function, you look at me with googly eyes, right? But if I give you 15 watts it'll stick in your brain. Trust me, 15-watt systems are not 15-watt systems – we go boosting way above, we're moving all over the place, very quickly."

Since it takes time for heat to move around on a piece of silicon, he said, creative control of how one area on the die can act as a heat sink for another area makes it possible to squeeze every bit of performance out of Kaveri without blowing the power budget.

"Most importantly," he emphasized, "it takes into account what you're doing, when you're doing it, and reacts to it in an very unique way – because only you are going to do what you do the way you do it when you do it how you do it. And we will react to that dynamically. We won't react to it statically."

Despite that selective, dynamic boosting, however, battery life in a Kaveri-equipped system should be impressive, he claimed – including idle power. "One of the key things in life is to learn how to do 'nothing' well. Whether it's real life or computers, it's very important – or you'll burn yourself out for no reason."
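The boosting behavior Macri describes can be caricatured in a few lines. This is a toy model, not AMD's algorithm: the thermal constants, the two power levels, and the rule "boost while the estimated temperature leaves headroom" are all invented for illustration. It estimates temperature from power draw, in the spirit of using activity as a proxy for temperature readings, and shows how a part can run above its nominal power in bursts because the die takes time to heat up.

```python
# Toy boost controller. Every constant below is hypothetical.
AMBIENT_C = 40.0        # temperature the die settles to at zero power
LIMIT_C = 90.0          # temperature the controller must not exceed
DEG_PER_WATT = 2.5      # steady-state temperature rise per watt
TIME_CONSTANT_S = 5.0   # how slowly the die heats up and cools down
NOMINAL_W = 15.0        # sustained power level
BOOST_W = 30.0          # short-term boost power level
STEP_S = 0.1


def simulate(seconds):
    """Return a list of (power, estimated temperature) samples for a busy chip."""
    temp = AMBIENT_C
    trace = []
    for _ in range(round(seconds / STEP_S)):
        # Boost only while the estimate leaves headroom under the limit.
        power = BOOST_W if temp < LIMIT_C - 5.0 else NOMINAL_W
        # First-order thermal model: drift toward the steady state for this power.
        target = AMBIENT_C + DEG_PER_WATT * power
        temp += (target - temp) * STEP_S / TIME_CONSTANT_S
        trace.append((power, temp))
    return trace


trace = simulate(60)
boost_steps = sum(1 for power, _ in trace if power == BOOST_W)
peak_temp = max(temp for _, temp in trace)
average_power = sum(power for power, _ in trace) / len(trace)
print(f"boosted {boost_steps * STEP_S:.1f}s of 60s, "
      f"peak {peak_temp:.1f}C, average {average_power:.1f}W")
```

In this model the chip starts cool, runs at boost power until the estimate nears the limit, then flips between the two levels to hold its temperature. Average power ends up above the nominal figure, which is the sense in which a "15-watt system" is not really a 15-watt system.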

Wrapping up, Macri returned to the importance of the heterogeneous system architecture – HSA – that underpins not just Wednesday's announcement of the new Kaveri mobile APU, but which now reaches across AMD's line.

"'Big A' architecture changes happen once every maybe 15, 20 years," AMD's CTO said. "HSA is a big one." ®