The Register® — Biting the hand that feeds IT


AMD reveals potent parallel processing breakthrough

Upcoming Kaveri processor will drink from shared-memory Holy Grail

AMD has released details on its implementation of The Next Big Thing in processor evolution, and in the process has unleashed the TNBT of acronyms: the AMD APU (CPU+GPU) HSA hUMA.

Before your eyes glaze over and you click away from this page, know that if this scheme is widely adopted, it could be of great benefit to both processor performance and developer convenience – and to you.

Simply put, what AMD's heterogeneous Uniform Memory Access (hUMA) does is allow central processing units (CPUs) and graphics processing units (GPUs) – which AMD places on a single die in their accelerated processing units (APUs) – to seamlessly share the same memory in a heterogeneous system architecture (HSA). And that's a very big deal, indeed.

Why? Simple. CPUs are quite clever, speedy, and versatile when performing complex tasks with myriad branches, but are less well-suited for the massively parallel tasks at which GPUs excel. Unfortunately, they can't currently share the same data in memory.

In today's CPU-GPU computing schemes, when a CPU senses that a process upon which it is working might benefit from a GPU's muscle, it has to copy the relevant data from its own reservoir of memory into the GPU's – and when the GPU is finished with its tasks, the results need to be copied back into the CPU's memory stash before the CPU can complete its work.

Needless to say, that back-and-forthing can consume a wasteful amount of clock cycles – and that's the limitation that AMD's upcoming Kaveri APU, scheduled to appear in the second half of this year, will overcome.
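The round trip described above can be sketched as a toy model in Python. This is purely illustrative: real GPU offload would go through an API such as OpenCL or CUDA, and the buffers here merely stand in for separate physical memory pools.

```python
# Toy model of the copy overhead in a discrete CPU/GPU memory scheme.
# "bytearray" copies stand in for host-to-device and device-to-host
# transfers; the loop stands in for the GPU's parallel work.

def offload_with_copies(cpu_buf):
    """Discrete scheme: copy in, compute on the GPU's copy, copy back."""
    gpu_buf = bytearray(cpu_buf)           # host-to-device copy
    for i in range(len(gpu_buf)):          # the massively parallel bit
        gpu_buf[i] = (gpu_buf[i] * 2) % 256
    cpu_buf[:] = gpu_buf                   # device-to-host copy
    return cpu_buf

def offload_shared(shared_buf):
    """hUMA-style scheme: both sides see one buffer, no copies needed."""
    for i in range(len(shared_buf)):
        shared_buf[i] = (shared_buf[i] * 2) % 256
    return shared_buf

data = bytearray([1, 2, 3, 4])
print(list(offload_with_copies(bytearray(data))))  # [2, 4, 6, 8]
print(list(offload_shared(data)))                  # [2, 4, 6, 8]
```

Both versions compute the same result; the difference is the two extra whole-buffer copies in the first, which is exactly the cost hUMA is designed to remove.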

AMD's hUMA architecture: comparison of memory systems in CPU, APU, and APU with heterogeneous system architecture

With hUMA, CPU and GPU memory is united in one cache-coherent space (click to enlarge)

The secret sauce that Kaveri will bring to the computing party is hUMA, a scheme in which both CPU and GPU can share the same memory stash and the data within it, saving all those nasty copying cycles. hUMA is cache-coherent, as well – both CPU and GPU have identical pictures of what's what in both physical memory and cache, so if the CPU changes something, the GPU knows it's been changed.

Importantly, hUMA's shared memory pool extends to virtual memory as well – including pages that reside far away, relatively speaking, on a system's hard drive or SSD. The GPU does need to ask the CPU to tell the system's operating system to fetch the required data from disk, but at least it can get what it wants, when it wants.

AMD's hUMA architecture: uniform memory access

In a hUMA system, the GPU can access the entire memory space, virtual memory included (click to enlarge)

At this point, you might well be asking, "All well and good, but what's in it for me?" Glad you asked.

From a user's point of view, hUMA will make CPU-GPU mashups – in AMD parlance, APUs – more efficient and snappier. Better efficiency should improve battery life and make hUMA-compliant processors more amenable to tablets and handsets. Snappier performance means, well, snappier performance.

From a developer's point of view, hUMA should make it significantly easier to create apps that can exploit the individual powers of CPUs and GPUs – and, for that matter, other specialized cores such as video accelerators and DSPs, since there's no compelling reason that they should be forever locked out of hUMA's heterogeneous system architecture party.

Developers shouldn't have much trouble – if any – exploiting hUMA, since AMD says it will be compatible with "mainstream programming languages," meaning Python, C++, and Java, "with no need for special APIs."

Also, it's important to note that although AMD was the company to make the hUMA announcement and will be the first to release a hUMA-compatible chip with Kaveri, the specification will be published by the HSA Foundation, of which AMD is merely one of many members along with fellow cofounders ARM, Imagination Technologies, Samsung, Texas Instruments, Qualcomm, and MediaTek. Should some – all? – of these HSA Foundation members adopt the shared-memory architecture, hUMA goodness could spread far and wide.

In fact, hUMAfication already appears to be on the way – and not necessarily where you might have first expected. AMD is supplying a custom processor for Sony's upcoming PlayStation 4, and in an interview this week with Gamasutra, PS4 chief architect Mark Cerny said that the console would have a "supercharged PC architecture," and that "a lot of that comes from the use of the single unified pool of high-speed memory" available to both the CPU and GPU.

Sounds like hUMA, eh? ®

Re: What was UMA architecture then?

The key difference is not on the diagram. When a process on a CPU tries to access some memory, the address that the process selects is a virtual address (back then: a 32-bit number, now often a 64-bit number). The CPU tries to convert the virtual address into a physical address (a different number, sometimes a different size). There are several uses for this rather expensive conversion:

Each process gets its own mapping from virtual to physical addresses - this makes it very difficult for one process to scribble all over the memory that belongs to a different process.

The total amount of virtual memory can exceed the amount of physical memory. (Some virtual addresses get marked as a problem. When a process tries to access such a virtual address, the CPU signals this as a problem to the operating system. The operating system suspends the process, assigns a physical address for the virtual address, gets the required data from disk into that physical memory then restarts the process.)

Sometimes it is just convenient - the mmap function makes a file on a disk look like some memory. If a process tries to read some of the mapped memory, the operating system ensures data from the file is there before the read instruction completes. If a process modifies the contents of mapped memory, the operating system ensures the changes occur to the file on the disk.
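The mmap behaviour described above can be seen directly from Python's standard library (the file path here is arbitrary):

```python
import mmap
import os
import tempfile

# Create a small file, then map it into the process's address space.
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(path, "wb") as f:
    f.write(b"hello disk")

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as m:
        print(m[:5])       # reads come from the file: b'hello'
        m[:5] = b"HELLO"   # writes to the mapping land in the file

with open(path, "rb") as f:
    print(f.read())        # b'HELLO disk'
```

The process never calls read() or write() on the mapped region; the operating system's paging machinery moves the data between disk and memory behind the scenes, exactly as the comment describes.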

In UMA, the CPU and the GPU access the same physical memory, but the GPU only understands physical addresses. When a process wants some work done by the GPU, it must ask the operating system to convert all the virtual addresses to physical addresses. This can go badly wrong because a neat block of virtual addresses could get mapped to a bunch of physical addresses scattered all over the memory map. Worse still, some of the virtual memory could map to files on a disk and not have a physical address at all. The two solutions are to have the operating system copy the scattered data into a neat block of contiguous physical addresses or for the process on the CPU to anticipate the problem and request that some virtual addresses map to a neat contiguous block of physical addresses before creating the data to go there.

Plan B looks really good until you spot that the operating system might not have such a large block of physical memory unassigned. It would have to create one by suspending the processes that use a block of memory, copying the contents elsewhere, updating the virtual-to-physical maps, and then resuming the suspended processes. It gets worse. That huge block of memory cannot be paged out if it is not being used, and the required contents might already be somewhere else in memory, so it will have to be copied into place instead of being mapped.

All this hassle could be avoided if the GPU understood virtual addresses. That would cut down on the expensive copying (memory bandwidth limits the speed of many graphics-intensive tasks). The downside is that it adds to the burden of the address-translation hardware, which already does a huge and complicated task so fast that many programmers do not even know it is there.


This is pleasing

More good news technology stories like this and I shall renew my subscription.


Re: This is pleasing

Be warned, you need a sense of hUMA.

No, the *dirty* Mac, thanks.


Re: What was UMA architecture then?

In their current APUs the GPU doesn't interact with memory in the same way as the CPU does. That's in spite of the fact that they're on the same die and ultimately share the same DDR3 memory bus. In that sense the arrangements are slightly non-uniform, and you have to copy data in order to get it from one realm to another.

This new idea means that the GPU and CPU interact with memory in exactly the same way, and that makes a big difference. Software is simpler because a pointer in a program in the CPU doesn't need to be converted for the GPU to be able to use it. That helps developers. More importantly the "GPU job setup time" is effectively zero because no data has to be copied in or out first. That speeds up the overall job time.
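The point about pointers needing no conversion can be illustrated, loosely, with Python's memoryview: two pieces of code share one buffer, and a write through either name is immediately visible through the other, with no copy and no translation step.

```python
# A loose analogy for CPU and GPU sharing one pointer: "buf" is the
# memory, "view" is a second consumer of the very same bytes.
buf = bytearray(b"shared data")
view = memoryview(buf)    # no copy is made here

view[0:6] = b"SHARED"     # a write through the view...
print(buf)                # ...is visible via buf: bytearray(b'SHARED data')
```

Because nothing is staged into a second buffer, the "job setup time" for the second consumer really is zero, which is the commenter's point about hUMA.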

I like it!


Re: No more worrying about Graphic card memory

A co-processor is a co-processor. If it can act in place of the CPU with less nonsense then that's useful regardless of whether or not the co-processor is on the same die. This is just turning a GPU into a fancier math co-processor.

Surprised it hasn't been done yet actually.

