Imagination uncloaks 'over 20X faster' smartphone GPU
'Rogue' embraces heterogeneous future
CES 2012 Imagination Technologies has revealed a pair of new GPU IP cores that it claims will supply new heights of performance to the likes of Apple, TI, Samsung and others who use the UK company's low-power graphics cores.
"You are now talking about, from a gigaflop point of view, supercomputing in hundreds of milliwatts," Imagination's VP of marketing Tony King-Smith told The Reg about the company's new designs.
The first of the PowerVR Series6 GPU IP cores announced at CES, the PowerVR G6200 and G6400, are based on Imagination's new architecture, which the company calls "Rogue". Strictly speaking, Rogue isn't new in the sense of "new from the ground up". Rather, as PR director David Harold told The Reg, "It's an evolution, but there are actually a lot of new features involved."
Harold told us that there's been a "shift in the dialog" about GPU IP cores, from considering just the more traditional metrics of how many polygons a GPU core can push, to how the GPU's compute capability can supplement the CPU in a heterogeneous CPU/GPU system.
"We're talking about compute mattering more," Harold said, "So the architecture we're defining is about these compute clusters – these arrays of compute elements that go together." The G6200 comes with two of these compute clusters, and the G6400 has four.
"Within a Rogue core," King-Smith told us, "there's two or four clusters of execution elements, and each execution cluster is actually an array [of] execution units. You're building basically layers of hierarchy."
In the company's earlier PowerVR SGX Series5 GPU IP core family, King-Smith explained, the model was execution pipelines, with each being essentially a data path. "Here," he explained, "we're talking about our lowest building block, instead of a single pipeline or data path, is now an execution cluster. It is an array of pipelines as the lowest building block."
"The big thing about Series6 and the GPUs is that they are parallel processors – true parallel processors, not just parallel pipelines," King-Smith said.
Move over, CPU
"This is why they're taking over from the CPU. If you're doing anything that's compute-intensive, you do it on the GPU, not the CPU, because the horsepower – the raw density of processing per square millimeter and per milliwatt – is vastly different," he explained.
There are, of course, coding trade-offs that need to be made to harness this horsepower. "You don't take C code and put it on a GPU," King-Smith said. "We keep talking with these people who say, 'We've got these tools that'll analyze you[r] C code, and identify the parallelism.' No. That isn't the way that this is going to work."
To take advantage of the parallel-processing capabilities of the Rogue architecture, developers will need to design their algorithms accordingly – but they have help in the form of the Khronos Group's open standard, OpenCL.
"The fundamental is that what you run on the GPU is low-level – think of it more like a DSP," King-Smith said, giving workload examples of an audio codec, a cryptography task, or what he referred to as "the classic" function, image processing.
"Most of your code will actually stay in C on the CPU," he said, "[except] the heavy lifting stuff. If you're trying to optimize on the CPU, that's increasingly the wrong way to do it." Instead, developers will use OpenCL APIs to offload parallel tasks to the GPU.
Imagination has developed a full set of OpenCL drivers not only for the new Series6, but also for its predecessor Series5 GPU IP cores, which have been around for some three years.
"Series5 wasn't designed for GPU compute, but the Series6 most definitely is – without losing the fact that it's a great graphics engine, as well," King-Smith hastened to add.
So how well does all this cluster-based chippery perform? Quite well, Imagination claims. "We're talking about cores using the Rogue architecture – part of our Series6 – which start from around 100 gigaflops and go up into the teraflops territory," as cores are added, Harold said.
King-Smith emphasized that performance in the range of a couple of hundred gigaflops could be achieved at mobile power-consumption levels. "Which is fairly dramatic," he said.
That drama comes from performance increases that a company release pegs as "20x or more of the performance of current generation GPU cores targeting comparable markets ... enabled by an architecture that is around 5x more efficient than previous generations."
While those numbers may seem rather extreme, King-Smith assured us that they weren't. "We're always trying to be very conservative about the numbers that we talk about," he said. "These numbers are very, very real," as he put it, "realized in a real benchmark running on a real chip."
This very, very real performance will show up in both tablets and smartphones, but Harold emphasized that the designs' power profiles are suitable for smartphones. "Smartphone is the core market. We would not design anything that is not for smartphones."
To find a home for the Rogue architecture in power-sipping smartphones, Harold told us that the company is aiming for the same power draw per square millimeter as achieved by previous Imagination GPU IP cores – with one caveat: the Series6 is bigger than its predecessors, so the number of square millimeters is greater, process size being equal.
But process sizes are shrinking, and – if Imagination's very, very real numbers are, indeed, very, very real – a Series6 core can get a lot more work done per watt than the Series5 and earlier, so it should be able to snuggle into smartphones quite comfortably.
Although Imagination has delivered the design to eight lead partners, it won't divulge who they might be – though one can only assume that current players such as the aforementioned Apple, Texas Instruments, and Samsung are in the mix.
As Harold reasonably pointed out, it's not the GPU IP core designer's place to announce when a partner's parts will begin shipping – we'll get that news from the chipmakers themselves. But he did say that some partners are at the development stage, while others have already developed parts.
"We would expect that as we get a few more months into this year," he said, "then we'll start to hear things from our partners about timescales."
"No one wants an IP supplier to go about shouting about their product plans," he said. When reminded that there are other IP companies that do, indeed, announce partners and parts, Harold revised that statement.
"No IP company that wants to maintain the kinds of customer intimacies that we have really wants to go around behaving like that," he said. ®
Those who follow the evolution into a heterogeneous-computing world are familiar with the discussions – arguments? – about unified versus separate caches for CPU/GPU mashups. In our discussion, King-Smith expressed no doubts. "A CPU should have a CPU cache," he said, "and a GPU should have a GPU cache."
shift in the dialog
Could we maybe shift it back? It's really hard to filter out any information from all the marketing shit.
"If you're doing anything that's compute-intensive, you do it on the GPU, not the CPU"
Not just anything – only tasks suited for it, i.e. independent calculations and parallel logic. Normal user software has a hard time scaling on multiprocessors, for the exact reason that the task at hand is largely sequential. Graphics work and certain maths (like photo/video handling) are very parallel in nature (the reason SIMD was invented) and thus can move to the GPU.
...was the best of a bad bunch – all the GMAs were/are shit, but this one was shit in a different way.
If you could actually get the thing to work it was good - but that was fairly difficult thanks to bad support and bad documentation.
Good luck with drivers
I'm sure this GPU core is revolutionary and amazing, but I can't shake the mental connection: whenever I see the name PowerVR I immediately associate it with the shit Intel GMA500 embedded graphics and their shoddy driver support.
So it'll probably be a lovely graphics core for an iPhone or Windows Mobile device, but you'll be straight out of luck for Android/Linux drivers.
Development effort the problem
The real problem is that, what with the very real read-back latencies to the GPU and the bitch of a time you have getting all those nice SIMD ALUs to take more or less the same branches at the same time, you have to expend a horrific amount of development effort to actually get this to scale without hitting one of, say, 16 bottlenecks – and it's not even clear that we are truly doing everything with SSE (or NEON) plus some cores that we could.
Some things will go this way... but it'll take time, and stay DSP- and niche-like for a long while. Part of the reason the cache-sharing debate is important is that CPUs can sometimes outperform GPUs on high-locality operations (e.g. < 64K) that fit in-cache. GPU caches also tend to be skimpy on the low end, and if you have a shared bus the GPU really can't clock the data out any faster than the CPU. So, distinct bus + distinct memory means improved performance.
Personally I don't think that this will really become mainstream until there is actually a unified set of SIMD/MIMD cores behind the same cache and an ISA that can rapidly switch between MIMD and SIMD types of operations. Oh, and the CPU needs to be able to read swizzled texture memory.