Graphics and media: much better this time – we promise
The performance of Intel's integrated, on-chip graphics has never been what one might call stellar. The company's graphics guru, Senior Fellow Tom Piazza, says that Haswell will not only provide significant improvements over the performance of the Sandy Bridge and Ivy Bridge graphics cores, but further improvements will appear in future generations as well.
Piazza reminded his IDF audience that in 2008 the company's goal was to get a ten-fold improvement in integrated graphics performance by 2010 – a goal dubbed 10X by 10, and one that Intel said it had achieved with the release of Sandy Bridge.
"I remember being here a few years ago with the 10X by 10 in Sandy Bridge, and I remember saying 'This is just the beginning'," he said. "I'll sit here and tell you today that Haswell is not the end. We're going to keep on going from here."
Intel's goal with the Haswell graphics architecture was – as is true with the compute cores as well – to improve performance without sacrificing power efficiency.
Like the compute cores, Haswell's graphics draw heavily from the graphics cores in the Sandy Bridge and Ivy Bridge architectures – a "similar microarchitecture with some embellishments on it," as Piazza put it. Some of those embellishments are designed to boost performance – significantly so in media handling – hand-in-hand with power savings. The graphics subsystem also falls in line with the compute cores in that it focuses on both modularity and scalability.
Ah, if graphics and media engineering were only as simple as drawing pretty pink and blue boxes (click to enlarge)
As an example of the design's scalability, Haswell graphics will be offered in three flavors: GT1, GT2, and GT3 (with the same driver stack), with the highest level doubling many of the sub-elements of the graphics core, but doing so without increasing its power requirements in some scenarios, Piazza said.
One trick that Haswell uses to accomplish this is to decouple the clocks of the compute cores, the graphics core, and the ring bus that connects them – they were locked together in Sandy Bridge and Ivy Bridge. In those two architectures, if you wanted to "turbo up" the graphics core and the ring bus when extra performance was desired, you had to raise the voltage and clock speed of the compute cores as well – a power waster.
In Haswell, Piazza explained, "The ring is totally isolated – a separate domain – and we can move the ring up and down, the graphics up and down, and the CPU up and down independently."
If you're running a Haswell part with GT3 graphics, for example, you have so much additional graphics performance that you can run the graphics innards at lower power, raise the ring bus just a bit to take advantage of all of GT3's bandwidth, and drop the compute cores to a lower voltage state to save juice.
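The power argument can be sketched with the classic first-order dynamic-power model. The numbers below are made up for illustration – they are not Intel's figures – but they show why letting each domain pick its own voltage/frequency state beats dragging every domain up together:

```python
# Illustrative sketch (not Intel's actual figures): why decoupling the
# CPU, graphics, and ring-bus clock domains can save power. Dynamic
# power scales roughly with C * V^2 * f, so turboing only the ring
# avoids paying the V^2 penalty on every domain at once.

def dynamic_power(capacitance, voltage, freq_ghz):
    """Classic first-order CMOS dynamic power model: P ~ C * V^2 * f."""
    return capacitance * voltage**2 * freq_ghz

# Hypothetical operating points: (voltage, frequency in GHz)
low  = (0.8, 1.2)   # low-power state
high = (1.1, 2.4)   # turbo state

# Sandy Bridge/Ivy Bridge style: one shared domain -- turboing the ring
# drags the compute cores and graphics up to the high state too.
locked = sum(dynamic_power(1.0, *high) for _ in ("cpu", "gfx", "ring"))

# Haswell style: each domain picks its own state. Only the ring turbos;
# CPU and graphics stay in the low state.
decoupled = (dynamic_power(1.0, *low)      # cpu
             + dynamic_power(1.0, *low)    # gfx
             + dynamic_power(1.0, *high))  # ring

print(f"locked:    {locked:.2f} (arbitrary units)")
print(f"decoupled: {decoupled:.2f} (arbitrary units)")
```

With these toy numbers the decoupled arrangement spends roughly half the dynamic power of the locked one for the same ring-bus turbo.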
There are a host of activities going on in Haswell's scalable graphics core, even in its most basic form (click to enlarge)
The basic building blocks of the Haswell processor, as mentioned above, are similar to those in the two previous generations, but with the addition of what Piazza called "some goodies in there for new features." There is, for example, a new resource streamer that handles in hardware many of the buffer-management tasks that otherwise would be handled by driver software.
The GT3 implementation is particularly interesting – look for it in higher-end client Haswell chips when they start shipping, likely late next year or early in 2014. Compared with GT2, GT3 has twice the amount of raw pixel and shader performance clock-for-clock, more vertex throughput, twice the rasterization and pixel backend, and twice the Level 3 cache – two interleaved Level 3 caches, to be exact – to feed the beast with about half a terabyte of bandwidth within the chip.
Piazza would not be drawn out, however, on how many execution units (EUs) are in each of the three GT implementations. When asked how many general-purpose computing on GPU (GPGPU) single-precision flops the Haswell chips would be able to achieve, he demurred. "Now you're asking me to tell you how many shader units, etcetera," he said, "and I'm not going to disclose that."
He did, however, offer a hint. "Since we're not saying how many shaders, each of our shader units can do eight FMAs [fused multiply-adds – remember them from the compute core?] per clock."
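That hint is enough for some back-of-the-envelope arithmetic: an FMA counts as two floating-point operations, so peak single-precision throughput is simply shader units × 8 × 2 × clock. The EU count and clock speed below are purely hypothetical placeholders – Intel disclosed neither:

```python
# Back-of-the-envelope sketch using Piazza's hint: each shader unit
# (EU) performs 8 fused multiply-adds per clock, and an FMA counts as
# two floating-point operations (one multiply plus one add). The EU
# count and clock speed are hypothetical -- Intel did not disclose them.

def peak_sp_gflops(eu_count, clock_ghz, fmas_per_clock=8):
    # FMA = one multiply + one add = 2 flops
    return eu_count * fmas_per_clock * 2 * clock_ghz

# e.g. a hypothetical 20-EU part running at 1.0 GHz:
print(peak_sp_gflops(20, 1.0))  # -> 320.0 single-precision GFLOPS
```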
Haswell's GT3 is like Wrigley's gum: double your pleasure, double your pixels, shaders, bandwidth... (click to enlarge)
In addition to these graphics tune-ups, Haswell also features a raft of media-processing improvements, including video-codec expansions and video- and image-processing enhancements, with – of course – power management and optimizations thrown in for good measure.
Haswell, for example, adds decoding for SVC – scalable video coding, an extension to the H.264 codec standard – to the AVC, VC1, and MPEG2 support that's in Ivy Bridge. SVC, said Intel Fellow Hong Jiang, is a "key enabler" for multi-participant video conferencing and streaming-media servers.
Also aboard Haswell are a motion JPEG (MJPEG) decoder for low-power USB-webcam video conferencing, and an MPEG2 hardware encoder useful for DVD creation and DLNA (Digital Living Network Alliance) digital media streaming – think music, photos, and video for home entertainment use.
Media enhancements include a dedicated Video Quality Engine (VQE) that adds color-gamut expansion – which Hong said maintains color saturation and improves visual quality – plus tunable skin-tone image enhancement, frame-rate conversion, and image stabilization to the suite of video-quality enhancements already in Ivy Bridge. Those include such niceties as de-noise and de-interlace circuitry, as well as adaptive contrast enhancement and more.
Media improvements hew to Intel's marketing push: it's all about 'the experience' (click to enlarge)
As with the graphics core, the GT1, GT2, and GT3 versions offer a rising range of media-handling capabilities, with GT3 providing twice the media-sampling and VQE throughput of GT2.
Haswell also adds support for 4K video – which, depending upon the industry 4K standard you prefer, refers to video resolutions of 4096-by-2304, 4096-by-2160, or 3840-by-2160 pixels. That last one, by the way, is also known as QFHD (quad full HD), and was recently adopted by Sony for its 4K video.
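For a sense of scale, here's a quick sketch comparing the pixel counts of those three "4K" resolutions against plain 1080p – QFHD works out to exactly four times full HD, hence the name:

```python
# Pixel counts, in megapixels, of the three common "4K" resolutions
# mentioned above versus plain 1080p full HD.
resolutions = {
    "4096 x 2304":        (4096, 2304),
    "4096 x 2160":        (4096, 2160),
    "3840 x 2160 (QFHD)": (3840, 2160),
    "1920 x 1080 (FHD)":  (1920, 1080),
}
for name, (w, h) in resolutions.items():
    print(f"{name}: {w * h / 1e6:.2f} Mpixels")

# QFHD is exactly four full-HD frames tiled 2x2:
assert 3840 * 2160 == 4 * 1920 * 1080
```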
Hong included in his presentation a slide that showed the powers of Haswell's tunable skin-tone processing, which works its image alchemy by examining a pixel and its neighbors, calculating the likelihood that the pixel is displaying a skin color, then adjusting the pixel and its neighbors to display the amount of detail desired.
With this capability – which Intel in its engineering geekiness calls the Skin Tone Tuned Image Enhancement Filter – you can merely enhance the skin tones in your subject's image, crank them up with higher contrast, or smooth them out – great news for those aging subjects who might want their wrinkles de-emphasized.
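The approach Hong described can be caricatured in a few lines. The heuristic below is a toy of our own devising – emphatically not Intel's actual filter – but it shows the examine-score-adjust shape: score each pixel's skin likelihood, then blend skin-like pixels toward their neighborhood average to smooth detail:

```python
# A toy sketch of the skin-tone-tuned idea described above -- NOT
# Intel's actual filter. Score each pixel's likelihood of being skin
# with a crude RGB heuristic, then blend the pixel toward its
# neighborhood average in proportion to that score (smoothing wrinkles
# where skin is detected, leaving everything else untouched).

def skin_likelihood(r, g, b):
    """Crude heuristic: skin tends to have R > G > B with moderate spread."""
    if r > g > b and (r - b) > 15 and r > 95 and b > 20:
        return min(1.0, (r - b) / 100.0)
    return 0.0

def smooth_pixel(center, neighbors, strength=0.5):
    """Blend `center` toward the neighborhood mean by likelihood * strength."""
    w = skin_likelihood(*center) * strength
    avg = [sum(c[i] for c in neighbors) / len(neighbors) for i in range(3)]
    return tuple(round((1 - w) * center[i] + w * avg[i]) for i in range(3))

skin = (200, 140, 110)        # plausibly skin-colored
sky  = (90, 140, 220)         # clearly not skin
hood = [(190, 135, 105)] * 8  # the 8 surrounding pixels

print(smooth_pixel(skin, hood))  # nudged toward the neighborhood average
print(smooth_pixel(sky,  hood))  # unchanged: zero skin likelihood
```

Cranking contrast up rather than smoothing is just a matter of blending away from the average instead of toward it.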
Such candidates for facial de-emphasization would most certainly include your aging Reg reporter, who after days of sitting in darkened IDF technical sessions packed with high-level chipheads emerged feeling rather wrinkly, indeed – with said wrinkles appearing not only on his tired visage, but also in his equally exhausted brain. ®
Deep, deep dive inside Intel's next-generation processor
Software decently written?
"The key here is that most software is actually fairly decently written"
He must be joking...
The Pentium 4 was deemed to have very poor performance because, to take advantage of it, software needed to be "fairly decently written" and compiled with a decent compiler. The problem is that, to date, there is only one compiler worth a damn for x86 – Intel's own (ICC).
I did some performance testing a while back:
Clock-for-clock, with crap compilers (GCC, PGCC), the Pentium 4 is about 40% slower than the Pentium 3. But with ICC, the Pentium 4's performance actually goes up by 20%, clock-for-clock, compared with a Pentium 3.
It's not just down to software being decently written (which, a lot of the time, it isn't) – it's also down to the compiler doing a decent job (which most don't). On the one hand, one could argue something along the lines of: "Pentium 4 didn't suck – you were merely too stupid to use it properly." Unfortunately, all of this is way, way beyond the average consumer to either understand or do anything about, and it is the consumer's perception that decides whether a product is going to be a success or failure.
Wait ... what?
"Intel CEO Paul Otellini has called "the third pillar of computing," security – the other two pillars being energy efficiency and internet connectivity."
This twat is a marketard, not an engineer ... he wouldn't know the difference between ones & zeros if he got 'em under his carefully manicured fingernails. He's part of the (current) reason Intel is heading for the bit-bucket.
By way of reference, the real three pillars of computing are memory, IO, and CPU ...
"We thought that you, inordinately intelligent and tech-savvy Reg reader, might enjoy a deep dive into their handiwork."
Have you read all the comments on The Register? :P
Re: that hidden message
Darker-skinned folks' faces are OK to start with.
Re: AVX2 on integers
The AVX2 (long long) integer operations will be in there for cryptographic processing.
The really interesting bit is the transactional-memory TSX extensions (IBM's equivalent is already well along the curve). TSX should be a big kicker for transaction processing and HPC, but writing software to take advantage of it is going to take a big paradigm shift away from garbage collection to Bedouin memory management.
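For readers unfamiliar with the model, the optimistic try/abort/retry/fallback shape of transactional memory can be mimicked in plain code. The sketch below is an illustration of the pattern only – not real TSX, which uses the _xbegin/_xend CPU intrinsics and hardware conflict detection, and which makes the check-and-commit genuinely atomic where this sketch merely gestures at it:

```python
# A plain-Python illustration of the optimistic, abort-and-retry
# programming model behind hardware transactional memory such as TSX.
# NOT real TSX code: in hardware, the conflict check and the commit
# are a single atomic event; here they are ordinary statements that
# merely sketch the control flow.

import threading

class VersionedCounter:
    def __init__(self):
        self.value = 0
        self.version = 0            # bumped on every commit
        self.fallback = threading.Lock()

    def increment(self, max_retries=3):
        for _ in range(max_retries):
            # "Transaction": snapshot, compute, then commit only if no
            # other writer slipped in (version unchanged).
            v0, snapshot = self.version, self.value
            new = snapshot + 1
            if self.version == v0:          # no conflict observed
                self.value, self.version = new, v0 + 1
                return True                  # committed optimistically
        # Repeated aborts: take the pessimistic lock, mirroring the
        # software fallback path TSX programs must provide.
        with self.fallback:
            self.value += 1
            self.version += 1
        return False

c = VersionedCounter()
for _ in range(5):
    c.increment()
print(c.value)  # -> 5
```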