Original URL: http://www.theregister.co.uk/2012/09/20/intel_haswell_microarchitecture_deep_dive/

Deep, deep dive inside Intel's next-generation processor

Join us on a whirlwind Haswell holiday – non-geeks heartily welcomed

By Rik Myslewski

Posted in Hardware, 20th September 2012 17:50 GMT

At Intel's developer shindig last week, chippery engineers spent a goodly amount of time conducting tech sessions that detailed the company's upcoming 4th-generation Core microprocessor architecture, code-named "Haswell."

We thought that you, inordinately intelligent and tech-savvy Reg reader, might enjoy a deep dive into their handiwork.

The new Haswell microarchitecture – likely named after the tiny Colorado town and not the Australian red-groined froglet – was touted by Intel's Architecture Group headman David Perlmutter as being "designed with mobility in mind." In pursuit of that goal, he said that Haswell will require just one-twentieth of the idle power – that's full platform power, not just CPU power – of the second-generation Core processors, code-named Sandy Bridge.

Perlmutter emphasized Haswell's future appearance in "sleek tablets" and Ultrabooks, followed "eventually" by desktops and workstations. In the more-technical sessions that followed Perlmutter's rather fluffy keynote, however, Intel engineers added data centers to Haswell's future turf.

During those sessions, a wealth of Haswell details were shared, explaining how Intel is counting on 22-nanometer Haswell chips to be faster, more power-miserly, and more media-friendly than their predecessors – and to finally move Intel into the tablet and handset market that continues to elude it.

Just like Ivy Bridge, except when it isn't

The design of Haswell's compute cores – two to four of them when it first appears in its client incarnations late next year – is an evolutionary, not revolutionary step beyond those in today's 22-nanometer, 3rd-generation Ivy Bridge processors.

"The starting point is what came before it with Sandy Bridge and Ivy Bridge," Intel engineer Ronak Singhal told attendees at one Haswell session, citing Turbo Boost technology, hyperthreading, integrated graphics on the same die as the compute cores, the ring interconnect between the various elements such as the compute and graphics cores, and the shared cache between those two core types.

"All of those are features that are carried over into the Haswell generation," he said. In many cases, however, those technologies had been tweaked – more on those in a moment.

Slide from Intel Developers Forum 2012 providing details of Intel's 4th Generation Core Processor, codenamed 'Haswell'

Meet Haswell, Intel's next-generation all-purpose processor microarchitecture

As The Reg told you last week, one of Haswell's key features is that its high degree of modularity will allow it to be used on a broader range of processors than Intel has attempted before in any of its previous microarchitectures – one core to rule them all, as it were.

Different usage models, of course, will require different power levels. How much juice you can feed a Haswell-based chip will be one factor determining how high its performance will be. A higher-performing chip will require more power – as if you didn't already know that.

But determining a processor's power usage is not simply a matter of deciding how much juice to feed it – there's a lot of in-chip dynamic power-management going on, and the Haswell engineers focused intently on tweaking the architecture's capabilities in that regard.

In addition to those tweaks, Haswell carries over a number of power-management features of Sandy Bridge and Ivy Bridge. Those chips have essentially two classes of power states: active and sleep. Haswell provides both of those two states, as well, but adds a new state that Intel calls "active idle".

Those blue circles hiding behind Haswell's green power states belong to Mr. and Ms. Sandy and Ivy Bridge

In this state, as explained by Intel Fellow Per Hammarlund, "The OS and the software on top of the hardware thinks that the hardware is active ... but in reality we're achieving power levels that are associated with the previous idle state."

This new active idle state is what enables Haswell to achieve the 20X improvement Perlmutter referred to in his keynote. "This is really what enables the key benefit in battery life for Haswell," Hammarlund said, noting that state changes can occur in milliseconds, or at most hundreds of milliseconds – swift, indeed.

If you're a developer, fear not about re-coding your apps to take advantage of this new capability. According to Hammarlund, it's all handled in hardware combined with firmware, and it will all be done for your app automagically and continuously.

He did offer the caveat that said magic will work only for "well-written software", but noted that "The key here is that most software is actually fairly decently written and will take advantage of these power modes ... and you will get these 20X idle power improvements for free."

In addition, Intel has improved power management by lowering the power required by both Haswell's active and sleep states, and by improving the transition time from active to sleep. The transition time from active idle to sleep, the company says, is also quite snappy.

Remember megahertz and gigahertz marketing? In today's mobile world it's all about power management

The active idle state is not the only new hotness in Haswell's power management. The temporary clock-speed boosting Turbo Mode has been tweaked to be more power-efficient, for example, as well as being extended upwards into higher gigahertz levels for more performance headroom.

One way the extra voltage has been freed to give Turbo Mode more boost is by decoupling the voltage and frequency of various elements on the die from one another to allow for more fine-grained power control. This provides the ability to better shift power from where it's not needed to where it is.

Other tweaks include finer-grained control of which parts of the die are on and off at any one time. "In reality, it's mostly about making sure everything is off all the time," Hammarlund said. "If you don't need it, it's off. That's the philosophy."

True chip geeks will be happy to know that Haswell has additional and deeper C-states – power modes – and that the transition times between C-states have been improved by as much as 25 per cent. Not a true chip geek, sir or madam? Don't worry about it – all this means is that your Haswell-equipped mobile device's battery will likely last longer.

Compute-core competencies

Although Haswell's compute cores are based on those used in the now-familiar 32nm Sandy Bridge and its follow-on 22nm Ivy Bridge, Intel has made a number of changes that should improve performance.

"The first thing to look at on the performance side," said Intel engineer Singhal, "is 'What are we doing for the software that exists today?'" One important factor in this effort, he said, was to keep all the compute-core pipelines essentially the same as they have been through Sandy Bridge and Ivy Bridge.

Intel has, however, made what he characterized as "significant" changes within the cores, including increasing the depth of the core's buffers, which gives the cores more flexibility in Haswell's out-of-order execution – the chip's ability to better optimize the flow of instructions and data, which also helps use more parallelism in execution.

Haswell has also improved branch prediction – the chip's attempt to accurately guess the correct path for program flow before it actually knows which way it will proceed in if-then-else processing. The more accurate a chip's branch prediction, the less frequently it will need to start a branch over if it guesses wrong, which wastes time.

And wasting time, in contemporary chip engineering, is worse than wasting money – it's wasting power.

According to Singhal, branch-prediction improvement is "something we tend to do every generation." Additionally, the fact that the execution pipelines haven't been lengthened from the ones in Sandy Bridge and Ivy Bridge assures that the "do over" time for an incorrect branch prediction is also not lengthened.
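For the curious, here is a toy model of how a branch predictor earns its keep. It is a textbook two-bit saturating counter written in Python, emphatically not Haswell's actual predictor, whose inner workings Intel did not spell out. The function name and the sample branch history are made up for illustration.

```python
def simulate_two_bit_predictor(outcomes, state=2):
    """Count mispredictions of a 2-bit saturating counter.

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    """
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1
        # Nudge the counter toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredictions


# Hypothetical loop branch: taken nine times, then not taken once, ten times over.
loop_history = ([True] * 9 + [False]) * 10
misses = simulate_two_bit_predictor(loop_history)
print(f"{misses} mispredictions out of {len(loop_history)} branches")
```

In this sketch the predictor guesses wrong only when each loop exits, so 10 of the 100 branches pay the "do over" penalty. Each of those misses costs roughly a pipeline's worth of work, which is why keeping the pipeline length unchanged matters.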

Haswell may be based on previous architectures, but it has plenty of new tricks up its silicon sleeves

Instruction buffers at the top end of Haswell's pipeline have been enlarged, which Singhal says will help the performance of apps that have a large code footprint, "which we're seeing become more and more common." They'll also help to improve Haswell chips' chances of initiating efficient code-execution parallelism – more on that in a moment, as well.

Sandy Bridge and Ivy Bridge chips can execute six operations per clock cycle; Haswell increases that number to eight. One of the "ports" for those new execution units now supports an additional integer arithmetic logic unit (ALU), which – you guessed it – provides another place to accomplish integer arithmetic or logical operations.

A new address-storing port has been added, as well, which will free up some of that duty from two existing ports.

Two new ports – no waiting – plus the efficient fusing of multiply and add functions

One key newbie to Haswell's microarchitecture is the addition of two fused multiply-add (FMA) floating-point operation units. One slick thing about FMA is that since it can compute both a floating-point add and multiply in the same cycle, doing so not only saves clock cycles but also creates the opportunity to round the result once, and not twice as required if the operations were done separately. This single-rounding capability improves mathematical precision.
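The single-rounding benefit is easy to see with a small experiment. The Python sketch below emulates a fused multiply-add by computing the exact result with rational arithmetic and rounding once, then compares it with the ordinary multiply-then-add that rounds twice. The helper names and the input values are chosen purely for illustration; this runs in software and does not touch any FMA hardware.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate FMA: compute a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def separate_multiply_add(a, b, c):
    """Two-step version: the product is rounded, then the sum is rounded."""
    return a * b + c


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which is too fine for a double near 1.0.
two_roundings = separate_multiply_add(a, b, c)  # product rounds to 1.0 first
one_rounding = fused_multiply_add(a, b, c)      # keeps the tiny -2**-60 term

print(two_roundings, one_rounding)
```

With two roundings the answer collapses to 0.0; with one rounding the true result, -2**-60, survives.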

In addition to these improvements over the existing Sandy Bridge and Ivy Bridge architectures, Haswell also introduces some new technologies and capabilities. Some of these new goodies, however, will require some software changes. "Ideally," Singhal said, "just a recompile, but in other cases a re-optimization of the code."

Is it just us, or do ports 0, 1, and 5 seem rather more well-rounded in their abilities?

One of these new capabilities is a set of compute instructions called AVX2, a beefing up of the AVX (advanced vector extensions) instructions introduced in Sandy Bridge – and no, Haswell's AVX2 has nothing to do with that other AVX2 that you content-creation types may know of from Avid.

Simply put, AVX – and now AVX2 – instructions are extensions to Intel's long-established SSE instructions that can operate on multiple data words in parallel, using a single instruction.

Here's a geek-cred tidbit: SSE is a bit of a recursive acronym – or initialism, for you purists among us – that stands for "streaming SIMD extensions"; SIMD stands for "single instruction, multiple data." (SIMD, by the way, is a true acronym, pronounced "sim-dee".) Drop that into conversation at your next cocktail party and watch members of the opposite sex swoon at your erudition. Or of your own sex, should that be your preference. Whatever.

What you need to know about AVX2 is that it will allow clever coders to speed up applications that are floating-point intensive by using FMA to double the peak number of both single-precision and double-precision floating-point operations per clock cycle per core. According to Intel, expect noticeable speed-ups in such applications as image and audio/video processing, scientific simulations, financial analytics, and 3D modeling and analysis.
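The doubling can be checked with back-of-the-envelope arithmetic. The Python sketch below assumes 256-bit vectors, two FMA units per Haswell core (as described above), and one vector add plus one vector multiply per cycle for Sandy Bridge and Ivy Bridge. That last figure is not stated in the text above, so treat it as an assumption. These are theoretical peaks, not measured results, and the function name is made up.

```python
VECTOR_BITS = 256  # AVX and AVX2 floating-point vectors are 256 bits wide


def peak_flops_per_cycle(element_bits, vector_ops_per_cycle, flops_per_lane):
    """Theoretical peak floating-point operations per clock cycle per core."""
    lanes = VECTOR_BITS // element_bits
    return lanes * vector_ops_per_cycle * flops_per_lane


# Assumed for Sandy Bridge / Ivy Bridge: one vector add plus one vector multiply.
snb_single = peak_flops_per_cycle(32, vector_ops_per_cycle=2, flops_per_lane=1)
snb_double = peak_flops_per_cycle(64, vector_ops_per_cycle=2, flops_per_lane=1)

# Haswell: two FMA units, each lane doing a multiply and an add per operation.
hsw_single = peak_flops_per_cycle(32, vector_ops_per_cycle=2, flops_per_lane=2)
hsw_double = peak_flops_per_cycle(64, vector_ops_per_cycle=2, flops_per_lane=2)

print(snb_single, hsw_single)  # single precision
print(snb_double, hsw_double)  # double precision
```

Under those assumptions the peak goes from 16 to 32 single-precision operations per cycle, and from 8 to 16 in double precision.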

If you're an old Intel hand, you'll appreciate the vast improvement of AVX2 over 1996's Pentium with MMX

AVX2 improvements don't stop with the FMA-enabled doubling, however. There are also new instructions for data permutation and shuffling, and a new gather instruction allows you to load data from multiple non-contiguous places in memory, reducing latency and freeing you from having to perform pesky hand-coded data bookkeeping.
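What a gather does is simple to state in code. The Python sketch below models the semantics only: one call fetches several scattered elements into a single vector, replacing the lane-by-lane loop a coder would otherwise write by hand. The function names, the fake memory array, and the index list are all illustrative and say nothing about how the hardware implements the instruction.

```python
def gather(memory, base, indices):
    """Model of a vector gather: fetch memory[base + i] for every index i."""
    return [memory[base + i] for i in indices]


def gather_by_hand(memory, base, indices):
    """The bookkeeping a gather replaces: one scalar load per vector lane."""
    lanes = [0] * len(indices)
    for lane, i in enumerate(indices):
        lanes[lane] = memory[base + i]
    return lanes


memory = list(range(100, 132))          # stand-in for an array in RAM
indices = [0, 7, 3, 19, 4, 30, 11, 2]   # eight non-contiguous elements

vector = gather(memory, 0, indices)
print(vector)
```

Eight indices fill one 256-bit vector of 32-bit elements, which is the sort of table lookup or sparse-array access that previously resisted vectorization.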

Finally, AVX2 goes beyond AVX in that it can also handle integer instructions, not just floating-point instructions. Your humble Reg hack is not completely clear as to exactly why that's so nifty – although I could venture a semi-educated guess or three – but Singhal's IDF audience seemed duly impressed.

Caching in, big-time

It's all well and good to have increased computing capabilities, but as Singhal said, "In order to actually use those functional units, you have to be able to feed them, you have to be able to get data to them." And that's where Haswell's improved caches come into play.

It's not that Haswell's Level 1 and Level 2 caches are any larger than those of Sandy Bridge and Ivy Bridge – they're not. What has been improved is the chip's ability to get data into and out of those caches more quickly.

You gotta like the improvement curve since 2008's Nehalem. Ah, how the time does fly

As with Sandy Bridge and Ivy Bridge, Haswell's L1 instruction and data caches remain at 32KB, and its L2 cache remains at 256KB. What's different is the performance that Haswell delivers from those caches by doubling the width of the L1 load and store ports to 32 bytes, or 256 bits, apiece – that's a lot of bits being made available to the execution units per cycle.

This is a boon to AVX2: Haswell can sustain two full AVX2 reads and a full AVX2 write in a clock cycle, according to Singhal.
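Singhal's "two reads and a write" figure translates into bytes per clock with a little arithmetic. The Python sketch below assumes each AVX2 access moves one full 256-bit register, and assumes the previous generation moved half as much per access. The function name is made up, and the numbers are peak figures rather than measurements.

```python
AVX2_VECTOR_BYTES = 256 // 8  # one 256-bit AVX2 register is 32 bytes


def l1_bytes_per_cycle(loads, stores, bytes_per_access):
    """Peak L1 data-cache traffic per clock for a given port arrangement."""
    return (loads + stores) * bytes_per_access


haswell = l1_bytes_per_cycle(loads=2, stores=1, bytes_per_access=AVX2_VECTOR_BYTES)

# Assumption: Sandy Bridge and Ivy Bridge moved 16 bytes per access.
previous = l1_bytes_per_cycle(loads=2, stores=1, bytes_per_access=16)

print(haswell, previous)
```

That works out to a peak of 96 bytes per clock for Haswell against 48 under the stated assumption, which is the doubling needed to keep the wider execution units fed.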

In addition to these bandwidth improvements, Singhal pointed to Haswell's improved cache-line latency and the elimination of cache conflicts in the new microarchitecture. "Previously there were cases when two loads may not be able to read the cache simultaneously because of the banking architecture that we used," he said. "We've removed those restrictions."

The take-away? Haswell can get more data to its execution units without those workhorses having to waste time and power twiddling their digital thumbs. The result? Better performance and better power savings.

A thick pile o' numbers, to be sure, but if you care about cache performance, we wager you'll like what you see

The L1 cache, as described above, was not the only beneficiary of Intel engineering love. The L2 cache was improved, as well. In Sandy Bridge and Ivy Bridge, a line could be read from the cache every other clock cycle; now it takes just one clock cycle to do the same.

Before we step out of Haswell's compute cores and sidle on over to its graphics core, there's one more new bennie to talk about: what Intel calls transactional synchronization extensions, or TSX.

TSX works to improve parallel processing. "We have parts today that we're shipping on the client side that go up to eight [execution] threads," Singhal said. "On the server side we support up to 20 threads per socket today, so if somebody puts together a four-socket system, they're supporting already up to 80 threads – and of course core counts will continue to go up on the server side."

To take advantage of those parallel threads, you of course need to develop parallel software. Duh. But that task can be a complete pain in the yinger, especially if all of those threads are working on the same data set in conjunction with one another.

Enter TSX, which endeavors to move the work of low-level optimization from the code writer over to the hardware upon which their workloads are running.

I won't dig deep into the intricacies of TSX except to say that it involves two technologies: hardware lock elision (HLE) and restricted transactional memory (RTM). Both require coders to insert what are essentially lock and unlock commands into their software that tell Haswell when to search out parallelism opportunities that are not explicitly written into the code itself.

Okay, okay, okay – parallel programming is hella hard, so how about some hardware help?

According to Singhal, HLE is more suitable for legacy code – developers will simply insert the XACQUIRE and XRELEASE commands into their code as appropriate, and RTM's XBEGIN and XEND commands give the devs more wiggle room. "[RTM provides] a little more flexibility, but a little more work, as well," he said.
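The idea behind both flavors can be sketched in a few lines: let threads run a locked region speculatively, and only fall back to really taking the lock if they turn out to touch the same data. The Python model below is a heavily simplified illustration of that conflict check, not Intel's implementation. The read-set and write-set bookkeeping, the function names, and the hash-table scenario are all assumptions made for the example.

```python
COMMIT = "both commit in parallel; the lock is never actually taken"
ABORT = "conflict: abort and retry, taking the real lock if needed"


def conflicts(tx_a, tx_b):
    """Two transactions conflict if one writes what the other reads or writes."""
    return bool(
        tx_a["writes"] & (tx_b["reads"] | tx_b["writes"])
        or tx_b["writes"] & tx_a["reads"]
    )


def run_concurrently(tx_a, tx_b):
    """Outcome of running two speculative critical sections side by side."""
    return ABORT if conflicts(tx_a, tx_b) else COMMIT


# Hypothetical hash table guarded by a single coarse lock.
insert_bucket_3 = {"reads": {"bucket3"}, "writes": {"bucket3"}}
insert_bucket_9 = {"reads": {"bucket9"}, "writes": {"bucket9"}}
resize_table = {"reads": {"bucket3", "bucket9"}, "writes": {"bucket3", "bucket9"}}

print(run_concurrently(insert_bucket_3, insert_bucket_9))
print(run_concurrently(insert_bucket_3, resize_table))
```

The two inserts never touch the same bucket, so in this model they proceed in parallel even though the code was written with one big lock. The resize touches everything, so it forces the old-fashioned serialized path.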

Before we leave the compute cores and move on to Haswell's graphics and media enhancements, let's take a quick look at one last bit of compute-core goodness: advancements in the microarchitecture's ability to help shore up what Intel CEO Paul Otellini has called "the third pillar of computing," security – the other two pillars being energy efficiency and internet connectivity.

"Today, cryptography is huge," said Intel CPU architect Bret Toll at one IDF technical session. "It's very important. Every time you get on the web and do any kind of transaction, it gets encrypted and decrypted, sometimes multiple times."

Those cryptographic functions take time and processor power, so to speed them Intel has added new crypto-supporting instructions to the Haswell microarchitecture and improved existing architectural features that support encryption and decryption. The wide vectors in AVX2 provide more cryptographic oomph, as well.

A bright, shiny dime to the first Reg reader who can decode each and every entry in this alphabet soup

Toll told his audience that although past microarchitecture generations had seen improvements in support for encryption and decryption, "As you can see, on Haswell I think we've hit it out of the park with security."

Graphics and media: much better this time – we promise

The performance of Intel's integrated, on-chip graphics has never been what one might call stellar. The company's graphics guru, Senior Fellow Tom Piazza, says that Haswell will not only provide significant improvements over the performance of the Sandy Bridge and Ivy Bridge graphics cores, but further improvements will appear in future generations as well.

Piazza reminded his IDF audience that in 2008 the company's goal was to get a ten-fold improvement in integrated graphics performance by 2010 – a goal dubbed 10X by 10, and one that Intel said it had achieved with the release of Sandy Bridge.

"I remember being here a few years ago with the 10X by 10 in Sandy Bridge, and I remember saying 'This is just the beginning'," he said. "I'll sit here and tell you today that Haswell is not the end. We're going to keep on going from here."

Intel's goal with the Haswell graphics architecture was – as is true with the compute cores as well – to improve performance without sacrificing power efficiency.

Like the compute cores, Haswell's graphics draw heavily from the graphics cores in the Sandy Bridge and Ivy Bridge architecture – a "similar microarchitecture with some embellishments on it," as Piazza put it. Some of those embellishments are designed to help performance – significantly so in media handling – hand-in-hand with power savings. The graphics subsystem also falls in line with the compute cores in that it focuses on both modularity and scalability.

Ah, if graphics and media engineering were only as simple as drawing pretty pink and blue boxes

As an example of the design's scalability, Haswell graphics will be offered in three flavors: GT1, GT2, and GT3 (with the same driver stack), with the highest level doubling many of the sub-elements of the graphics core, but doing so without increasing its power requirements in some scenarios, Piazza said.

One trick that Haswell uses to accomplish this is to decouple the clocks of the compute cores, the graphics core, and the ring bus that connects them – they were locked together in Sandy Bridge and Ivy Bridge. In those two architectures, if you wanted to "turbo up" the graphics core and the ring bus when extra performance was desired, you had to raise the voltage and clock speed of the compute cores as well – a power waster.

In Haswell, Piazza explained, "The ring is totally isolated – a separate domain – and we can move the ring up and down, the graphics up and down, and the CPU up and down independently."

If you're running a Haswell part with GT3 graphics, for example, you have so much additional graphics performance that you can run the graphics innards at lower power, raise the ring bus just a bit to take advantage of all of GT3's bandwidth, but drop the compute cores' power down to a lower voltage state to save juice.
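Why does dropping the compute cores' voltage and clock matter so much? A rough answer comes from the classic approximation that dynamic power scales with capacitance times voltage squared times frequency. The Python sketch below applies that rule of thumb to two made-up operating points. The voltages, frequencies, and function name are illustrative only and are not Haswell specifications.

```python
def dynamic_power(capacitance, voltage, frequency_ghz):
    """Classic CMOS approximation: power is proportional to C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency_ghz


# Made-up operating points in arbitrary units, for illustration only.
cpu_turbo = dynamic_power(1.0, voltage=1.1, frequency_ghz=3.0)
cpu_low = dynamic_power(1.0, voltage=0.8, frequency_ghz=1.5)

saving = 1 - cpu_low / cpu_turbo
print(f"Relative dynamic-power saving: {saving:.0%}")
```

Because voltage enters as a square, halving the clock while trimming the voltage cuts dynamic power by roughly three quarters in this toy case. A design that forces the compute cores to ride up with the graphics core gives all of that away.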

There are a host of activities going on in Haswell's scalable graphics core, even in its most basic form

The basic building blocks of the Haswell processor, as mentioned above, are similar to those in the two previous generations, but with the addition of what Piazza called "some goodies in there for new features." There is, for example, a new resource streamer that handles in hardware many of the buffer-management tasks that otherwise would be handled by driver software.

The GT3 implementation is particularly interesting – look for it in higher-end client Haswell chips when they start shipping, likely late next year or early in 2014. Compared with GT2, GT3 has twice the amount of raw pixel and shader performance clock-for-clock, more vertex throughput, twice the rasterization and pixel backend, and twice the Level 3 cache – two interleaved Level 3 caches, to be exact – to feed the beast with about half a terabyte of bandwidth within the chip.

Piazza would not be drawn out, however, on how many execution units (EUs) are in each of the three GT implementations. When asked how many general-purpose computing on GPU (GPGPU) single-precision flops the Haswell chips would be able to achieve, he demurred. "Now you're asking me to tell you how many shader units, etcetera," he said, "and I'm not going to disclose that."

He did, however, offer a hint. "Since we're not saying how many shaders, each of our shader units can do eight FMAs [fused multiply-adds – remember them from the compute core?] per clock."

Haswell's GT3 is like Wrigley's gum: double your pleasure, double your pixels, shaders, bandwidth...

In addition to these graphics tune-ups, Haswell also features a raft of media-processing improvements, including video-codec expansions and video- and image-processing enhancements, with – of course – power management and optimizations thrown in for good measure.

Haswell, for example, adds decoding for SVC – scalable video coding, an extension to the H.264 codec standard – to the AVC, VC1, and MPEG2 support that's in Ivy Bridge. SVC, said Intel Fellow Hong Jiang, is a "key enabler" for multi-participant video conferencing and streaming-media servers.

Also aboard Haswell are a motion JPEG (MJPEG) decoder for low-power USB-webcam video conferencing, and an MPEG2 hardware encoder useful for DVD creation and DLNA (Digital Living Network Alliance) digital media streaming – think music, photos, and video for home entertainment use.

Media enhancements include a dedicated Video Quality Engine (VQE) that adds color-gamut expansion – which Jiang said maintains color saturation and improves visual quality – plus tunable skin-tone image enhancement, frame-rate conversion, and image stabilization to the suite of video-quality enhancements already in Ivy Bridge. Those include such niceties as de-noise and de-interlace circuitry, as well as adaptive contrast enhancement and more.

Media improvements hew to Intel's marketing push: it's all about 'the experience'

As with the graphics core, the GT1, GT2, and GT3 versions have a rising range of media-handling capabilities, with the GT3 version having twice the media-sampling and VQE throughput of GT2.

Haswell also adds support for 4K video – which, depending upon the industry 4K standard you prefer, refers to video resolutions of 4096-by-2304, 4096-by-2160, or 3840-by-2160 pixels. That last one, by the way, is also known as QFHD (quad full HD), and was recently adopted by Sony for its 4K video.
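To put those 4K numbers in perspective, the short Python sketch below compares each resolution's pixel count with that of full HD (1920-by-1080). The dictionary labels and the helper name are made up for the example.

```python
FULL_HD = (1920, 1080)

FOUR_K_VARIANTS = {
    "4096x2304": (4096, 2304),
    "4096x2160": (4096, 2160),
    "QFHD 3840x2160": (3840, 2160),
}


def pixels(resolution):
    """Total pixel count for a (width, height) pair."""
    width, height = resolution
    return width * height


# How many full-HD frames' worth of pixels each 4K variant carries.
ratios = {
    name: pixels(res) / pixels(FULL_HD) for name, res in FOUR_K_VARIANTS.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x full HD")
```

QFHD comes out at exactly four times full HD, hence the "quad" in its name, while the two 4096-wide variants push somewhat more pixels per frame.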

Jiang included in his presentation a slide that showed the powers of Haswell's tunable skin-tone processing, which works its image alchemy by examining a pixel and its neighbors, calculating the likelihood that the pixel is displaying a skin color, then adjusting the pixel and its neighbors to display the amount of detail desired.

Haswell's skin-tone enhancements may seem subtle, but look closer – they're not

With this capability – which Intel in its engineering geekiness calls the Skin Tone Tuned Image Enhancement Filter – you can merely enhance the skin tones in your subject's image, crank them up with higher contrast, or smooth them out – great news for those aging subjects who might want their wrinkles de-emphasized.

Such candidates for facial de-emphasization would most certainly include your aging Reg reporter, who after days of sitting in darkened IDF technical sessions packed with high-level chipheads emerged feeling rather wrinkly, indeed – with said wrinkles appearing not only on his tired visage, but also in his equally exhausted brain. ®