Original URL: http://www.theregister.co.uk/2013/06/03/feature_inside_haswell_intel_4g_core/

Inside Intel's Haswell: What do 1.4 BEELLION transistors get you?

The brains and the brawn of the next Windows 8 slabs

By Tony Smith

Posted in Hardware, 3rd June 2013 12:39 GMT

Feature Intel’s Haswell processor architecture - formally called the fourth-generation Intel Core architecture, which is what the chip giant prefers we call it - has been in development for at least five years. Here's everything you need to know right now.

It first appeared on the company’s product roadmap in the summer of 2008 merely as a codename and a process size, 22nm, but you can bet early work began before that time.

In 2008, Haswell chips were scheduled to ship in the second half of 2012. Chip design is never straightforward, of course, and engineers need time to build on lessons learned when previous generations get a commercial release and are truly market-tested. So it has taken Intel’s engineers a little longer to get Haswell out of the door than they first thought would be the case, though, to be fair, by 2011 Intel was saying Haswell wouldn’t arrive until 2013.

Intel fourth-generation Core, aka Haswell

Inside the quad-core Haswell

And here, less than a month before the forecast first half of the year launch window is up, is Haswell. Intel unveiled mainstream and performance-enhanced mobile-aimed quad-core processors on its Core i5 and i7 lines, all with the customary accompanying IO chip, though more of its system logic than ever before now resides on the processor die itself.

Intel says it will have about 19 Haswell-based mainstream mobile processors out during the remainder of 2013, but it’s kicking off with five of them, all branded Core i7, and all four-core, eight-thread parts.

Lining them up in model number order, we have the 4800MQ, the 4850HQ, the 4900MQ, the 4930MX and the 4950HQ. The HQs have slower cores - 2.3GHz and 2.4GHz - but faster graphics, hence higher model numbers than the higher-clocked MQs sport. TurboBoost technology allows one or more cores to be clocked higher if that’s achievable within the chip’s thermal envelope.

The HQ chips come with Intel’s Iris Pro 5200 integrated graphics core, the MQs with the HD Graphics 4600 GPU. Different branding, yes, but the same basic core design, Intel engineers say, just with more execution units in the higher-numbered core.

Speeds and feeds

The MX has the 4600 GPU too, but its dynamic clocking can go a little higher than it can in the others. Like the 4900MQ, the 4930MX has 8MB of shared L3 cache - all the other debut mobile Haswells have 6MB - and it has a TDP of 57W. The others consume up to 47W. Prices run from $378 to $1096.

It’s broadly the same story with the quad-core CPUs that make up Intel’s debut desktop Haswell line-up. TDPs range from 35W to 84W among the i7s, with base clock speeds running from 2.0GHz to 3.5GHz, again able to go higher using TurboBoost. Most have HD 4600 graphics, but one, the 4770R, uses the Iris Pro 5200, though it only has 6MB of L3. All the others have 8MB.

Running up the line, we have the 4765T, the 4770T, the 4770S, the 4770, the 4770K and the 4770R. The K is unlocked - the R, S and T suffixes relate to... well, who can say? Intel uses the codes to indicate TDP and thus the form-factors these chips might be best slotted into: T is 45W; S and R, despite coming earlier in the alphabet, are both 65W, though the R has the better graphics; and K is 84W, as is the suffix-less 4770 - so the K actually indicates that the chip is unlocked rather than its TDP. Broadly, though, the later the letter, the lower the TDP.

Mainstream mobile M-class 4G Core i chips
Mainstream mobile H-class 4G Core i chips

Debut mobile Haswells: M series (top) and H-class CPUs

The R’s price has yet to be revealed, but the top-performing K costs $339 and all the others are priced at $303, which really does suggest the suffixes are more about form-factor suitability than anything else.

All the quad-core Haswell Core i5s have 6MB of L3 and all include the 4600 GPU. They run from 2.3GHz to 3.4GHz, each going higher still, if possible, using TurboBoost. But none of them support HyperThreading. They’re all 4670s too, again separated by a T (45W), S (65W), K or no suffix (84W).

The desktop chips are, perhaps, something of a sideshow - a box to tick. Intel is far more interested in the mobile side of the story, not surprisingly given that notebooks now outsell desktops by a fair margin and may well do so to an even greater degree if Haswell’s architectural benefits deliver the battery life boost Intel claims they will. Reducing power draw benefits desktop computers too - or, rather, the folk who pay for the electricity that powers them - but it’s not as important to desktop users as increasing battery life is to laptop owners.

Power cut

Back in 2011, Intel pledged Haswell would cut the power consumed by a laptop when its lid was closed to five per cent of previous levels, a promise it reiterated at IDF 2012 last September. Haswell employs a number of tricks in its bid to make good on that promise and deliver what the company claims is “Intel’s biggest increase in battery life generation on generation”. That means, it says, doubling or even tripling the battery life achievable with an Ivy Bridge chip, and a 20x stand-by power reduction compared to a Sandy Bridge machine.

The shift to a smaller production process, from Sandy Bridge’s 32nm to 22nm, introduced last year with the Sandy Bridge re-spin, Ivy Bridge, helps a lot here. Intel has refined its 22nm process further, cutting the current leaking from its tri-gate transistors to between a half and a third of what it was in Ivy Bridge, and dropping the voltage required to drive each transistor. Bringing more IO control onto the processor die - all the digital display interfacing is now handled by the CPU - helps cut the system’s overall power draw a bit, but Haswell goes further.

Haswell power management

First, it has an on-board voltage regulator, which Intel calls its “Fully Integrated Voltage Regulator (FIVR)”. This allows Haswell’s key sections - the CPU cores, the GPU, the IO and memory controllers, and the units that manage them all - to be fed from a single voltage input; previously there were five separate inputs, each fed from voltage regulators on the motherboard. Beyond reducing motherboard component counts, and thus cost and power draw, the main advantage of an integrated voltage regulator is faster power-state switching, because there’s less oscillation around the target voltage when the switch takes place.

Separate voltage planes mean that deeper sleep states can be reached too, by completely shutting down parts of the chip that aren’t required. “Everything that is not needed is turned off!” boast the engineers. The chip also transitions in and out of these states more quickly: Haswell is 25 per cent faster than Ivy Bridge in this regard, Intel has said in the past.

Placed on the chip, the FIVR can be closely tied into the processor’s power management system, which in Haswell uses the chip’s understanding of what it’s being asked to do in order to exploit more power-reduction opportunities. It can see, says Intel, that for a certain period of time parts of the chip - and even some system components - can be put into low-power states or turned off altogether to save energy. After the calculated duration, Haswell’s power manager - Intel calls it a Power Optimizer - wakes them up again.

Optimizer

Smarter power management isn’t merely central to making Haswell-based systems’ batteries last longer; it also makes them more useful and responsive in the way ARM-powered mobile devices can be: constantly picking up messages, even when sleeping, and ready for use in a near instant. That’s a crucial facility laptops must gain if they aren’t to be overshadowed by tablets any more than they already are.

Haswell delivers three power states that form a sub-set of the standard S0 ‘awake’ state that has been part of the ACPI (Advanced Configuration and Power Interface) specification for years. Instead of remaining fully awake in the S0 state, Haswell drops to the equivalent of existing S1-to-S3 sleep-mode power consumption levels but keeps an eye open, as it were, so it’s ready to handle user interaction. Think of it as a quick nap while the user pauses to read something on screen, or a deeper doze when the machine’s not being used. In each case the Power Optimizer can quickly bring everything back so the user doesn’t notice it hasn’t been entirely paying attention.

Haswell power management benefits

So when a laptop’s lid is closed, the system will drop to a power consumption level existing machines reach only when asleep - the S3 state - but it is nonetheless sufficiently awake to be ready to use by the time the user has lifted the lid, and to periodically poll for messages and other net-delivered data. Haswell delivers “idle power approaching [that of] tablet CPUs”, claims Intel.

Latency data from compatible system components allows the Haswell Power Optimizer to know how long it takes each of them to wake up, ensuring the whole system - or at least those parts of it required to give the user immediate operation - is ready at the same time. To gain the full benefit, then, notebook makers will have to equip their kit with components that can give Haswell that latency information, so expect Intel to exert even more control than it already does by certifying which components offer full Haswell Power Optimizer compatibility.

Haswell Power Optimizer

Latency information also helps the Power Optimizer schedule tasks to align them, resulting in bursts of activity bookended by longer periods of power-saving inactivity. Windows 8 already supports this kind of operation in order to keep its new UI’s live tiles updated, a technique Microsoft calls Connected Standby, but Haswell extends it to Windows desktop apps too and integrates it into its more modern sleep-state set-up.

Unlike Ivy Bridge, Haswell maintains separate clock signals for the cores, for the GPU, and for the L3 cache and the ring bus that links it to those other two modules. This means the cores can be kept slow when only the GPU needs to exchange data with the cache, saving power, albeit at some cost to cache performance. Conversely, because the cores’ frequency isn’t pegged to the GPU’s, slowing them down leaves power headroom to raise the GPU’s frequency if that’s required.

Order, order, you are out of order, sir!

Each core itself is largely the same as those found in Ivy Bridge processors. Intel has improved the front end - the bit that pulls in the x86 instructions that programs are compiled into before converting them into micro ops, which are the chip’s native instruction set - so it’s better able to anticipate in which direction upcoming branches in the code will take the execution stream, but Intel does this with every new generation of its architecture.

After conversion from x86, the micro ops are juggled into a new order that allows the core’s many instruction-processing engines to be kept well fuelled without (hopefully) breaking data dependencies in the original program, which can happen if an out-of-order action changes the value of a variable beyond what another instruction was expecting it to be. Haswell has more capacity than its predecessors to sort through the micro ops to see how many can be executed in parallel. It also has more core register space for temporary data.
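To see the sort of dependency that reordering has to respect, consider this purely illustrative C fragment - a sketch, not Intel code. In the first function no result depends on any other, so the core can spread the work across its execution units; in the second, each addition needs the previous one’s result, forming a chain that limits how much can run in parallel.

#include <stddef.h>

/* Independent operations: no element of c depends on any other, so an
 * out-of-order core can dispatch several of these multiplies and adds
 * to different execution units at once. */
void scale_add(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * 2.0f + b[i];
}

/* A dependency chain: each addition needs the previous value of sum,
 * so those adds cannot be reordered past one another without changing
 * the result - exactly the kind of dependency the scheduler must honour. */
float dot_product(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}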

Haswell buffers

Where Ivy Bridge had two 28-op buffers, one per thread, from which micro ops were routed to free maths units, Haswell cores have a single 56-op buffer with eight output ports rather than six, the better to keep as many micro ops flowing as possible.

The array of available maths units has again been tweaked to accommodate the loading that Intel’s modelling of real-world workloads suggests is most likely to be required. That’s a moveable feast, of course: new applications and uses may have come into play which the chip engineers didn’t take into account, or program patterns they did anticipate and design for may have fallen out of fashion.

Doubling the L1 and L2 cache bandwidth, by widening their access ports, and smartening up the core’s ability to cope with cache misses will help here, even though the cache sizes and structure remain unchanged. Meanwhile, Haswell adds new instructions - AVX (Advanced Vector eXtensions) 2 - to help handle multimedia data and the kind of numbers high-performance computing rigs crunch. Intel promises big performance gains in cryptography code, for instance.
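For a rough idea of what AVX2 brings - a minimal sketch, not Intel sample code - its new 256-bit integer operations let a single instruction work on eight 32-bit values at once, the kind of data-parallel arithmetic multimedia and HPC workloads lean on. The function name and the assumption that the array length is a multiple of eight are purely illustrative.

#include <immintrin.h>  /* AVX2 intrinsics; build with -mavx2 or equivalent */
#include <stdint.h>
#include <stddef.h>

/* Add two arrays of 32-bit integers, eight elements per instruction.
 * For brevity this sketch assumes n is a multiple of eight. */
void add_i32_avx2(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        __m256i vc = _mm256_add_epi32(va, vb);  /* 256-bit integer add, new with AVX2 */
        _mm256_storeu_si256((__m256i *)(out + i), vc);
    }
}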

Haswell execution units

Another new set of instructions, Haswell’s Transactional Synchronization eXtensions (TSX), helps programmers take advantage of the chip’s ability to spot situations where the locks established by one thread to prevent another overwriting its data are not actually necessary. In that case, the overhead of locking and subsequently unlocking the data can be removed by ignoring the locks - a technique called ‘lock elision’. With this ability, coders can insert lock code safe in the knowledge that if it’s not actually needed, there will be no performance hit. And they can add more locks without over-complicating their code.
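For a feel of how lock elision looks to a programmer, here is a hedged sketch using the RTM (Restricted Transactional Memory) side of TSX via the _xbegin/_xend intrinsics. The shared counter, the lock_taken flag and the bare-bones spinlock are illustrative assumptions, and production code would usually retry the transactional path a few times before falling back to the real lock.

#include <immintrin.h>  /* RTM intrinsics (_xbegin, _xend, _xabort); build with -mrtm */

/* A bare-bones spinlock used as the fallback path - illustrative only. */
static volatile int lock_taken = 0;

static void lock_acquire(volatile int *lock)
{
    while (__sync_lock_test_and_set(lock, 1))
        ;  /* spin until the lock is free */
}

static void lock_release(volatile int *lock)
{
    __sync_lock_release(lock);
}

void increment_counter(long *counter)
{
    unsigned status = _xbegin();        /* try to start a hardware transaction */
    if (status == _XBEGIN_STARTED) {
        if (lock_taken)                 /* someone really holds the lock: bail out */
            _xabort(0xff);
        (*counter)++;                   /* critical section runs with the lock elided */
        _xend();                        /* commit - no lock was ever taken */
        return;
    }
    /* Transaction aborted, or TSX unavailable: take the real lock instead. */
    lock_acquire(&lock_taken);
    (*counter)++;
    lock_release(&lock_taken);
}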

Haswell’s cores form only one small part of the chip’s die. A quad-core Haswell has a surface area of 177mm², but only a third of that is taken up with those four cores. The remaining two thirds are split roughly half and half between the GPU, and the system logic and caches. Together all these elements comprise 1.4 billion transistors.

Graphics

Haswell’s graphics core comprises a generic front end and back end, with one or more “slices” in between, each of which contains eight execution units plus associated caches and such. It’s the same architecture as Ivy Bridge, but expanded with a greater number of execution units and the addition of a new processing engine - a “Resource Streamer” in the jargon - to do a lot of the set-up work the CPU cores would once have handled. This increases the independence of the GPU - which, don’t forget, runs on a separate clock from the cores - so it can do more work without requiring the cores to be clocked up. The front end has been beefed up to keep the extra slices pumped with data.

Slices are independently power gated, by the way, so they can be shut down if they’re not needed.

Intel currently has four Haswell GPU variants: the HD 4600, HD 5000, Iris 5100 and Iris Pro 5200. The first contains a single slice; the rest have a second slice, essentially doubling the (undisclosed) number of execution units in the GPU. Slices work on individual groups of pixels on the screen, says Intel.

Haswell GPU

Slicing pixels: Haswell’s GPU architecture

The Iris Pro configuration will come in versions of Haswell that incorporate embedded RAM. The memory is in the chip package but not on the die, and Intel is not saying how much of it there will be. But it does say that the RAM cache is equally accessible to the chip’s cores and GPU through a low latency, high throughput connection. Intel engineers even claim it “enables discrete-class graphics performance”. No wonder Apple is rumoured to be particularly interested in the technology for future MacBook Airs and Pros.

Even without the extra RAM, Haswell’s GPU will be able to support three displays simultaneously through a mix of DisplayPort, HDMI and VGA external links and internal connections to a laptop’s LCD. Resolutions of up to 4096 x 2304 pixels at 24Hz are supported, but you’ll be able to do the slightly lesser 3840 x 2160 at 60Hz, which is 4K x 2K. On the API side, Haswell will support DirectX 11.1, OpenCL 1.2 and OpenGL 4.0.

Movie watchers may appreciate Haswell’s support not only for H.264’s Scalable Video Coding (SVC) feature - no H.265, though - but also for 4K x 2K playback. ®