The Register® — Biting the hand that feeds IT

Compute-core competencies

Although Haswell's compute cores are based on those used in the now-familiar 32nm Sandy Bridge and its follow-on 22nm Ivy Bridge, Intel has made a number of changes that should improve performance.

"The first thing to look at on the performance side," said Intel engineer Singhal, "is 'What are we doing for the software that exists today?'" One important factor in this effort, he said, was to keep all the compute-core pipelines essentially the same as they have been through Sandy Bridge and Ivy Bridge.

Intel has, however, made what he characterized as "significant" changes within the cores, including increasing the depth of the core's buffers, which gives the cores more flexibility in Haswell's out-of-order execution – the chip's ability to better optimize the flow of instructions and data, which also helps use more parallelism in execution.

Haswell has also improved branch prediction – the chip's attempt to accurately guess the correct path for data flow before it actually knows which way it will proceed in if-then-else processing. The more accurate a chip's branch prediction, the less frequently it will need to start a branch over if it guesses wrong, which wastes time.

And wasting time, in contemporary chip engineering, is worse than wasting money – it's wasting power.

According to Singhal, branch-prediction improvement is "something we tend to do every generation." Additionally, the fact that the execution pipelines haven't been lengthened from the ones in Sandy Bridge and Ivy Bridge ensures that the "do over" time for an incorrect branch prediction is also not lengthened.
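For the curious, here is a minimal sketch of how a branch predictor's accuracy translates into wasted cycles. It uses the textbook two-bit saturating counter, which is almost certainly far simpler than whatever Intel ships in Haswell, and the `penalty_cycles` value is a made-up illustrative figure rather than anything Singhal disclosed.

```python
def simulate_two_bit_predictor(outcomes, penalty_cycles=15):
    """Run one 2-bit saturating counter over a sequence of branch outcomes.

    Counter states 0-1 predict "not taken"; states 2-3 predict "taken".
    penalty_cycles is a hypothetical "do over" cost, not a Haswell figure.
    Returns (mispredictions, cycles wasted on mispredictions).
    """
    counter = 2  # start in the "weakly taken" state
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            mispredictions += 1
        # Nudge the counter toward the real outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return mispredictions, mispredictions * penalty_cycles


# A loop branch: taken nine times, then falls through once, repeated 100 times.
loop_pattern = ([True] * 9 + [False]) * 100
misses, wasted_cycles = simulate_two_bit_predictor(loop_pattern)
print(f"{misses} mispredictions out of {len(loop_pattern)} branches, "
      f"{wasted_cycles} cycles wasted")
```

The wasted-cycle total is simply the miss count multiplied by the per-miss penalty, which is why a pipeline that stays the same length keeps the cost of each wrong guess from growing.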

Slide from Intel Developers Forum 2012 providing details of Intel's 4th Generation Core Processor, codenamed 'Haswell'

Haswell may be based on previous architectures, but it has plenty of new tricks up its silicon sleeves

Instruction buffers at the top end of Haswell's pipeline have been enlarged, which Singhal says will help the performance of apps that have a large code footprint, "which we're seeing become more and more common." They'll also help to improve Haswell chips' chances of initiating efficient code-execution parallelism – more on that in a moment, as well.

Sandy Bridge and Ivy Bridge chips can execute six operations per clock cycle; Haswell increases that number to eight. One of the two new "ports" hosts an additional integer arithmetic logic unit (ALU), which – you guessed it – provides another place to accomplish integer arithmetic or logical operations.

A new address-storing port has been added, as well, which will free up some of that duty from two existing ports.

Two new ports – no waiting – plus the efficient fusing of multiply and add functions

One key newcomer to Haswell's microarchitecture is a pair of fused multiply-add (FMA) floating-point units. One slick thing about FMA is that because it computes a floating-point multiply and add as a single operation, it not only saves clock cycles but also rounds the result once, rather than twice as would be required if the operations were done separately. This single rounding improves mathematical precision.
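To see why rounding once beats rounding twice, here is a small sketch that emulates a fused multiply-add in software. It does the arithmetic exactly with Python's `fractions.Fraction` and rounds only at the end; the function names are ours, chosen for illustration, and this models the arithmetic rather than the hardware.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate FMA: compute a*b + c exactly, then round to a double once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # the only rounding step


def separate_multiply_add(a, b, c):
    """Ordinary arithmetic: round after the multiply, then again after the add."""
    return a * b + c


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# The exact product is 1 - 2**-60, which needs more bits than a double has.
print(separate_multiply_add(a, b, c))  # the product rounds to 1.0 first
print(fused_multiply_add(a, b, c))     # keeps the tiny -2**-60 term
```

With separate operations the product is rounded to exactly 1.0 before the add, so the small term vanishes and the answer comes out as zero; the fused version keeps it.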

In addition to these improvements over the existing Sandy Bridge and Ivy Bridge architectures, Haswell also introduces some new technologies and capabilities. Some of these new goodies, however, will require some software changes. "Ideally," Singhal said, "just a recompile, but in other cases a re-optimization of the code."

Is it just us, or do ports 0, 1, and 5 seem rather more well-rounded in their abilities?

One of these new capabilities is a set of compute instructions called AVX2, a beefing up of the AVX (advanced vector extensions) instructions introduced in Sandy Bridge – and no, Haswell's AVX2 has nothing to do with that other AVX2 that you content-creation types may know of from Avid.

Simply put, AVX – and now AVX2 – instructions are extensions to Intel's long-established SSE instructions that can operate on multiple data words in parallel, using a single instruction.

Here's a geek-cred tidbit: SSE is a bit of a nested acronym – or initialism, for you purists among us – that stands for "streaming SIMD extensions"; SIMD stands for "single instruction, multiple data." (SIMD, by the way, is a true acronym, pronounced "sim-dee".) Drop that into conversation at your next cocktail party and watch members of the opposite sex swoon at your erudition. Or of your own sex, should that be your preference. Whatever.

What you need to know about AVX2 is that it will allow clever coders to speed up floating-point-intensive applications by using FMA to double the peak number of both single-precision and double-precision floating-point operations per clock cycle per core. According to Intel, expect noticeable speed-ups in such applications as image and audio/video processing, scientific simulations, financial analytics, and 3D modeling and analysis.
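The doubling is easy to check on the back of an envelope. The sketch below is our own arithmetic, not an Intel formula, and it rests on two assumptions not spelled out in the article: that AVX registers are 256 bits wide, and that the pre-Haswell cores pair one vector multiply unit with one vector add unit, each good for one operation per lane per cycle.

```python
def peak_flops_per_cycle(vector_bits, element_bits, units, flops_per_lane):
    """Peak floating-point operations per clock cycle for a single core."""
    lanes = vector_bits // element_bits  # elements packed into one register
    return lanes * units * flops_per_lane


VECTOR_BITS = 256  # assumed AVX register width

# Assumed Sandy Bridge / Ivy Bridge layout: one multiply unit plus one add unit.
sandy_sp = peak_flops_per_cycle(VECTOR_BITS, 32, units=2, flops_per_lane=1)
sandy_dp = peak_flops_per_cycle(VECTOR_BITS, 64, units=2, flops_per_lane=1)

# Haswell: two FMA units, each doing a multiply and an add per lane.
haswell_sp = peak_flops_per_cycle(VECTOR_BITS, 32, units=2, flops_per_lane=2)
haswell_dp = peak_flops_per_cycle(VECTOR_BITS, 64, units=2, flops_per_lane=2)

print(f"single precision: {sandy_sp} -> {haswell_sp} flops/cycle/core")
print(f"double precision: {sandy_dp} -> {haswell_dp} flops/cycle/core")
```

These are peak figures; real code only approaches them when its multiplies and adds pair up neatly into FMA operations.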

If you're an old Intel hand, you'll appreciate the vast improvement of AVX2 over 1996's Pentium with MMX

AVX2 improvements don't stop with the FMA-enabled doubling, however. There are also new instructions for data permutation and shuffling, and a new gather instruction allows you to load data from multiple non-contiguous places in memory, reducing latency and freeing you from having to perform pesky hand-coded data bookkeeping.
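In spirit, a gather does what the following sketch does: it takes a base position and a vector of offsets, and pulls the scattered elements into one packed vector. This is a plain-Python model of the semantics only; the function name and the pixel data are invented for illustration, and the real instructions also take details such as a mask and a scale factor that are omitted here.

```python
def gather(memory, base, indices):
    """Model of a vector gather: load memory[base + i] for every index i."""
    return [memory[base + i] for i in indices]


# Hypothetical interleaved pixel data: R, G, B, R, G, B, ... for eight pixels.
pixels = []
for n in range(8):
    pixels.extend([10 + n, 20 + n, 30 + n])

# Eight offsets, one per vector lane, each striding over a whole RGB triple.
red_offsets = [3 * lane for lane in range(8)]

reds = gather(pixels, 0, red_offsets)    # every red value in one operation
greens = gather(pixels, 1, red_offsets)  # same offsets, base shifted by one
print(reds)
print(greens)
```

Without a gather, the same job means eight separate scalar loads plus the shuffling needed to pack them into a vector register, which is the hand-coded bookkeeping the new instruction spares you.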

Finally, AVX2 goes beyond AVX in that its 256-bit vector instructions can also handle integer data, not just floating-point data. Your humble Reg hack is not completely clear as to exactly why that's so nifty – although I could venture a semi-educated guess or three – but Singhal's IDF audience seemed duly impressed.


Software decently written?

"The key here is that most software is actually fairly decently written"

He must be joking...

Pentium 4 was deemed as having very poor performance because to take advantage of it, software needed to be "fairly decently written", and compiled with a decent compiler. The problem is that to date there is only one compiler worth a damn for x86 - Intel's own (ICC).

I did some performance testing a while back:

http://www.altechnative.net/2010/12/31/choice-of-compilers-part-1-x86/

Clock-for-clock, with crap compilers (GCC, PGCC) Pentium 4 is about 40% slower than Pentium 3. But with ICC, Pentium 4's performance actually goes up by 20%, clock-for-clock, compared to a Pentium 3.

It's not just down to software being decently written (which it isn't a lot of the time) - it's also down to the compiler doing a decent job (which most don't). On one hand, one could argue something along the lines of: "Pentium 4 didn't suck - you were merely too stupid to use it properly." Unfortunately, this is way, way beyond the average consumer to either understand or do anything about and it is the consumer's perception that decides whether a product is going to be a success or failure.


Wait ... what?

"Intel CEO Paul Otellini has called "the third pillar of computing," security – the other two pillars being energy efficiency and internet connectivity."

This twat is a marketard, not an engineer ... he wouldn't know the difference between ones & zeros if he got 'em under his carefully manicured fingernails. He's part of the (current) reason Intel is heading for the bit-bucket.

By way of reference, the real three pillars of computing are memory, IO, and CPU ...


"We thought that you, inordinately intelligent and tech-savvy Reg reader, might enjoy a deep dive into their handiwork."

Have you read all the comments on the register? :P

