The Register® — Biting the hand that feeds IT

Caching in, big-time

It's all well and good to have increased computing capabilities, but as Singhal said, "In order to actually use those functional units, you have to be able to feed them, you have to be able to get data to them." And that's where Haswell's improved caches come into play.

It's not that Haswell's Level 1 and Level 2 caches are any larger than those of Sandy Bridge and Ivy Bridge – they're not. What has been improved is the chip's ability to get data into and out of those caches more quickly.

Slide from Intel Developers Forum 2012 providing details of Intel's 4th Generation Core Processor, codenamed 'Haswell'

You gotta like the improvement curve since 2008's Nehalem. Ah, how the time does fly

As with Sandy Bridge and Ivy Bridge, Haswell's L1 instruction and data caches remain at 32KB each, and its L2 cache remains at 256KB. What's different is the performance that Haswell delivers from those caches: each of the L1 data cache's two load ports and its store port has doubled in width to 32 bytes, or 256 bits – that's a lot of bits being made available to the execution units per cycle.

This is a boon to AVX2: Haswell can sustain two full AVX2 reads and a full AVX2 write in a clock cycle, according to Singhal.
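
To put rough numbers on Singhal's claim, here's a back-of-the-envelope sketch in Python. It assumes the 256-bit (32-byte) AVX2 vector width, takes the "doubling" mentioned above at face value to infer the previous generation's port width, and plugs in a purely hypothetical 3GHz clock. The function and constant names are ours, not Intel's.

```python
# Peak L1 data-cache traffic per core, per clock cycle.
AVX2_VECTOR_BYTES = 256 // 8   # one full AVX2 register: 32 bytes
ASSUMED_CLOCK_GHZ = 3.0        # hypothetical clock speed, for illustration only


def peak_l1_bytes_per_cycle(loads, stores, port_bytes):
    """Bytes moved per cycle if every load and store port is busy."""
    return (loads + stores) * port_bytes


# Haswell, per Singhal: two full AVX2 reads plus one full AVX2 write per cycle.
haswell = peak_l1_bytes_per_cycle(loads=2, stores=1, port_bytes=AVX2_VECTOR_BYTES)

# Previous generation: same port count, half the width (inferred from "doubling").
previous = peak_l1_bytes_per_cycle(loads=2, stores=1, port_bytes=AVX2_VECTOR_BYTES // 2)

print(f"Haswell:  {haswell} bytes/cycle")    # 96 bytes/cycle
print(f"Previous: {previous} bytes/cycle")   # 48 bytes/cycle

# bytes/cycle multiplied by GHz gives GB/s (10^9 bytes per second).
print(f"Haswell peak at {ASSUMED_CLOCK_GHZ}GHz: {haswell * ASSUMED_CLOCK_GHZ:.0f} GB/s per core")
```

These are peak, per-core figures. Real code only gets there if it actually issues two loads and a store on every single cycle.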

In addition to these bandwidth improvements, Singhal pointed to Haswell's improved cache-line latency and the elimination of cache conflicts in the new microarchitecture. "Previously there were cases when two loads may not be able to read the cache simultaneously because of the banking architecture that we used," he said. "We've removed those restrictions."

The take-away? Haswell can get more data to its execution units without those workhorses having to waste time and power twiddling their digital thumbs. The result? Better performance and better power savings.

Slide from Intel Developers Forum 2012 providing details of Intel's 4th Generation Core Processor, codenamed 'Haswell'

A thick pile o' numbers, to be sure, but if you care about cache performance, we wager you'll like what you see

The L1 cache, as described above, was not the only beneficiary of Intel engineering love. The L2 cache was improved, as well. In Sandy Bridge and Ivy Bridge, a line could be read from the cache every other clock cycle; now it takes just one clock cycle to do the same.

Before we step out of Haswell's compute cores and sidle on over to its graphics core, there's one more new bennie to talk about: what Intel calls transactional synchronization extensions, or TSX.

TSX works to improve parallel processing. "We have parts today that we're shipping on the client side that go up to eight [execution] threads," Singhal said. "On the server side we support up to 20 threads per socket today, so if somebody puts together a four-socket system, they're supporting already up to 80 threads – and of course core counts will continue to go up on the server side."

To take advantage of those parallel threads, you of course need to develop parallel software. Duh. But that task can be a complete pain in the yinger, especially if all of those threads are working on the same data set in conjunction with one another.

Enter TSX, which endeavors to move the work of low-level optimization from the code writer over to the hardware upon which their workloads are running.

I won't dig deep into the intricacies of TSX except to say that it involves two technologies: hardware lock elision (HLE) and restricted transactional memory (RTM). Both require coders to insert what are essentially lock and unlock commands into their software that tell Haswell when to search out parallelism opportunities that are not explicitly written into the code itself.

Slide from Intel Developers Forum 2012 providing details of Intel's 4th Generation Core Processor, codenamed 'Haswell'

Okay, okay, okay – parallel programming is hella hard, so how about some hardware help?

According to Singhal, HLE is more suitable for legacy code: developers simply add the XACQUIRE and XRELEASE prefixes to their existing locking instructions as appropriate. RTM's XBEGIN and XEND instructions, on the other hand, give devs more wiggle room. "[RTM provides] a little more flexibility, but a little more work, as well," he said.
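
For a feel of the shape that transactional code tends to take, here's a minimal software analogy in Python. It does not use TSX: Python can't issue XBEGIN or XEND, and the hardware tracks conflicts for you rather than relying on a version counter. What it does show is the general pattern of trying the work optimistically, committing only if nobody else got in first, and dropping back to a plain lock after a few failed attempts. Every name here (`SharedCounter`, `try_transaction`, `increment`, `max_attempts`) is hypothetical and ours alone.

```python
import threading


class SharedCounter:
    """A value plus a version stamp that is bumped on every update."""

    def __init__(self):
        self.value = 0
        self.version = 0
        self.lock = threading.Lock()  # guards commits and the fallback path


def try_transaction(counter):
    """One optimistic attempt. Returns True on commit, False on abort."""
    seen_version = counter.version   # "begin": note the state we started from
    new_value = counter.value + 1    # speculative work, no lock held
    with counter.lock:               # "end": validate and publish in one step
        if counter.version != seen_version:
            return False             # another thread committed first: abort
        counter.value = new_value
        counter.version += 1
        return True


def increment(counter, max_attempts=3):
    """Retry optimistically, then fall back to an ordinary locked update."""
    for _ in range(max_attempts):
        if try_transaction(counter):
            return "transaction"
    with counter.lock:
        counter.value += 1
        counter.version += 1
    return "fallback"


def worker(counter, n):
    for _ in range(n):
        increment(counter)


counter = SharedCounter()
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)  # 4000: every increment lands exactly once
```

The fallback path matters. A transaction is allowed to fail, so there must always be a conventional locked route that is guaranteed to make progress.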

Before we leave the compute cores and move on to Haswell's graphics and media enhancements, let's take a quick look at one last bit of compute-core goodness: advancements in the microarchitecture's ability to help shore up what Intel CEO Paul Otellini has called "the third pillar of computing," security – the other two pillars being energy efficiency and internet connectivity.

"Today, cryptography is huge," said Intel CPU architect Bret Toll at one IDF technical session. "It's very important. Every time you get on the web and do any kind of transaction, it gets encrypted and decrypted, sometimes multiple times."

Those cryptographic functions take time and processor power, so to speed them Intel has added new crypto-supporting instructions to the Haswell microarchitecture and improved existing architectural features that support encryption and decryption. The wide vectors in AVX2 provide more cryptographic oomph, as well.

Slide from Intel Developers Forum 2012 providing details of Intel's 4th Generation Core Processor, codenamed 'Haswell'

A bright, shiny dime to the first Reg reader who can decode each and every entry in this alphabet soup

Toll told his audience that although past microarchitecture generations had seen improvements in support for encryption and decryption, "As you can see, on Haswell I think we've hit it out of the park with security."

Software decently written?

"The key here is that most software is actually fairly decently written"

He must be joking...

Pentium 4 was deemed as having very poor performance because to take advantage of it, software needed to be "fairly decently written", and compiled with a decent compiler. The problem is that to date there is only one compiler worth a damn for x86 - Intel's own (ICC).

I did some performance testing a while back:

http://www.altechnative.net/2010/12/31/choice-of-compilers-part-1-x86/

Clock-for-clock, with crap compilers (GCC, PGCC) Pentium 4 is about 40% slower than Pentium 3. But with ICC, Pentium 4's performance actually goes up by 20%, clock-for-clock, compared to a Pentium 3.

It's not just down to software being decently written (which it isn't a lot of the time) - it's also down to the compiler doing a decent job (which most don't). On one hand, one could argue something along the lines of: "Pentium 4 didn't suck - you were merely too stupid to use it properly." Unfortunately, this is way, way beyond the average consumer to either understand or do anything about and it is the consumer's perception that decides whether a product is going to be a success or failure.

Wait ... what?

"Intel CEO Paul Otellini has called "the third pillar of computing," security – the other two pillars being energy efficiency and internet connectivity."

This twat is a marketard, not an engineer ... he wouldn't know the difference between ones & zeros if he got 'em under his carefully manicured fingernails. He's part of the (current) reason Intel is heading for the bit-bucket.

By way of reference, the real three pillars of computing are memory, IO, and CPU ...

"We thought that you, inordinately intelligent and tech-savvy Reg reader, might enjoy a deep dive into their handiwork."

Have you read all the comments on The Register? :P

Re: that hidden message

Darker-skinned folks' faces are OK to start with.

Re: AVX2 on integers

The AVX2 (long long) integer operations will be in there for cryptographic processing.

The really interesting bit is the transactional memory TSX extensions (IBM’s is already well along the curve). TSX should be a big kicker for TP and HPC, but writing software to take advantage of it is going to take a big paradigm shift away from garbage collection to Bedouin memory management.
