Project Jackson – why SMT is the joker in the chip pack

Parallelism to the People

No, SMT can't remain the chip business' best-kept secret for very much longer. Simultaneous Multi Threading - the business of making one processor look like several to software applications - is being snuck into Intel's Foster, although officially Chipzilla won't say a word about Project Jackson - the multithreaded P4 - before it's due to be unveiled in the summer. But if you've grown weary of the treadmill of higher clock frequencies and process shrinkages every few months, then SMT will be real news. You do need to know about this stuff, for lots of interesting reasons. So here we cover a few...

Thanks to its supercomputer pedigree and DEC's early patronage, SMT has been the buzz-word du jour for at least, oh, five years. But it's a kind of cross-dresser: it cuts across the out-of-order and superscalar camps in deep microprocessor debates. Theoretically it's a disruptive technology. And not just theoretically...

For Intel, it also poses a delectable marketing problem, as SMT could extend the life of 32bit x86 for long enough to make IA-64 a distinctly unattractive proposition. Or at least, make its price/performance case look unattractive. That's a polite way of saying IA-64 might look pretty sucky for some time, if SMT'd IA-32 realises its potential. And there's little reason it shouldn't...

Finally, if this all sounds terribly esoteric, we'll point out that SMT chips could be sneaking into all kinds of strange areas, such as games and smart servers, pretty soon now. Which should cause no end of inflexion points for folk in the rack server or hosting businesses. So here's a rapid catch-up on SMT.

Life on Mars

Dr Burton Smith, today CTO at Cray, takes the credit for producing the first multithreaded systems back in the seventies. Smith tells us the first generally available paper on the subject was published in 1978, and the first silicon went into operation the following year, but the Pandora's Box had originally been opened by Cray:-

"The first multi threading machine was a peripheral processor Cray CDC 6600, but I was the first to put multi threading in the CPU," says Burton. The origins were various, he explains, and he certainly wasn't alone in pioneering exotic hardware. The Russians had also created a multithreaded supercomputer, the MARS-M [which was] "very secret, but architecturally, pretty wild" Architect of MARS-M.

In subsequent years Smith designed a variety of exotic multithreaded processor-based machines at Denelcor. All told, six were sold before Denelcor folded in 1985. Smith went on to found supercomputer manufacturer Tera in 1988; Tera took Cray off SGI's hands last year, and the MTA and MTA2 Smith helped design continue under the Cray brand.

Now, we're not exactly talking high volumes here, but the next breakthrough came from Smith's neighbours in Seattle, the University of Washington. A computer science PhD, Dean Tullsen, with his mentors Susan Eggers and Hank Levy, set about applying multithreaded thinking to commodity, industry-standard processors.

"This is what got the camel's nose under the tent," says Smith. "These papers convinced the microprocessor industry that this was not hard to do. In other words, they could thread without throwing away their investments in know how-creating superscalar chips.

"SMT's always had the reputation of being daunting - but really, it's pie simple. You just have to be brave and grasp the nettle!""

DEC takes a decko

Dean Tullsen takes up the story.

"In 1995 we playing with the general model of SMT without thinking how much you'd implement it," he told us. But his work had already come to the attention of Joel Emer at DEC by the end of 1994. Emer had shaped DEC strategy in the late 80s, in work that led to the Alpha chip. As a veteran of studying processor performance he helped tune the VAX chip, and studied under Ed Davidson at the University of Illinois, who'd himself published papers on fine grained multithreading. And he was impressed by the Tullsen/Eggers/Levy work.

"They really took a different slant on multithreading," Emer told The Register. "The idea in the paper was that multithreading doesn't have to sacrifice single stream performance. It was my first exposure to the idea of [a chip processing] whatever instruction from whatever thread"

Together with Rebecca Stamm of DEC, Emer and the Washington team published the follow-up paper in 1996, which applied SMT more specifically and practically. And lo, it was Compaq (who'd acquired DEC) who took the laurel for announcing the first mainstream multithreaded chip, the Alpha EV8, at Microprocessor Forum in Fall 1999.

Tullsen has subsequently consulted with Intel about introducing SMT - work which Intel VP Pat Gelsinger referred to a couple of weeks ago - but there's been no formal announcement by Intel that it will introduce multithreaded processors. And Dean, as he's worked with Intel, can't say anything about it.

So let's go back to the team's breakthrough. What did they do, exactly?

"What distinguished it from genuine multhireaded machines such as the Tera, and the coarse-grained machines such as Agrawal, was that we went for a more aggressive model," says Tullsen.

SMT isn't free. It adds extra hardware to a chip - multiple register files and program counters - which can account for up to 10 per cent extra silicon. But the benefits of a multithreaded chip can be realised pretty quickly:-

"It's about pushing every transistor every cycle, " says Eggers. "SMT achieves its high instruction throughput by using thread-level parallelism to compensate for low per-thread instruction-level parallelism."

So while a single-threaded CPU can choke on an instruction sequence, a multithreaded chip can fill the instruction queue with other waiting tasks. The deep pipelines of today's chips (a 19-stage pipeline is used in the P4) mean that many calculations end up being discarded, which is wasteful.

"SMT uses instructions from the other threads to fill out the issue width. Or if a thread encounters a long-latency instruction, SMT can use instructions from other threads to prevent processor stalling," says Eggers.

Or as a friendly reader describes the difference: "In regular superscalar you have a pool of resources in various stages of completion, and each is tagged to identify which calculation it belongs to - which register it's impersonating. Whereas in SMT you have a pool of resources each tagged to say which thread they belong to and which virtual CPU's registers they impersonate."
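To make that concrete, here's a toy cycle-by-cycle simulation - our own sketch, with invented names and parameters, not any real microarchitecture - of a 4-wide machine issuing from one thread versus four. Every second instruction is a long-latency op that stalls its thread for five cycles.

```python
from collections import deque

ISSUE_WIDTH = 4  # instructions the core can issue per cycle

def make_thread(n_instr, stall_every, stall_cycles):
    """A thread as a queue of instruction latencies: every
    `stall_every`-th instruction stalls its thread for `stall_cycles`."""
    return deque(stall_cycles if i % stall_every == 0 else 0
                 for i in range(1, n_instr + 1))

def run(threads):
    """Each cycle, fill up to ISSUE_WIDTH slots round-robin from any
    un-stalled thread; a stall blocks only the thread that caused it."""
    blocked = [0] * len(threads)   # cycles until each thread is ready again
    cycles = issued = 0
    while any(threads):
        cycles += 1
        slots = ISSUE_WIDTH
        for i, t in enumerate(threads):
            if blocked[i]:         # thread waiting on a long-latency op
                blocked[i] -= 1
                continue
            while slots and t:
                latency = t.popleft()
                slots -= 1
                issued += 1
                if latency:        # stall: stop issuing from this thread
                    blocked[i] = latency
                    break
    return issued / cycles         # instructions per cycle

single_ipc = run([make_thread(100, 2, 5)])
smt_ipc = run([make_thread(100, 2, 5) for _ in range(4)])
```

With one thread, the issue slots sit mostly empty while it waits out its stalls; give the same machine four threads to draw from and the simulated IPC roughly quadruples.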

In simulations, says Eggers, an 8-wide superscalar microprocessor executing a web server workload such as Apache can produce on average 1.1 instructions per cycle (IPC). SMT quadruples that, achieving 4.6 IPC. Commercial databases show a similar improvement: in research conducted by Eggers' team, a typical transaction processing workload achieved 2.3 IPC on SMT, against only 0.8 IPC on the superscalar.
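As a quick sanity check on those figures (taken straight from the numbers quoted above), the implied speedups work out like this:

```python
# Speedups implied by Eggers' simulation figures as quoted in the text
apache_speedup = 4.6 / 1.1   # SMT vs 8-wide superscalar, Apache workload
oltp_speedup = 2.3 / 0.8     # SMT vs superscalar, transaction processing

print(round(apache_speedup, 1))  # 4.2 - "quadruples", near enough
print(round(oltp_speedup, 1))    # 2.9
```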

But with chips cheap, and multiprocessors a commodity, why go to the trouble of making expensive investments in the chip itself? Isn't it cheaper to write a better compiler?

It's not that simple, says Tullsen. "The memory problems have not been solved. It's hard to parallelize code perfectly." In a typical parallel architecture there are hundreds of clock cycles between threads. But with SMT "you can begin to parallelize things you could never parallelize before. It's just taking advantage of the natural parallelism that's already out there," he says.

Cache prizes

And surely that means the long, power-hungry pipelines of the most recent vintage are costly baggage? Baggage that can get pretty hot? At the recent ISSCC conference Gelsinger didn't disown those design decisions, but strongly suggested they weren't really scalable for the future. "No one wants to take a nuclear reactor on a plane with them," he said, crushing the hopes of those of us who've always dreamt of carrying a portable nuclear reactor onto planes... but we digress.

Gelsinger pointed out that chips would have to get smarter, rather than more extravagant and clumsier, with their guesswork. Smith's hardware, remember, had no on-chip cache at all, gambling that the processor could juggle so many instruction streams at once that the latency of slow and expensive trips out to memory could always be hidden.

We wondered whether SMT might make cacheless chips viable, before a friendly expert (who shall remain nameless) pointed out some economics:-

"To go wholly cacheless like Tera is a huge leap and that's because say, a 2GHz CPU is running roughly 100 times faster than the memory access time. And that's assuming it is not sharing access to the memory chips..."

Cache is cheap, too, with a very low ratio of active transistors that are largely underworked, "so the only current flowing is the miserly trickle, since they don't fully switch off."

"So if you take away the cache and force all the threads to stall each time they touch memory then you need something north of 20 threads to keep all the balls in the air."

"On scientific workloads a good compiler can find those, as Tera claim, but on general purpose computing (like Intel or Compaq/Alpha are interested in) you will be lucky to find eight."

Thwarted by compilers again, huh? (Not really).

Secondly, what really burns power is wires, not transistors. On-chip cache isn't as expensive (cost-wise) as it would appear, and it's also cheaper (heat-wise) to keep hitting this cache than to take a trek out to main memory.

And low power is a biggie amongst the multithreading folk. Mario Nemirovsky's XStream Logic is a venture to create low-cost, low-power multithreaded chips for network appliances such as routers. Meanwhile, Compaq's Alpha and, eventually, Intel's Foster will most assuredly be sold in SMPs.

New boots for old

Where SMT really starts to be fun is in new applications. Both Burton Smith and Susan Eggers pointed out that SMT could help with new stuff that, right now, is too expensive to think about. Eggers suggests that SMT's high instruction throughput could allow software developers to build helper threads into server apps, which could pre-fetch data or do performance monitoring. In some niches, this could pay dividends.
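In software terms, Eggers' helper-thread idea looks something like the sketch below - a hypothetical `prefetch_worker` that warms a cache ahead of the main thread, with ordinary OS threads standing in for SMT's hardware threads.

```python
import threading
import queue

cache = {}            # shared between the main and helper threads
todo = queue.Queue()  # keys the main thread will soon need

def slow_load(key):
    return key * key  # stands in for an expensive fetch

def prefetch_worker():
    """Helper thread: pull keys and warm the cache before they're needed."""
    while True:
        key = todo.get()
        if key is None:           # sentinel: shut down
            return
        cache[key] = slow_load(key)
        todo.task_done()

keys = list(range(8))
helper = threading.Thread(target=prefetch_worker)
helper.start()
for k in keys:                    # announce upcoming work
    todo.put(k)
todo.join()                       # helper has prefetched everything
todo.put(None)
helper.join()

results = [cache[k] for k in keys]  # all cache hits on the main thread
```

On an SMT chip the helper would run in an otherwise-idle hardware thread, so this warming comes nearly for free.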

And Smith suggests even more practical uses, such as rendering games graphics, or speech processing.

It's probably worth pointing out that the threading we're talking about with SMT isn't the direct equivalent of the threads you might be familiar with as a software developer. Sure, a threaded OS can increase the parallelism, but an SMT chip will grab whatever it's offered - even from coarse-grained Unix processes. If an OS (or Java) can parallelize these instruction sequences, then so much the better. But to a software application, SMT just behaves like a faster chip.

It's going to be intriguing to see how real-world SMT silicon matches up to the sims. Foster isn't due to be revealed until the summer, although the specifications have already been distributed. Alpha EV8 is late, and not expected in systems until next year. Which could make the XStream the first to ship gen-u-wine silicon. ®

Related Links

University of Washington Simultaneous Multithreading home page

Related Stories

Intel's Jackson will offer 2 chips for 1
Intel Foster secrets revealed
