Happy 40th birthday, Intel 4004!
The first of the bricks that built the IT world
On November 15, 1971, 40 years ago this Tuesday, an advertisement appeared in Electronic News for a new kind of chip – one that could perform different operations by obeying instructions given to it.
That first microprocessor was the Intel 4004, a 4-bit chip developed in 1970 by Intel engineers Federico Faggin, Ted Hoff, and Stanley Mazor in cooperation with the Japanese company Busicom (née the Nippon Calculating Machine Corporation) for that company's adding machines.
Busicom held the rights to the 4004 in 1970, but released them to Intel in 1971. Intel then offered the world's first microprocessor for sale, and 40 years later that world is a very, very different place.
At the time, only the most far-thinking futurists could have imagined the 4004's impact. For starters, the chip itself wasn't all that impressive. It ran at 740kHz, had around 2,300 transistors that communicated with their surroundings through a grand total of 16 pins, and was built using a 10-micron process.
Exactly how far have we come in process technology since the 4004? Well, as your Reg reporter once calculated, if the width of an Intel 2nd Generation Core CPU's 32-nanometer process were expanded so that it could be spanned by an unsharpened No. 2 Ticonderoga pencil, the 4004's 10-micron (10,000nm) process, equally expanded, would be wide enough to fit an 18-wheeler followed by a half-dozen 1962 Cadillac Eldorados and a Smart Car.
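The scale factor behind that comparison is easy to check. The sketch below assumes an unsharpened pencil is roughly 7.5 inches long, a figure the calculation above doesn't state; the names are ours, and the result is a ballpark, not a survey.

```python
# Linear scale between the 4004's process and a 32nm process.
PROCESS_4004_NM = 10_000   # 10 microns, from the article
PROCESS_CORE_NM = 32       # 2nd Generation Core, from the article
PENCIL_INCHES = 7.5        # assumption: approximate length of an unsharpened pencil


def scale_factor(old_nm: float, new_nm: float) -> float:
    """How many times wider the old feature is than the new one."""
    return old_nm / new_nm


ratio = scale_factor(PROCESS_4004_NM, PROCESS_CORE_NM)
expanded_feet = ratio * PENCIL_INCHES / 12  # inches to feet

print(f"linear ratio: {ratio}x")
print(f"4004 feature at pencil scale: about {expanded_feet:.0f} feet")
```

Under that pencil-length assumption, the 4004's feature width stretches to a little under 200 feet, which is roughly a truck, six big Cadillacs, and a very small car parked nose to tail.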
To say that microprocessors have changed radically over the past 40 years is to utter an empty truism. What's far more interesting is to take a look at the way in which those changes have evolved: the problems encountered, the decisions made, the discoveries ... well ... discovered.
And so to review Intel's 40-year journey from the 4004 to today, The Reg contacted two Intel Senior Fellows who have been responsible for a good chunk of how their company's offerings have grown from the 2,300-transistor 4004 to the over-two-billion-transistor 2nd Generation Intel Core i7-3960X released Monday morning.
We spoke with Steve Pawlowski, who has been intimately involved with a good portion of Intel's microarchitectural development since the early days, and Mark Bohr, who heads up Intel's process architecture and integration efforts.
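For a sense of what that growth implies, here is a back-of-the-envelope sketch of the average doubling time. It assumes steady exponential growth between the two endpoints and uses two billion as a floor for the newer chip, since the text says only "over two billion"; the function name is ours.

```python
import math


def doubling_time_years(start_count: float, end_count: float, years: float) -> float:
    """Average years per doubling, assuming steady exponential growth."""
    doublings = math.log2(end_count / start_count)
    return years / doublings


# 2,300 transistors in 1971; at least 2 billion in 2011 (a lower bound).
t = doubling_time_years(2_300, 2_000_000_000, 40)
print(f"{t:.2f} years per doubling")
```

That works out to a doubling roughly every two years. Because the newer count is a lower bound, the true average is slightly shorter.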
We learned a lot, such as the fact that for the first 30 years or so, there really weren't all that many challenges in process development. "Most people would say that the period from 1971 until the early 1990s – actually, even to the end of the 1990s – that 30-year period was really the golden era of traditional, classic transistor scaling," Bohr told us.
In those days, the materials used in processors hardly changed – transistors were based on silicon dioxide for the gate insulator, or dielectric, and doped polysilicon for the gate electrode. "We were simply scaling," Bohr said, and with that scaling came reductions in power needs, and continual improvements in transistor densities and performance.
'We just plain ran out of atoms'
But that relatively straightforward sequence of improvements didn't last. In the early 2000s, Bohr told us, "Traditional scaling ran out of steam." The problem was that as process sizes got smaller and smaller, circuits leaked more and more power relative to the number of electrons doing useful work.
Or, as Bohr put it, "We just plain ran out of atoms." When you're talking about gate oxides, he explained, a 1.2nm deposit is only about six atomic layers thick.
At that point, Bohr said, it became clear that it was necessary to investigate, develop, and introduce what he called "revolutionary features", such as strained silicon, then high-k metal gate, and most recently what Intel calls tri-gate transistors and much of the rest of the world calls FinFET structures.
But we're getting ahead of ourselves. On the architectural side, there were plenty of developments underway while the process engineers were busily scaling down the chips' transistors.
Pawlowski wasn't in on the earliest days of the 4004's morphing into the 8-bit 8008 of 1972, and then the development of the much more capable 8-bit 8080 in 1974, which was the microprocessor that really got the ball rolling.
The 8080 wasn't alone, though – there was plenty of competition in the earlier days, such as the Zilog Z80, Motorola 6800, and MOS Technology 6501, which Pawlowski told us were all essentially equal competitors at the time.
"The 8080 was essentially just a simple processor," he told us, "but it had a program counter, it had these nice, wonderful eight registers that we have today, the eight-bit registers. Then the 8085 was an extension of that – it was essentially a 5-volt part."
Pawlowski's first baby at Intel was the 8086, which had a 16-bit external bus, unlike its compatriot, the 8088, which had an 8-bit external bus.
The 8088's claim to fame was that IBM chose it for its groundbreaking IBM Personal Computer – aka the Model 5150 – which it introduced in 1981. According to Pawlowski, IBM chose the 8088 because its 8-bit external bus was compatible with peripherals that had been developed for the smaller-market 8080 and the 8085.
After the 8086/8 came the 80286, which Pawlowski described as not a ground-breaking departure, but rather "just a better architecture than the 8086." The 80286 still required a math coprocessor, the 80287. Unfortunately, Pawlowski remembers, "The 286 added some interesting things, like with the math coprocessor they added an interrupt field which clobbered some of the old interrupt fields in the 8086."
Not all progress is linear.
Change you can believe in
While the 80286 was essentially an update to the 8086, the "real change" came with the 32-bit 386, Pawlowski said.
"The beauty of it is that it went to large segments," Pawlowski said. "So instead of having the typical 64k segment architecture, they actually could go the full flat address space and go to four gigs."
As he recalls it: "The big problem we were facing with Motorola and the 68K – which was the competition at the time – was they had a flat address space and we were segmented, because that was the architecture we'd chosen to build the 8086-based architecture on."
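The contrast Pawlowski describes can be sketched in a few lines. What follows is a minimal illustration of the well-known 8086 real-mode address calculation, where a 16-bit segment is shifted left four bits and added to a 16-bit offset, set against the 32-bit offset that gives the 386 its flat four-gigabyte space. The constant and function names are ours.

```python
SEGMENT_LIMIT_8086 = 1 << 16   # 16-bit offset: 64K bytes per segment
FLAT_LIMIT_386 = 1 << 32       # 32-bit offset: 4 gigabytes


def real_mode_address(segment: int, offset: int) -> int:
    """8086 real-mode physical address: segment * 16 + offset."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    # The 8086 has 20 address lines, so the sum wraps at 1MB.
    return ((segment << 4) + offset) & 0xFFFFF


print(hex(real_mode_address(0x1234, 0x0010)))
print(SEGMENT_LIMIT_8086, FLAT_LIMIT_386 // SEGMENT_LIMIT_8086)
```

The second line of output shows the gap: one flat 386 segment covers as much memory as 65,536 of the 8086's 64K segments laid end to end.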
Pawlowski worked on the first Multibus board built for the 386. On that board, his team added a 64K direct-mapped cache in front of the 386. "It wasn't integrated inside the part," he told us, "but it was a 16MHz clock, and so we were getting to the point where we were starting to see some of the stress points of the memory architecture – memory access patterns, which were 150 nanoseconds."
But with the 64K direct-mapped cache, "We did some pretty nifty little things," he said. "And it ran 16-bit code really well, so that was the real success."
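For readers who haven't met one, a direct-mapped cache gives every memory address exactly one slot it is allowed to occupy. The toy below is a sketch under assumptions: the 64K size comes from Pawlowski's account, but the 16-byte line size is our guess, and the class tracks tags only rather than modelling the actual Multibus board. For scale, a 16MHz clock ticks every 62.5 nanoseconds, so a 150-nanosecond memory access costs about 2.4 cycles.

```python
CACHE_BYTES = 64 * 1024   # 64K, from the article
LINE_BYTES = 16           # assumption: line size is not stated in the article
NUM_LINES = CACHE_BYTES // LINE_BYTES


class DirectMappedCache:
    """Toy direct-mapped cache that tracks tags only, not data."""

    def __init__(self) -> None:
        self.tags = [None] * NUM_LINES  # one tag slot per cache line

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, install the line and return False."""
        line_number = address // LINE_BYTES
        index = line_number % NUM_LINES   # the only slot this address may use
        tag = line_number // NUM_LINES    # which memory block occupies the slot
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag            # evict whatever was there
        return False


cache = DirectMappedCache()
print(cache.access(0x0000))    # miss: cold cache
print(cache.access(0x0004))    # hit: same 16-byte line
print(cache.access(0x10000))   # miss: 64K away, maps to the same slot
print(cache.access(0x0000))    # miss: the previous access evicted it
```

The last two accesses show the design's trade-off: lookups are simple and fast, but two addresses exactly 64K apart keep evicting each other.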
The 386 was the chip around which Intel started building motherboards. When the 486 came along, it integrated that motherboard cache into the chip itself, and it also integrated the math coprocessor in the 486DX version. The 386 had still relied on the separate 387 chip – and, yes, there was a 386DX, but that designation had nothing to do with an on-chip FPU.
After the 386 and the 486 came not the 586, but instead a chip that was rechristened by the Intel marketing department as the Pentium, and was built using a new microarchitecture known internally as P5.
"That became the first superscalar machine," Pawlowski told us, superscalar being the term of art that describes a processor that has more than one concurrent execution sequence, or pipeline.
"That's where we actually had multiple execution units," he said. "Not necessarily the same, but the scheduler was at least smart enough to look inside the machine, if it had to do an add, had to do a multiply, potentially some type of fetch, or some other type of instruction, it could actually look for places where it could get more locality out of the instruction, out of the machine itself."
Playing catch-up with Motorola
The Pentium's superscalar nature was playing catch-up with Motorola, which had offered superscalar chips for some time. According to Pawlowski, the reason that Intel hadn't moved to a superscalar architecture earlier was that the jump from 16-bit to 32-bit mode, while making sure that all existing 16-bit code ran swimmingly, was enough to keep Intel's engineering team occupied.
"At some point in time you don't want to bite off too much," he told us, "otherwise you're going to run into so many problems."
And problems did dog the P5, at least at first. There was, for example, an FPU bug that was the butt of many a joke, and the early 0.8-micron parts were roundly criticised for their toastiness – a problem that dissipated as the P5 architecture was moved to smaller processes and lower voltages.
Although the P5 had introduced superscalar architecture to the Intel line, Pawlowski contends that it was the P6 design effort, begun in the early 1990s, that was the greatest achievement of that period.
"I contend that the success of that part," he said, "was because it brought in people that hadn't built the traditional lineage of x86 components" – architects such as Bob Colwell, Dave Papworth, and Mike Fetterman. "Those guys really made that machine," Pawlowski told us.
"There was a big argument between the Pentium and the P6 group, because the Pentium group felt that, 'Hey, that's probably not going to work, that's a huge step, x86 compatibility is going to really be tough'," he recalls.
"One of the reasons that I was brought into the program," he said, "was because I built PCs. In a lot of cases the individuals that were working in that program – because they were non-Intel or they hadn't been exposed to the PC side of the market – well, their feeling was 'We don't have to worry about being compatible, we're doing something new and different'."
That argument didn't cut it. "At the end of the day we said, 'You're going to be a PC, so you better get used to it'," he told us. "So what we did, in the group I was in, was we brought PC compatibility to the part." And x86 compatibility has remained a core tenet of Intel's chip development since.
Well, there is that little thing called the Itanium, but we digress.
You're either on the bus or you're off the bus
One major advance in the P6 architecture was the frontside bus. Before P6, interfaces between processors and the rest of the system were processor-specific. A true system bus, Pawlowski said, understands global addressability and not just processor I/O but system I/O, as well, and offers the opportunity to gang more than one processor and maintain cache coherency.
The P6's frontside bus used Gunning transceiver logic from Xerox, which was able to scale well and continue to work as voltages declined. "We only thought it would last two generations, maybe two processor generations," Pawlowski said. Instead, it lasted for about a decade.
Another big step for the P6 architecture was out-of-order execution. "It had the reorder buffer," Pawlowski said. "It was able to look at more than three or four instructions at a time. Even if it could only decode and retire maybe three instructions at a time, it was able to have, potentially – gosh, if I remember right – I'm going to say 36 ... instructions that potentially could be in flight at any one time."
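Why a bigger window of in-flight instructions helps can be shown with a deliberately crude model. The sketch below is not the P6: it ignores dependencies and execution-unit limits, and the latencies are invented for illustration. It takes only two numbers from Pawlowski's hedged recollection, a width of three instructions per cycle and 36 buffer entries. Instructions enter the buffer in program order, finish whenever their latency elapses, and retire in program order.

```python
from collections import deque
from typing import List


def simulate(latencies: List[int], rob_size: int = 36, width: int = 3) -> List[int]:
    """Return the cycle in which each instruction retires (toy model)."""
    rob = deque()        # entries are (instruction index, cycle it finishes)
    retired = {}
    next_instr = 0
    cycle = 0
    while len(retired) < len(latencies):
        cycle += 1
        # Retire finished instructions from the head, in program order.
        count = 0
        while rob and count < width and rob[0][1] <= cycle:
            index, _ = rob.popleft()
            retired[index] = cycle
            count += 1
        # Allocate new instructions while the buffer has room.
        count = 0
        while next_instr < len(latencies) and len(rob) < rob_size and count < width:
            rob.append((next_instr, cycle + latencies[next_instr]))
            next_instr += 1
            count += 1
    return [retired[i] for i in range(len(latencies))]


slow_then_fast = [10, 1, 1, 1]   # one slow instruction followed by three quick ones
print(simulate(slow_then_fast))               # roomy buffer
print(simulate(slow_then_fast, rob_size=1))   # one entry: effectively in-order
```

With the roomy buffer, the quick instructions complete while the slow one is still grinding away, and the last of the four retires in cycle 12. With a single entry, nothing can start until the slow instruction retires, and the last one doesn't leave until cycle 14.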
The P6's upgrades, he told us, helped that architecture achieve "performance improvements way above what we were getting with Pentium and the superscalar machine."
But perhaps the most radical – and radically effective – improvement in the P6 architecture, and one that helped out-of-order execution as well, was the translation of IA instructions into smaller, more granular micro-operations, or µops, which were more easily dispatched through the P6's out-of-order, superscalar architecture.
As Pawlowski told us, "As I keep telling people today, 'We really do binary translation in hardware in these machines'." The beauty part of binary translation, he said, is that such binary translation to µops can work with different architectures while still keeping full IA compatibility.
"You've got the flexibility of changing the underlying machine," he said, and then rattled off some of those changes. "Every process generation and processor generation, we add better branch prediction, we may add different functional units like the trace cache that was added on Willamette [the first Pentium 4] ... larger vector units, adding a vector unit with AVX and then continuing to extend that, looking at ways to elide locks and make your locks faster but still maintain the semantics of locks because that's what programmers still use, but try to get the speed and limit the impact of contention so that we can just continually improve the processor performance."
All of those changes are more easily accomplished, Pawlowski said, in a processor that has full binary translation – and that's one of the things that the P6 brought to the party.
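The idea of cracking one instruction into several µops can be illustrated with a toy decoder. Everything below is hypothetical: the instruction syntax, the µop names, and the three-way split are ours for illustration, and none of it reproduces Intel's actual µop encodings, which Pawlowski didn't describe.

```python
from typing import List, Tuple

Uop = Tuple[str, ...]


def decode(instruction: str) -> List[Uop]:
    """Split a toy 'add dest, src' instruction into simpler µops."""
    op, dest, src = instruction.replace(",", " ").split()
    if op != "add":
        raise ValueError(f"unsupported instruction: {op}")
    if dest.startswith("["):
        # Memory destination: read, modify, and write back as separate steps.
        return [
            ("load", "tmp", dest),
            ("add", "tmp", "tmp", src),
            ("store", dest, "tmp"),
        ]
    # Register destination: a single µop is enough.
    return [("add", dest, dest, src)]


for uop in decode("add [counter], eax"):
    print(uop)
```

The point of the exercise is that the scheduler only ever sees the small, uniform steps, so the machinery underneath can change from one generation to the next while the instruction on top stays the same.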
P6 lasted for three generations – the Pentium Pro, Pentium II, and Pentium III – but it was to make a comeback.
Feeling the strain
But before it did, there was work to be done on process technology, and the introduction of the first of the three major post-scaling technologies that Mark Bohr talked about: strained silicon.
In a highly simplified nutshell, strained silicon involves the material being stretched – or strained – in such a way as to pull the individual silicon atoms apart from one another. Doing so frees up the electrons and holes in the material, increasing their mobility substantially, thus allowing for lower-power transistor designs.
Although strained silicon had been under investigation at MIT and elsewhere, the early techniques were biaxial – that is, the entire silicon lattice was stretched. Intel's breakthrough was the development of uniaxial stretching. Biaxial straining was good for nMOS but bad for pMOS, both of which need to be balanced for good transistor performance.
Biaxial straining also had problems with source drain and defects, Bohr told us – "not a very manufacturable technology". The uniaxial approach, however, could be applied "just to the pMOS device," Bohr said, "and it didn't have any significant yield issues, so it turned out to be both a high-performance solution and a good manufacturing solution."
But back to the departure and then the return of P6.
The follow-on architecture to P6 was NetBurst, and it was not exactly Intel's finest hour. By the time P6 had evolved into the Pentium III, its pipeline was just 10 stages long; NetBurst doubled that to 20 stages in the Willamette Pentium 4 in 2000, and increased that "Hyper Pipelined Technology" to 31 stages in the Prescott Pentium 4 in 2004 – which, by the way, was the first processor to use Bohr's 90nm strained silicon process technology.
According to Pawlowski, the reason for the deeper pipeline was "frequency, frequency, frequency". In a bit – well, more than a bit – of an oversimplification, deep pipelines require higher frequencies to achieve the same performance as architectures with shorter pipelines.
Seduced by the 'Megahertz Myth'
When NetBurst was introduced, the market had been taught to salivate when the high-clock-rate bell was rung. When we asked Pawlowski if Hyper Pipelined Technology and its high clock speeds were a marketing decision, he said, "It may have been a marketing decision, but that's what people bought at the time."
And the power required to goose those clock rates wasn't that big a problem at that time. "We were within a decent power envelope," Pawlowski told us. "The power envelope wasn't pushing 130 watts, maybe they were 40, 50, 60 watt parts." That said, he acknowledged that the message Intel wanted to send to the market was "'Hey, we've got the fastest gigahertz part'. That's what people were looking for."
There was also the fact, he admitted, that since the P6 architecture was such an improvement over P5, expectations for generation-to-generation performance improvements had been raised – including his own.
"When you get to the next part, you're kind of looking for 'How do we repeat history and do the same thing over and over again?'," he said. "You get spoiled, and you tend to get a little more aggressive, and you tend to think 'If this is important to me, then it must be important to the market'."
Unfortunately, the market had other ideas. "It wasn't until our customers said to us, 'We're not pushing socket power beyond 130 watts' – in the server space; in client it was certainly lower – 'We're not pushing that socket power any higher' that we had to have a wake-up call," he said.
There was an additional wake-up call, as well. "[AMD's] Opteron came out with a much more power-efficient architecture," he said. "They didn't focus on megahertz, but they got reasonable performance."
There was also the fact that the market was becoming more mobile, and NetBurst parts were unsuited for the cramped insides and relatively low-power capabilities of laptops and notebooks.
These were not the best of times for Intel. "I gotta admit," Pawlowski said, "when I left the labs and came to the product group, it was brutal, because in 2005 when we were really at the dip, at the low spot of where our architecture was competitively, because we were still pushing megahertz."
To make matters worse, he was getting needled about the competition. "I got the question, 'Why didn't you guys integrate the memory controller? How did little AMD just beat you guys to it?'" His response was: "They had nothing to lose. They really didn't."
Fortunately, as Pawlowski tells it, Intel's Israeli design team was working on a P6-based part in an attempt to integrate an on-die memory controller with a Rambus memory subsystem. That part never came to fruition, but some of the project's P6 refinements made it into the Pentium M, code-named Banias, and the Core microarchitecture, which helped salvage Intel's mobile future.
Getting high, m'K?
Although the disintegration of the P6-Rambus project turned out to be good for Intel in the long run, "It taught us one big lesson: you don't integrate new memory on your processor. You make sure that that memory technology is stable, and that you have a good supply from the memory industry because you don't want to impact shipping your processors."
That worked out just fine during the introduction of Nehalem in 2008, he said. "DDR3 was ready to go when Nehalem was getting ready to ship, so that we weren't inhibited because of memory supplies and we were able to ship Nehalem parts."
Among its many enhancements, Nehalem brought with it the QuickPath Interconnect (QPI) to replace the frontside bus, and an integrated on-die memory controller – another holdover from the Israeli P6-cum-Rambus work.
From Pawlowski's point of view, Nehalem was an architectural change equivalent in depth and breadth to the move from P5 to P6 – not surprising, considering that the Core architecture was an outgrowth of P6.
"Once Nehalem came along and just built on the Banias architecture and integrated the memory controller," he said, "it was a sweet part."
We asked Pawlowski which was more important to Nehalem's success: QPI or the integrated memory controller. "I'd like to say that it was QPI because I was running the group that designed it," he told us, "but it was the integrated memory controller. Getting the memory latencies down from the average of probably 180, 190 nanoseconds on a frontside bus down to 60 to 80 nanoseconds or 105 nanoseconds was huge."
Nehalem was a 45nm part, and a follow-on to the first 45nm parts – code-named Penryn – which introduced the second of Bohr's process improvements, high-k metal gate transistors. High-k metal gate arrived with the first Penryn chips, the Xeon and Core 2 processors, which appeared in late 2007.
"High-k" means having a high dielectric constant, a measure of how well an insulator stores charge. In the case of Intel's implementation, the high-k material is hafnium-based.
"We had to come back to the gate-oxide issue," Bohr told us, referring to the leakage caused by increasingly thin oxide layers. "We needed another improvement, and high-k metal gate was that improvement."
For Intel, he said, the high-k metal gate solution provided more than one benefit. "Number one, it allowed us to thin the electrical thickness of the gate oxide – it's physically thicker, but being a high-k material it has increased gate capacitance, which means it has improved transistor performance," he said.
"The other benefit is that we changed the gate electrode from polysilicon to a metal material," Bohr continued, "and that helped to improve transistor performance." The metal alloys used in these electrodes are different for the pMOS and nMOS devices, and remain Intel trade secrets.
"The third benefit," Bohr said, "is that we did the high-k metal gate transistors using a new process flow called 'gate last', where we first make the transistors using normal polysilicon gate electrodes, and so everything is kind of the same – patterning the gate electrode, forming the source drain. But then in the midsection of the flow, we strip out the sacrificial polysilicon gate electrode and then replace it with the high-k and the metal gate material."
Not only does this new manufacturing process make possible the construction of the high-k metal gate transistors, Bohr says, it also helps to enhance strain. "When you pull out the sacrificial gate electrodes," he said, "then the source-drain regions are free to push or squeeze the channel area, and you get more strain out of the channel and then more performance."
One process change, three benefits – not too shabby.
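Bohr's first point, a film that is physically thicker yet electrically thinner, follows from the textbook parallel-plate capacitor relation. The worked example below uses the standard dielectric constant of silicon dioxide, about 3.9; the high-k value of 20 and the 3nm film thickness are illustrative assumptions of ours, not Intel figures.

```latex
% Gate capacitance per unit area for a dielectric of constant k and thickness t
\frac{C}{A} = \frac{k\,\varepsilon_0}{t}

% Equivalent oxide thickness (EOT): the SiO2 thickness giving the same capacitance
t_{\mathrm{EOT}} = t_{\text{high-}k} \cdot \frac{k_{\mathrm{SiO_2}}}{k_{\text{high-}k}}

% Illustrative numbers (assumed): a 3 nm film with k = 20
t_{\mathrm{EOT}} = 3\,\mathrm{nm} \times \frac{3.9}{20} \approx 0.59\,\mathrm{nm}
```

Under those assumptions, a 3nm film behaves electrically like well under a nanometer of silicon dioxide, while remaining several times thicker than the 1.2nm oxide Bohr described as only about six atomic layers.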
The next phase in Bohr's triumvirate of process-technology improvements – Intel's recently announced tri-gate or "3D" transistor structure – will first hit the streets when the company's 22nm "Ivy Bridge" chips ship next year.
But 2012 will be the beginning of the microprocessor's next 40 years, and this Tuesday we're hoisting a pint to the processor that kicked off the first 40: the Intel 4004.
But we're not going to stop with a nostalgic wallow in 40 years of Chipzillian ups and downs. Both Pawlowski and Bohr filled us in on a few things they see coming in the next forty years.
Of course, the predictions get a wee bit less solid as the future decades roll on, but The Reg will pass them on to you in a future article. But here's one teaser: how about a computing system that can harvest and reuse the energy used to power its transistors, rather than merely expending it?
"There's a lot of fun stuff going on," Pawlowski told us. We'll tell you about it soon.
Stay tuned. ®