Pentium 4 castrated to keep costs down
Chip as launched a mere shadow of its former self
Not even Intel could afford to build Willamette, aka Pentium 4, as it was originally designed, a senior Intel engineer admitted yesterday.
Darrell Boggs, Intel's principal engineer for the desktop platform group, told delegates at the Micro-33 conference in Monterey that the original P4 design would have been too large to build economically at 0.18 micron.
The design was changed to minimise power consumption and to squeeze the die size to that of the original Pentium Pro, EE Times reports.
Originally, it was to have a megabyte of L3 cache, two fully functional floating-point units and a bigger execution trace cache and L1 cache. But it soon dawned on Intel that the chip size had grown too large. And this would mean serious problems for power consumption, complexity and testing, Boggs said.
"The general trend has been to make the CPU larger in physical area," he told EE Times. "But anytime you have a large die size, that means you have to have many fabs. You can become capacity-constrained unless you build a new fab."
The initial spec for Willamette was "one slow ALU, two fast ALUs, two arithmetic address-generation units, two floating-point units, 16KB of L1 cache, 12,000 instructions of execution trace cache, 128KB of L2 cache, 1MB of external L3 cache, an allocator/register renamer and a bus architecture".
But for manufacturing cost reasons, the company had already decreed that the first 0.18 micron Pentium 4 die could not be any bigger than the first P6 microarchitecture processor, the Pentium Pro, which was launched in 1995.
"If the first ones are large," said Boggs, "the next ones are going to have the propensity to be large. It was a very big issue for us."
Intel had to rip out some of the hardware and compromise performance to meet its die size and power goals.
"There was tremendous die area and power associated with the floating-point units," Boggs said. As a result, the FPUs were modified to simply move data rather than execute MMX, SSE and SSE2 instructions. The upshot was a five per cent hit on performance, but the floating-point real estate was squeezed to less than half its former size, he added.
Intel hobbled the execution trace cache, which was designed to compensate for P4's long instruction pipeline by caching only decoded micro-ops. As a compromise, Intel left the size of the trace cache at 12,000 instructions and developed a micro-op compression algorithm enabling micro-ops to be stored using fewer bits.
"That meant the execution trace cache had essentially the same performance but less die size," Boggs claimed.
The 1MB L3 cache was to be on a separate chip, as in the Pentium II and early Katmai PIIIs. But this would have meant an expensive cartridge to house it, similar to the Slot One design currently being phased out.
Boggs finished his keynote by stating that all the obvious methods for improving processor performance have now been tried.
"The low hanging fruit is all gone," he said. "Now we have to build scaffolds around the tree. We'll stand on our head and do strange things for a little more performance." ®