Oracle's Sparc T4 chip: Will you pay Larry's premium?
Oomph for one thread. Or many
Hot Chips The Sparc T4 processor that Oracle expects to ship before the end of December for its entry-level and mid-range server platforms is probably the most important chip that either Sun or Oracle has put into the field since the dual-core UltraSparc-IV+ "Panther" arrived in October 2004.
A lot of business is at stake – and so is the premise that Oracle can create integrated systems that provide benefits that can command a premium price.
The Panther chips saved Sun's cookies as the company killed off the "Millennium" UltraSparc-V processors and partnered with Fujitsu to deliver quad-core Sparc64 processors and a unified server line. The "Yosemite Falls" Sparc T4 chips are all Sun/Oracle, and they're about baking an entirely new kind of cookie that Oracle hopes will sell well, supporting Oracle's own database, middleware, and application software.
At the Hot Chips 23 conference at Stanford University late last week, Robert Golla, senior hardware architect at Oracle, and Paul Jordan, senior principal hardware engineer at the company, walked through some of the feeds and speeds of the upcoming Sparc T4 chip. The company promised at Oracle OpenWorld last fall that the chip would be shipping before the end of 2011, keeping pace with the roadmap that Oracle created in the wake of its January 2010 acquisition of Sun for $7.6bn. Oracle revealed that roadmap in August 2010, showing a steady cadence of hardware and operating system updates.
The Sparc T4 chips are presumably timed to hit the market with the impending Solaris 11, which has been in the making for more than six years and which presumably has been tuned to take every advantage of the Sparc T4 chips. The original Sun roadmap had an eight-core, eight-threaded Sparc T series chip coming out in the second half of 2011 for machines with one to four sockets, implemented in a 40 nanometer process from Taiwan Semiconductor Manufacturing Corp. This Yosemite Falls processor was supposed to run at 2.5GHz and be based on a new Sparc T core code-named "VT," presumably short for "Virtual Threads," but neither Sun nor Oracle has said what VT is short for (probably not Vermont).
Oracle has been hinting that this new VT core, which we learn from the Hot Chips presentation is now being called S3, has a feature called the critical thread API. This feature allows a high-priority application to grab one thread on a core and hog all of the resources on that core to significantly boost performance of that single thread; the other seven threads on the core get told to sit tight. In the S1 and S2 cores, used in the earlier Sparc T1, T2, T2+, and T3 processors, the threads were hard coded and their sharing algorithms were set in stone, or etched in silicon, to be more precise.
Thread speed at the double
According to the presentation by Golla and Jordan at Hot Chips, the Sparc T4 will have double the per-thread performance of the T3 chip it will replace. This stands to reason since the Sparc T4 will have half as many cores, at eight, and will run at nearly twice the clock speed, with a design speed of 3GHz or higher. (The Sparc T4 chip will likely come in variants that run slower than this design speed.) And because of the critical thread API support, it will offer significantly better single-threaded performance. Oracle has also goosed the performance of the on-chip cryptographic units.
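As a rough sanity check on the "nearly twice the clock speed" claim, here is a minimal Python sketch. The 1.65GHz figure for the Sparc T3 is an assumption brought in from outside this article (it is the clock the T3 shipped at), and the 3GHz figure is only the stated design speed, so treat the result as illustrative rather than as a performance prediction.

```python
# Core counts come from the article: the T4 has eight cores, half as many as the T3.
T4_CORES = 8
T3_CORES = T4_CORES * 2

# Assumption: T3 clock of 1.65GHz is not stated in the article.
T3_CLOCK_GHZ = 1.65
# The article gives 3GHz as the T4 design speed; shipping parts may be slower.
T4_DESIGN_CLOCK_GHZ = 3.0

# Ratio of clock speeds, which is one ingredient of per-thread performance.
clock_ratio = T4_DESIGN_CLOCK_GHZ / T3_CLOCK_GHZ

print(f"T3 cores: {T3_CORES}, T4 cores: {T4_CORES}")
print(f"Clock ratio T4/T3: {clock_ratio:.2f}x")
```

Clock speed alone does not account for a full doubling of per-thread performance; the article attributes the rest to the redesigned S3 core.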
The Sparc T4 chip will have eight of the S3 cores, laid out thus:
Oracle's "Yosemite Falls" SPARC T4 processor
The Sparc T4 chip is implemented in the same 40 nanometer process from TSMC that is used for the current Sparc T3 chips, so yields should be pretty good if the design is clean. (The Sparc T3 chips launched in September 2010 for blade and rack servers spanning one, two, and four sockets and were added to SuperCluster and Exalogic clusters in December 2010.)
Each S3 core has a dual-issue, 16-stage integer pipeline that supports out-of-order execution and delivers around 5X the performance of the S2 core on the SPECint2006 benchmark. The S3 core has an 11-cycle floating point pipeline that offers up to 7X the performance of the S2 core on the SPECfp2006 test. The eight threads on the S3 core can be dynamically added or subtracted, depending on the per-thread performance level demanded by the applications.
Each S3 core has its own 16KB L1 instruction and 16KB L1 data cache, plus its own 128KB of L2 cache. It also includes special hardware optimizations for Oracle software, including user-level cryptographic instructions, a pause instruction, and a fused compare-branch instruction.
The SPARC T4 chip has a 4MB on-chip L3 cache that is shared by the eight cores and carved up into eight banks with 16-way set associativity. (You can see the banks above and below the crossbar switch at the center of the chip.) The T4 chip has two dual-channel DDR3 memory controllers that support 1.07GHz memory, two PCI-Express 2.0 x8 controllers, and two 10 Gigabit Ethernet interfaces, all etched into the silicon.
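To make the L3 numbers concrete, the following Python sketch works out what a 4MB, eight-bank, 16-way cache implies per bank. The 64-byte cache line size is an assumption (the article does not give one), and the function name `cache_geometry` is purely illustrative.

```python
# Assumption: 64-byte cache lines. The article does not state the line size.
LINE_BYTES = 64


def cache_geometry(total_bytes, banks, ways, line_bytes=LINE_BYTES):
    """Derive per-bank capacity, line count, and set count for a banked cache."""
    bank_bytes = total_bytes // banks
    lines_per_bank = bank_bytes // line_bytes
    sets_per_bank = lines_per_bank // ways
    return {
        "bank_bytes": bank_bytes,
        "lines_per_bank": lines_per_bank,
        "sets_per_bank": sets_per_bank,
    }


# Figures from the article: 4MB of L3, eight banks, 16-way.
l3 = cache_geometry(4 * 1024 * 1024, banks=8, ways=16)

print(f"Per-bank capacity: {l3['bank_bytes'] // 1024}KB")
print(f"Lines per bank:    {l3['lines_per_bank']}")
print(f"Sets per bank:     {l3['sets_per_bank']}")
```

Under that line-size assumption, each bank holds 512KB, which works out to 512 sets of 16 lines apiece. A different line size would change the set count but not the per-bank capacity.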
The whole shebang has around 855 million transistors, which makes it fairly small by enterprise chip standards these days. (This can be an asset, not a liability, especially when you are talking about performance per watt.)
The Sparc T3 chip, by contrast, did not have an L3 cache at all, either on or off the chip, and this no doubt affected its performance detrimentally. The S2 core in the Sparc T3 chip had 8KB of L1 data cache and 16KB of L1 instruction cache and 6MB of L2 cache shared by the cores. It had the same two PCI-Express 2.0 x8 and two 10GE interfaces. Both the Sparc T3 and T4 chips have electronics in their crossbar interconnect to allow for processors and their main memory to be gluelessly connected for two-way or four-way symmetric multiprocessing (SMP).
Eye on two prizes
Ultimately, what Oracle needs from the Sparc T4 chips is competitive performance on a wider range of workloads, not just applications, like middleware and databases, that like many threads. Plenty of Sparc/Solaris applications out there in the Sun user base are relatively monolithic and need more cache, more clock speed, and a greedier thread than prior Sparc T chips could deliver. The hybrid, dynamic thread approach that Oracle has cooked up seems to be up to the task, based on early performance benchmarks:
Relative performance of Sparc T series chips
As you can see from this chart, Sun/Oracle has done a good job of evolving multithreaded performance as gauged by the TPC-C online transaction processing benchmark. As the chip moved from the 32 threads of the Sparc T1, with its S1 cores, to the 128 threads of the Sparc T3, with its S2 cores, the multithreaded performance of a single chip increased by a factor of about 3.75. That's about as good as it gets in terms of multiple threads.
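The "about as good as it gets" remark can be checked with some quick arithmetic, sketched here in Python. The thread totals and the 3.75 factor come from the article; the split of the T1 into eight cores with four threads each is added from outside the article, and "scaling efficiency" is just an illustrative name for speedup divided by the growth in thread count.

```python
# (cores, threads per core) for each chip.
# T3 and T4 figures follow from the article; the T1 split is an outside fact.
chips = {
    "T1": (8, 4),
    "T3": (16, 8),
    "T4": (8, 8),
}


def hardware_threads(chip):
    """Total hardware threads on one chip."""
    cores, threads_per_core = chips[chip]
    return cores * threads_per_core


# How much the thread count grew from the T1 to the T3.
thread_ratio = hardware_threads("T3") / hardware_threads("T1")

# Multithreaded TPC-C gain from T1 to T3, as reported in the article.
reported_speedup = 3.75

# Fraction of the ideal, linear-in-threads speedup that was realized.
scaling_efficiency = reported_speedup / thread_ratio

for name in chips:
    print(f"Sparc {name}: {hardware_threads(name)} threads")
print(f"Thread growth T1 to T3: {thread_ratio:.0f}x")
print(f"Scaling efficiency:     {scaling_efficiency:.1%}")
```

Four times the threads delivering roughly 3.75 times the throughput is close to linear scaling, which is the point the chart makes. The same table also shows the T4 dropping back to 64 threads per chip.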
But the single thread performance, as gauged by the SPECint2000 benchmark, hardly budged at all. However, thanks to the thread-hogging capabilities of the S3 core used in the Sparc T4 chip, single-thread performance is up by a factor of five. How this single-thread performance translates on real-world Sparc/Solaris workloads remains to be seen.
The crypto unit in the Sparc T4 chip and its S3 core has also had a working over. It supports AES, DES, Kasumi, Camellia, and CRC32c encryption or hashing in-pipe with a 3 cycle internal latency, plus MD5, SHA-1, SHA-256, SHA-512, MPMUL, MONTMUL, and MONTSQR out of pipe with a longer latency. Depending on the type of encryption, the new S3 core can deliver somewhere between 1.5X and 3.5X the performance of the crypto unit in the S2 core used in the Sparc T3 chip. This on-chip encryption is a big deal for databases. ®