Chip makers to strut their stuff at Hot Chips 23
Many-cored versus monoliths
The Hot Chips 23 symposium on high-performance chips kicks off at Stanford University next week. The makers of processors for smartphones, desktops, servers, and networking gear are polishing up their powerpoints to amaze and daze each other from August 17 through 19.
On the traditional server front, Intel is on deck to talk about the next generation eight-core "Poulson" Itanium processors, which the company detailed quite a bit back in March at the IEEE's International Solid-State Circuits Conference in San Francisco. Those presentations hit the chip community a few weeks before Oracle caused a ruckus when it said it would not support future versions of its database, middleware, and application software on these future Poulson processors. Hewlett-Packard, the biggest and nearly the sole user of the Itanium chip, has subsequently sued Oracle in California over the withdrawal of support, and last week antitrust authorities in Spain said they were looking into the potential of abuse by Oracle.
Speaking of Oracle, that company is a server chip maker and it has not given presentations at ISSCC or Hot Chips since it took over Sun Microsystems in January 2010. But this time around, Oracle will be trotting out two architects responsible for its forthcoming Sparc T4 processor, due later this year. The Sparc T4 processors will have eight cores and implement a new core that Sun called "VT". That designation probably does not refer to the state of Vermont, but rather to virtual threads, in contrast to the hard-coded threads in earlier Sparc T designs. The new core, Oracle has explained, allows a high-priority application to grab one thread on a core and hog all of the resources on that core to significantly boost the performance of that single thread. Oracle has been referring to this trick as the "critical thread API".
IBM will be trotting out a geared-down 16-core Power A2 processor that will be at the heart of the future BlueGene/Q massively parallel supercomputer, and the Chinese Academy of Sciences will be polishing up the ISSCC presentations it made earlier this year for its "Godson" family of MIPS-compatible processors, which have an x86 emulation mode and which are extremely interesting for a lot of other reasons. The Godson-3C processors are expected with 8 to 16 cores running at between 1.5GHz and 2GHz sometime in late 2011 or early 2012. China has plans to create Godson variants suitable for consumer electronics, PCs, notebooks, tablets, embedded applications, servers, and supercomputers. The Godson-3 design includes instructions added to help the QEMU hypervisor (the one that's at the heart of Red Hat's KVM hypervisor) to translate instructions from x86 to MIPS format, with an emulation penalty of about 30 per cent.
Another MIPS chip revealing its transistors at Hot Chips is the 32-core Octeon II CN6880, from Cavium Networks, which tends to make products aimed at the networking space rather than servers. But there is no reason why someone can't – or won't – try to put the Octeon II into a server if it has price/performance or performance advantages.
Upstart many-core chip maker Tilera will be presenting at Hot Chips, too, showing off its Tile-Gx family of chips for servers, networking gear, and other devices. The Tile-Gx3000 series of chips is aimed at servers and will have 36, 64, or 100 cores. Depending on the workload, a single Tile-Gx3000 can deliver the performance of a two-socket x64 server and do so in about a quarter of the space. The 36-core Tile-Gx3036 chips started sampling in July and are expected to appear in products by the end of the year. The Tile-Gx3064 and Gx3100 processors, with 64 cores and 100 cores, respectively, will sample in early 2012 and will appear in products about six months later if all goes according to plan.
SeaMicro, which has been making a big splash with its Atom-based, dense-packed microservers, is not revealing the innards of its custom ASICs in the SM10000 systems, but will give a presentation that generally talks about the pros and cons of building data center servers using "cell phone chips", as the company put it.
On the desktop front, Intel and Advanced Micro Devices will be talking about their respective "Sandy Bridge" Core and "Llano" Fusion processors.
Neither company will be discussing their forthcoming server processors in the x64 racket, which would be Intel's eight-core "Sandy Bridge-EN" and "Sandy Bridge-EP" Xeon E5 chips and AMD's 16-core "Interlagos" Opteron 6200s. The Opteron 6200s look to launch in the third quarter, with the Xeon E5s expected in the fourth quarter. AMD is giving a presentation on the "Bulldozer" core design, which is employed in the Opteron 6200 processors as well as in desktop and workstation processors.
Facebook system engineer Amir Michael will be joined by Bill Dally, chief scientist at Nvidia, and Allen Baum, a chip architect at Intel, to talk about the Open Compute Project, which launched open source server and data center designs back in April. These servers are built by Facebook and its partner Quanta Computer and are used in the company's Prineville, Oregon data center. ®
Crysis Lan Party...
I expect that critical threads will only be used for highly optimised code that does minimal memory access and can be accommodated in the first level cache.
I'm sure they created this feature after carefully analysing real world applications; it wouldn't have been dreamt up by the marketing department. In this era of slowing CPU progress, optimisations like this could offer considerable performance & competitive advantage (or help them catch up, as the case may be).
Yes, that is correct. With many threads AND the ability to switch threads in one clock cycle, you can efficiently hide latencies. Normal cpus switch threads in 100s of clock cycles, which means you cannot mask latencies.
For instance, studies by Intel show that a normal x86 server cpu, under full load, idles 50-60% of the time waiting for data. This means an x86 cpu running at 3GHz is actually doing the work of roughly a 1.5GHz cpu.
That is the reason normal cpus have big caches, complex prefetch logic, etc: to try to minimize latencies. CPUs have reached high GHz, but RAM is still slow. Thus, if you have a 5GHz cpu and the RAM is 1GHz, then the CPU needs to wait for RAM all the time. But if both the cpu and RAM run at 1GHz, then the cpu need not wait. Thus, high clocked cpus are not really meaningful. 5GHz POWER6 cpus using 1GHz RAM are really pointless. Even IBM seems to understand this now, as IBM has decreased clock speed and increased the number of cores.
So, how successful is the Niagara approach? Well, the Niagara idles 5-10% of the time under full load, waiting for data. That is much better than 50-60%. Thus, the Niagara at 1.6GHz competes with, and in some cases outperforms, much higher clocked cpus. In fact, Niagara holds several world records today, beating much higher clocked x86 and POWER7 cpus.
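The arithmetic behind those two comparisons can be written out as a quick back-of-the-envelope sketch. It assumes, as the comment does, that useful work scales with the fraction of cycles not spent waiting for data; the function name `effective_ghz` is made up for illustration, and the stall percentages are the ones quoted in this comment, not measurements.

```python
def effective_ghz(clock_ghz: float, stall_fraction: float) -> float:
    """Clock rate left over once the stalled share of cycles is subtracted."""
    if not 0.0 <= stall_fraction <= 1.0:
        raise ValueError("stall_fraction must be between 0 and 1")
    return clock_ghz * (1.0 - stall_fraction)


# Stall figures as quoted in the comment (assumed, not measured here).
chips = {
    "x86 at 3.0GHz": (3.0, 0.50, 0.60),
    "Niagara at 1.6GHz": (1.6, 0.05, 0.10),
}

for name, (clock, stall_low, stall_high) in chips.items():
    best = effective_ghz(clock, stall_low)
    worst = effective_ghz(clock, stall_high)
    print(f"{name}: {worst:.2f} to {best:.2f} GHz of useful work")
```

On these figures the two chips land in roughly the same range of useful cycles per second, which is the comment's point. It says nothing about how much work each cycle does, so it is not a benchmark.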
The funny thing is that Niagara has a tiny cache, because it hides latencies very well. Thus, Niagara is fastest in the world in some cases, without big caches. What does that prove? It proves that Niagara is not cache starved. If it were cache starved, it would never beat 5 GHz cpus. You need 14 (fourteen) POWER6 at 5GHz to match four (4) Niagara T2+ cpus at 1.6GHz in official SIEBEL v8 benchmarks. How is that possible if the Niagara is cache starved?
Conclusion: the ability to hide latencies can be very valuable.
Not sure how much sense this Sun virtual thread stuff will make
The point about hardware threading is mainly (as I understand it) to hide the many latencies, most especially memory latencies. If you cycle through (say) four hardware threads then you have a reasonable chance that an outstanding delay will have been resolved by the time you get back to the first thread (external memory accesses probably being an exception).
If you simply turn off all but one thread, it will sit and do much thumb-twiddling at assorted traffic lights while the others don't get anything done either (because you disabled them), so I wonder how much of a boost this might be. Possibly much smaller than the marketing drones would like to present it.
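To put rough numbers on that worry, here is a toy model, not a description of the real Sparc T4 pipeline. It assumes a core that issues one instruction per cycle, switches threads for free, and runs threads that each compute for a fixed number of cycles and then stall for a fixed number of cycles. All names and parameters are hypothetical, and the model ignores whatever extra core resources a lone "critical" thread would get.

```python
def core_utilization(n_threads: int, compute_cycles: int, stall_cycles: int,
                     total_cycles: int = 10_000) -> float:
    """Fraction of cycles in which the core issues work, round-robin over threads."""
    ready_at = [0] * n_threads                 # first cycle each thread may run again
    remaining = [compute_cycles] * n_threads   # work left before the next stall
    next_thread = 0
    busy = 0
    for cycle in range(total_cycles):
        # Look for the next ready thread, starting after the last one that ran.
        for offset in range(n_threads):
            t = (next_thread + offset) % n_threads
            if ready_at[t] <= cycle:
                busy += 1
                remaining[t] -= 1
                if remaining[t] == 0:
                    # Thread now waits on memory for stall_cycles cycles.
                    ready_at[t] = cycle + 1 + stall_cycles
                    remaining[t] = compute_cycles
                next_thread = (t + 1) % n_threads
                break
        # If no thread was ready, the core idles for this cycle.
    return busy / total_cycles


for n in (1, 2, 4):
    print(n, "thread(s):", core_utilization(n, compute_cycles=1, stall_cycles=3))
```

In this model a single thread keeps the core busy only for the share of time it is not stalled, while enough threads cover each other's stalls completely. Whether the single thread's gain from owning the whole core outweighs that loss is exactly the open question raised above.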
Also... priority inversion, anyone?
(I am not a CPU designer, any who are may correct me).