Fujitsu parades 16-core Sparc64 super stunner
Top of the FLOPS
SC11 Ahead of the SC11 supercomputer conference in Seattle last week, recently awakened supercomputing giant Fujitsu rolled out the kicker: a commercialized version of the K supercomputer that is at the top of the flops charts in the world right now.
A whole lot of details on the Sparc64-IXfx processor and the PrimeHPC FX10 systems were missing, but El Reg has chased them down just as Fujitsu has announced its first paying customer for the FX10 machines.
The K supercomputer is the first machine in the world to break through the 10 petaflops performance barrier as gauged by the Linpack Fortran benchmark test. It was built by Fujitsu for the Japanese government and is installed at the Rikagaku Kenkyusho (RIKEN) research lab in Kobe, Japan.
The K super is based on the "Venus" Sparc64-VIIIfx processor designed by Fujitsu and fabbed by Taiwan Semiconductor Manufacturing Corp. The eight-core Venus chip clocks at 2GHz and delivers 128 gigaflops per chip, has a thermal efficiency of around 2.2 gigaflops per watt, and dissipates around 58 watts.
Some nodes of Fujitsu's PrimeHPC FX10 supercomputer
The K super has 22,032 four-socket blade servers fitted into 864 server racks to bring 705,024 cores to bear on parallel computation jobs. The K machine delivered 10.51 petaflops of sustained performance on the Linpack test, which is 93.2 per cent efficiency as lined up against its peak theoretical performance of 11.28 petaflops. The Torus Fusion, or Tofu, 6D mesh/torus interconnect that Fujitsu has cooked up is no doubt one of the secret sauces in the K and FX10 supers.
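The core count and the efficiency figure can be rechecked from the numbers quoted so far. The sketch below is only back-of-the-envelope arithmetic; the function names are ours, and it works in petaflops, the unit of the 10 petaflops barrier mentioned above.

```python
# Figures quoted in the article for the K supercomputer.
BLADES = 22_032           # four-socket blade servers
SOCKETS_PER_BLADE = 4
CORES_PER_CHIP = 8        # Sparc64-VIIIfx "Venus"
GFLOPS_PER_CHIP = 128.0   # peak gigaflops per chip


def k_core_count() -> int:
    """Total cores: blades x sockets per blade x cores per chip."""
    return BLADES * SOCKETS_PER_BLADE * CORES_PER_CHIP


def k_peak_petaflops() -> float:
    """Peak performance in petaflops (1 petaflops = 1e6 gigaflops)."""
    return BLADES * SOCKETS_PER_BLADE * GFLOPS_PER_CHIP / 1e6


def linpack_efficiency(sustained_pf: float, peak_pf: float) -> float:
    """Sustained Linpack result as a fraction of theoretical peak."""
    return sustained_pf / peak_pf


if __name__ == "__main__":
    peak = k_peak_petaflops()
    print(f"cores: {k_core_count():,}")
    print(f"peak: {peak:.2f} petaflops")
    print(f"efficiency: {linpack_efficiency(10.51, peak):.1%}")
```

Run as a script, this should reproduce the 705,024 cores, the 11.28 petaflops peak, and the 93.2 per cent efficiency cited above.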
The PrimeHPC FX10 super uses double-stuffed 16-core Sparc64 processors, also designed by Fujitsu and fabbed by TSMC, and increases the rack count to 1,024.
Most of the feeds and speeds of the Sparc64-IXfx processor were not available two weeks ago when Fujitsu jumped the gun on the SC11 conference. We knew that the chip has 16 Sparc cores that run at 1.85GHz and that it delivers 236 gigaflops of double-precision floating point number crunching. Now we know what the chip looks like and some more stuff about it.
The Sparc64-IXfx chip has 85GB/sec of memory bandwidth and includes 12MB of L2 cache memory on the chip that is shared by all 16 of those cores. Fujitsu is not implementing a ring interconnect for those cores, as Intel is doing for future Xeon and Itanium processors, but rather is plunking a big L2 cache memory controller in the dead center of the chip and wrapping four banks of L2 cache memory around it. Two banks of cores are on the chip, top and bottom, with a DDR3 main memory controller implemented on each side of the L2 cache banks with memory interfaces out to the memory DIMMs.
The cores on the Sparc64-IXfx processor have 32KB of L1 data cache and 32KB of L1 instruction cache. The core has two integer units, two load/store units, and four floating point units that can execute two add or multiply instructions per clock. The chip can also allow a fat SIMD instruction to span two floating point units. The 16-core chip can do 128 floating point operations per clock, and at just a hair under 1.85GHz, you get 236 gigaflops peak theoretical performance.
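The peak number is just operations per clock multiplied by clock speed. Here is a minimal sketch of that sum; the 1.848GHz clock is our assumption for "a hair under 1.85GHz" and is not a figure given in the text above.

```python
# Peak floating point arithmetic for the Sparc64-IXfx.
FLOPS_PER_CLOCK_PER_CHIP = 128   # quoted in the article for the 16-core chip
CORES_PER_CHIP = 16
ASSUMED_CLOCK_GHZ = 1.848        # assumption: "a hair under 1.85GHz"


def flops_per_clock_per_core() -> float:
    """Per-core share of the chip's 128 operations per clock."""
    return FLOPS_PER_CLOCK_PER_CHIP / CORES_PER_CHIP


def peak_gflops(clock_ghz: float = ASSUMED_CLOCK_GHZ) -> float:
    """Peak gigaflops = floating point operations per clock x clock in GHz."""
    return FLOPS_PER_CLOCK_PER_CHIP * clock_ghz


if __name__ == "__main__":
    print(f"{flops_per_clock_per_core():.0f} flops per clock per core")
    print(f"{peak_gflops():.1f} gigaflops peak per chip")
```

With the assumed clock, the result lands between 236 and 237 gigaflops, in line with the 236 gigaflops Fujitsu quotes.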
The Sparc64-IXfx chip is implemented in a 40 nanometer process from TSMC and the die is nearly perfectly square at 21.9 millimeters by 22.1 millimeters. The chip has 1.87 billion transistors and 1,442 signal pins. During normal operations, Fujitsu says that the Sparc64-IXfx processor will burn about 110 watts.
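For a rough comparison of chip-level efficiency, peak gigaflops can be divided by the quoted wattage for each processor. This is a sketch only: the 236 gigaflops and 110 watt figures for the Sparc64-IXfx are the rounded values above, and the helper name is ours.

```python
def gflops_per_watt(peak_gflops: float, watts: float) -> float:
    """Chip-level efficiency: peak gigaflops divided by power draw in watts."""
    return peak_gflops / watts


# Rounded figures quoted in the article.
VENUS = {"peak_gflops": 128.0, "watts": 58.0}    # Sparc64-VIIIfx
IXFX = {"peak_gflops": 236.0, "watts": 110.0}    # Sparc64-IXfx

if __name__ == "__main__":
    print(f"Sparc64-VIIIfx: {gflops_per_watt(**VENUS):.2f} gigaflops per watt")
    print(f"Sparc64-IXfx:   {gflops_per_watt(**IXFX):.2f} gigaflops per watt")
```

On these numbers the two chips come out within a few per cent of each other, at a little over 2 gigaflops per watt apiece.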
At the top of the chip is an interface to the Tofu interconnect. Each processor socket in the K or FX10 machine has one of its own Tofu interconnect chips. This interconnect chip has a processor bus to link back to the Sparc64-IXfx processor and four Tofu network interfaces that handle packets coming off the processor. It also provides remote direct memory access (RDMA), as InfiniBand does.
The interconnect chip has a Tofu barrier interface that handles collective operations, and a Tofu network router that has ten Tofu links. These links are used to hook the Tofu interconnect chips to up to ten other interconnect chips in the cluster, implementing the 6D mesh/torus when all the links are used.
The interconnect chip also has a PCI-Express 2.0 peripheral controller for linking out to storage and other peripherals. The interconnect chip is implemented in a fairly ancient 65 nanometer process and runs at 312.5MHz, which is roughly one sixth the clock speed of the processor. It has ten bi-directional ports running at 5GB/sec, thus delivering a peak of 100GB/sec of switching capacity.
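The 100GB/sec figure follows if the 5GB/sec rate applies in each direction on every port. That reading is our assumption, and the sketch below simply spells out the multiplication.

```python
# Tofu interconnect chip switching capacity.
PORTS = 10                       # Tofu links per interconnect chip
GB_PER_SEC_PER_DIRECTION = 5.0   # assumption: 5GB/sec is the per-direction rate
DIRECTIONS = 2                   # bi-directional links


def switching_capacity_gb_per_sec(
    ports: int = PORTS,
    per_direction: float = GB_PER_SEC_PER_DIRECTION,
    directions: int = DIRECTIONS,
) -> float:
    """Aggregate peak bandwidth across all ports, counting both directions."""
    return ports * per_direction * directions


if __name__ == "__main__":
    print(f"{switching_capacity_gb_per_sec():.0f}GB/sec peak switching capacity")
```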
You have to think that Fujitsu wants to put the Tofu controller on the future Sparc64-Xfx processor, if there is such a thing. Or at least get it on the same chip package to further increase the density of the PrimeHPC clusters.
The PrimeHPC blade server with Tofu interconnect chips on the left
As with the K super, there are four Sparc64-IXfx processors on each blade in the FX10 machine, with four matching Tofu interconnect chips. All eight chips on the blade are cooled with water blocks, which are attached to rear-door water jackets on the PrimeHPC racks.
The base PrimeHPC FX10 machine has 64 racks, as it turns out, and a loaded-up rack costs about ¥50m, or about $650,000 (£414,000). Those 64 racks have 6,144 compute nodes (four per blade) with 384TB of memory and 1.4 petaflops of peak number-crunching power; this configuration also has 384 I/O nodes, which have a total of 1,536 expansion slots.
This machine has about the same power efficiency as the K super, and burns 1.4 megawatts. A fully loaded 1,024-rack system would have 98,304 compute nodes, 6PB of main memory, and deliver 23 petaflops of oomph while burning 23 megawatts. Such a box would cost $655.4m at list price, but we're pretty sure Fujitsu will cut you a deal.
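The jump from the 64-rack base machine to the 1,024-rack monster is straight linear scaling. A rough sketch follows, assuming one Sparc64-IXfx per compute node at about 236.5 gigaflops (our assumed per-chip peak) and the ¥50m per-rack list price; the function names are ours.

```python
# Scaling the 64-rack base configuration up to a full 1,024-rack system.
BASE_RACKS = 64
BASE_NODES = 6_144
ASSUMED_GFLOPS_PER_NODE = 236.544   # assumption: one 16-core chip per node
YEN_PER_RACK = 50_000_000           # list price per loaded rack, in yen


def nodes_per_rack() -> int:
    """Compute nodes in one rack, derived from the base configuration."""
    return BASE_NODES // BASE_RACKS


def compute_nodes(racks: int) -> int:
    """Compute nodes in a system with the given number of racks."""
    return racks * nodes_per_rack()


def peak_petaflops(racks: int) -> float:
    """Peak petaflops for a system with the given number of racks."""
    return compute_nodes(racks) * ASSUMED_GFLOPS_PER_NODE / 1e6


def list_price_yen(racks: int) -> int:
    """List price in yen, with no volume discount applied."""
    return racks * YEN_PER_RACK


if __name__ == "__main__":
    for racks in (64, 1_024):
        print(
            f"{racks:>5} racks: {compute_nodes(racks):,} nodes, "
            f"{peak_petaflops(racks):.2f} petaflops, "
            f"¥{list_price_yen(racks) / 1e9:.1f}bn"
        )
```

The dollar total depends on the exchange rate used, so the sketch stays in yen.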
Fujitsu is ready to ship the PrimeHPC FX10 machines starting in January 2012, and the University of Tokyo's supercomputing division is the first customer to buy a PrimeHPC FX10 machine. The university is buying a 50-rack setup with 4,800 Sparc64-IXfx nodes with 150TB of memory and 1.13 petaflops of oomph. The FX10 machine at the University of Tokyo is front-ended by 16 Primergy RX200 S6 and 58 Primergy RX300 S6 servers that are being used as access controllers to the 1.13 petaflops monster.
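Dividing total memory by node count hints at how the Tokyo machine is configured. The sketch below assumes binary terabytes (1TB = 1,024GB), which is our assumption rather than something Fujitsu has stated.

```python
GB_PER_TB = 1024   # assumption: binary units


def memory_per_node_gb(total_tb: float, nodes: int) -> float:
    """Average main memory per compute node, in gigabytes."""
    return total_tb * GB_PER_TB / nodes


if __name__ == "__main__":
    tokyo = memory_per_node_gb(150, 4_800)   # University of Tokyo, 50 racks
    base = memory_per_node_gb(384, 6_144)    # 64-rack base configuration
    print(f"University of Tokyo: {tokyo:.0f}GB per node")
    print(f"64-rack base spec:   {base:.0f}GB per node")
```

Under that assumption, the university's nodes carry half the memory per node of the 64-rack configuration described earlier.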
The cluster is backed by 150 Eternus DX80 S2 RAID 5 storage arrays with 1.1PB of capacity, which are connected to the nodes directly, and 80 Eternus DX410 S2 arrays that are implemented using RAID 6 protection across their collective 2.1PB of capacity and shared by all nodes in the cluster.
The whole shebang runs the Fujitsu Exabyte File System, which also made its debut ahead of the SC11 show. FEFS is a variant of the open-source Lustre file system, and Fujitsu has committed to giving its enhancements to Lustre back to the community through a partnership with Whamcloud.
The latter company is offering third-party support for Lustre, which is technically controlled by Oracle since its acquisition of Sun Microsystems nearly two years ago. But Oracle doesn't care about HPC and therefore Whamcloud has forked Lustre and is offering support services to keep the big supercomputing labs of the world happy.
Fujitsu said it wanted to sell 50 of the PrimeHPC FX10 systems in the next three years, predominantly as a development machine for institutions that want to deploy applications on the K machine. One down, 49 to go. ®