Enter the Dragon: The Chinese superputer set to win the Top500 crown

Original URL: https://www.theregister.com/2013/06/10/inside_chinas_tianhe2_massive_hybrid_supercomputer/

Full specs on the 17-megawatt 'Sky River'

Posted in HPC, 10th June 2013 07:22 GMT

The Chinese government can keep a secret probably as well as any organisation on the planet, but pride will get the best of anyone - at least some of the time.

And so it is that Xiangke Liao, the professor from the National University of Defense Technology has outed the details of the "Tianhe-2" massively parallel supercomputer at the International HPC Forum in Changsha, China.

As El Reg has previously reported, some of the details were almost immediately leaked from the meeting. Jack Dongarra, of Linpack benchmark fame and himself a professor at the University of Tennessee, was at the IHPCF meeting and put together a report about Tianhe-2 from the presentation that Xiangke gave on the machine that will almost certainly rank first on the Top500 supercomputer list in a little more than a week.

The Tianhe-2 machine is significant for a number of different reasons. First, once again China is taking the top-flopper crown away from the United States, and has left supercomputing efforts in Europe and Japan in the dust - at least in terms of raw floating point performance.

What the Tianhe-2 super should look like in its final home

Tianhe-2 also shows, for those who don't already get it, that China is not only deadly serious about massively parallel computing, but also about using petascale and someday exascale computing as a means of bolstering its military, its industry and its climate modelling.

And as is the case with all of the superpowers, it is sometimes hard to tell the difference between military and commercial interests. It would be a fair statement to say that China has more money to splurge these days, and is not reluctant to do so. And over the long haul, if you believe that HPC is a driver of industry, that is going to be a problem for the US, Japanese, and European economies if they don't get their HPC priorities straight and explain to people that this is just like an Apollo moon shot, but one that has to be sustained over decades.

That said, the fact that Tianhe-2 is based on Intel Xeon processors and Xeon Phi coprocessors shows that despite China's efforts to deliver its own MIPS-derived Godson processors as well as its own variants of the Sparc processor, it has not been able to get a completely indigenous machine into the field that sits atop the Top500 list, or even among the highest ranking machines. This is no doubt annoying to the Chinese government and the supercomputing community in the Middle Kingdom - and it shows that chip innovation is not as easy as it looks. (Or, more precisely, it is as difficult as it looks.)

Make no mistakes, however. The Tianhe-1A and Tianhe-2 machines built by China are really about designing and testing the homegrown "Arch" interconnect and getting experience in massively parallel applications on the way to building a completely indigenous machine. As long as China has the cash, it can buy development time for its fledgling chip industry by using processors and coprocessors from American suppliers. It can take as much time as it needs to, but the big fear in the U S of A is that China will create a complete supercomputing stack and want to sell it here and in Europe. How do you say no to a country that you owe $2 trillion to? It was far easier to accuse Hitachi and Fujitsu of dumping back in the 1990s and block the sale of their vector supercomputers through massive tariffs to protect Cray and, indirectly, others.

Tianhe-2 is also noteworthy because it is the first time since the ASCI Red machine was installed at Sandia National Laboratories – one of the big Department of Energy facilities in the US – that Intel is at the pinnacle of the Top500 rankings with a machine that is based solely on its motors. Sure, the Tianhe-1A machine that topped the list three years ago had Xeon processors, but it also had Nvidia Tesla coprocessors. And Nvidia was bidding on the upgrade to Tianhe-2, but Chipzilla apparently made the Chinese government an offer it wouldn't refuse. Or, perhaps it was the other way around for all we know. When you are buying 32,000 processors and 48,000 coprocessors, the customer, not the vendor, has the most leverage.

Up to a point, of course. Intel is not in the business of giving chips away, particularly if it wants the US supercomputing centers to adopt a ceepie-phibie approach for their next wave of systems. It is a delicate balance and all parties are inscrutable. The scuttlebutt is that Tianhe-2 is significantly more expensive than some low-ball figures that are floating around – as low as $100m – but until this system hack sees the purchase order, I don't believe any of this idle talk. There are too many people who want to give the impression that Intel gave the motors in this machine away to suit their own purposes. Intel has no interest in telling people what Tianhe-2 costs and is no doubt being muzzled by the Chinese government anyway. Everyone is talking about it costing between $200m and $300m, but the Chinese government has not said yet what it is paying to make Tianhe-2.

Intel – and Sparc and Arch – Inside

The configuration of the Tianhe-2 system is more or less as we described it last week when the rumors started coming out. We had originally speculated that Intel might be supplying a forthcoming, higher-end Xeon Phi 7000 series coprocessor for the machine, which is expected to be launched soon alongside some other 3000 and 5000 series cards. But after mulling it over, we reckoned that what NUDT might do instead is match up the soon-to-be-announced "Ivy Bridge-EP" Xeon E5 v2 processors from Intel with an embedded (meaning smaller and cheaper) version of a 3000 series Xeon Phi. The latter turns out to be the case, although it is much less exciting. The Xeon Phi 3000 parts are no doubt going to be less expensive than the Xeon Phi 7000s, and as with the Xeon and Itanium processors, the top-bin SKUs always offer worse bang for the buck than the standard parts and are aimed at those users who are absolutely desperate for peak performance and are willing to pay through the nose for it.

According to the report put together by Dongarra, Tianhe-2 is currently installed at NUDT, not in a particularly optimal layout, and will eventually be moved to the National Supercomputer Center in Guangzhou (NSCC-GZ) by the end of the year.

The system boards and system installation and testing were done by Inspur, one of several aggressive server makers in China. (As has been the case in the United States for decades, the Chinese government likes to spread around HPC system manufacturing – and therefore its risk – across multiple suppliers. In this case Dawning and Inspur are getting some of the bigger deals. This is a luxury the US could afford in the past but may not be able to sustain in the future; China will be able to afford it unless a ginormous meteor slams into the Great Wall.)

The chip that Intel didn't want you to know about

On the x86 portion of the Tianhe-2 system, NUDT has opted for a top-bin Ivy Bridge-EP part, and Dongarra says it is, in fact, a twelve-core Xeon E5 v2 chip (specifically, the E5-2692) running at 2.2GHz. Intel is sure to be annoyed by the release of this information, since it has not announced the core count of these processors. There has been some yammering about a ten-core part up until now, but now we know there is a twelve-core part. Intel is not expected to announce these Xeon E5 v2 processors until the fall, most likely at Intel Developer Forum in September.

But as was the case with the eight-core "Sandy Bridge-EP" Xeon E5 v1 chips, Intel has ramped up production ahead of the formal launch and cherry-picked some big HPC customers to show off the processor. As El Reg said last week, perhaps NUDT will be the only customer showing off the new Xeon parts in HPC iron if Cray is not ready to get the chips into its "Cascade" XC30 and XS300 - formerly Appro Xtreme-X - machines. Appro was the chosen one last time around, Hewlett-Packard is talking too much about Moonshot and ARM servers to be HP's favorite, and Cisco Systems got its turn in the sun with the Xeon 5600 launch four years ago. IBM was looking to sell its x86 server business a few months ago (and still might be so inclined) and has those annoying Power Systems and System z mainframe lines that just won't roll over and die. (They seem to have taken a rest by the side of the road in recent quarters, to be sure, and have lost some of their luster in HPC, too.)

The layout of the Tianhe-2 compute blades is a bit funky, and in a good way that packs more computing elements in a rack than did the Tianhe-1A system from three years ago. And those computing elements – both the Xeon CPUs and the Xeon Phi x86 coprocessors – have significantly more floating point oomph. The result is a lot more flops per rack.

The Tianhe-1A system had the Arch network interface card and two two-socket Xeon processors on the top portion of the compute module and two Nvidia Tesla cards, attached by PCI-Express 2.0 x16 links to the processors, in the bottom portion of the module.

Oops, I think I broke it

With Tianhe-2, the Arch interconnect and two Ivy Bridge-EP compute nodes appear to be on a single circuit board (even though they are logically distinct). The compute node plus one Xeon Phi coprocessor share the left half of the compute node and five Xeon Phis share the right side, and the two sides can be pulled out separately. There's a reason for this.

In the famous picture of the Tianhe-1A super at the National Supercomputing Center in Tianjin that shows a technician pulling out a compute node, the node was so heavy that he actually jammed it and could not get it to slide back into the system.

So this time around, with even more iron in the drawer, Inspur broke the drawer into two halves for easier maintenance. They link through what NUDT called a "horizontal blind push-pull structure," which is a funky way of saying that the circuit board on the compute side has evenly spaced connectors that mesh with ports coming off the back-end of the modified Xeon Phi 3000 Series cards. You pop some mechanism and these connectors open up and you can slide one half of the drawer out to tweak the iron as needed.

The Tianhe-2 Xeon Phi drawer in action

The Xeon Phi cards only support PCI-Express 2.0 connections at the moment, but Tianhe-2 is tricked out in PCI-Express 3.0 x16 connections and is therefore ready for whatever Xeon Phi coprocessor Intel might come up with down the road. (Say, perhaps, a future "Knights Landing" Xeon Phi card with a lot more cores and a lot more oomph and an upgraded Arch interconnect, which is limited to PCI-Express 2.0 at the moment, too?)

Each Xeon E5 v3 node has 64GB of main memory. The 32,000 processors in the machine have a total of 384,000 cores, and at the 2.2GHz clock speed that yields 6.76 petaflops of aggregate peak floating point performance for the x86 portion of the Tianhe-2 machine.

That special version of the Xeon Phi that NUDT is getting is called the 31S1P, and the P is short for passive cooling. Dongarra said that this Xeon Phi card had 57 cores and was rated at 1.003 teraflops at double-precision floating point, and that is precisely the same feeds and speeds as the 3120A launched back in November with active cooling (meaning a fan). That 3120A had only 6GB of GDDR5 graphics memory, and the 31S1P that NUDT is getting has 8GB like the Xeon Phi 5110P card, which has 60 cores activated, which runs at a slightly slower clock speed, and which burns less juice and creates less heat. It is also 33 per cent more expensive at $2,649 in single-unit quantities. Anyway, with 48,000 of these bad boys, the Xeon Phi part of Tianhe-2 has 2.74 million cores and delivers a peak DP performance of 48.14 petaflops. Add 'em up, and you get 54.9 petaflops peak.

The Tianhe-2, like its predecessor, is also front-ended by a homegrown Sparc-based cluster. NUDT has created its own variant of the Sparc chip, called the Galaxy FT-1500, which has sixteen cores, runs at 1.8GHz, is etched in 40 nanometer processes, burns around 65 watts, and delivers around 144 gigaflops of double-precision performance. The front-end processor for the Tianhe-2 machine has 4,096 of these processors in its nodes, which gives another 590 teraflops. But, who's counting?

Dongarra has seen a Linpack parallel Fortran benchmark run on a portion of the Tianhe-2 machine with 14,336 of the nodes fired up and 50GB of memory on each node (that is a strange number); this box had a peak DP performance of 49.19 petaflops, and managed 30.65 petaflops on Linpack, achieving a 62.3 per cent computational efficiency. That is more than enough to take the top slot in the June Top500 supercomputer rankings.

The most interesting part of the Tianhe-2 system is probably the Arch interconnect, which is also known as TH Express-2 according to Dongarra. The heart of the Arch interconnect is a high-radix router, just like ~~Cray's~~ Intel's "Gemini" and "Aries" interconnects, and like Aries is also uses a combination of electrical cables for short haul jumps and optical cables for long haul jumps. And like InfiniBand networks, Arch uses a fat tree topology, which is why many believe that Arch is, in fact, a goosed up version of InfiniBand, but NUDT is claiming it is a proprietary protocol and frankly, we are in no position to argue.

The TH Express-2 Arch interconnect created by NUDT

The Arch network has thirteen switches, each with 576 ports, at its top level, and that router chip, called the NRC and presumably short for network router chip, has a throughput of 2.76Tb/sec. The Arch network interface chip, called NIC of course, has a lot fewer pins (675 compared to 2,577 in the router) but is designed to have the same die size. This Arch NIC hooks into PCI-Express 2.0 slots on the compute nodes and it looks like there is one Arch port per node. These NIC ports hook into a passive electrical backplane and link local server nodes to each other. The thirteen 576-port switches are used to link groups of racks to each other in the fat tree setup. (Much like top-of-rack and end-of-row aggregation switches do in InfiniBand and Ethernet networks.) Presumably this is done with optical cables. The backplane is running at 10Gb/sec or 14Gb/sec, according to the presentation by Xiangke, and it is not clear if these are two different dialable bandwidths or if different parts of the backplane must run at these two different speeds.

Dongarra said that a broadcast MPI operation was able to run at 6.36GB/sec on the Arch interconnect with a latency of around 9 microseconds with a 1KB data packet across 12,000 of the nodes in Tianhe-2A system. Arch is said to be using a proprietary protocol, but I would bet it is a hodge-podge of different technologies and very likely a superset of InfiniBand. But that is just a hunch, and there are people much more qualified than this system hack to make better guesses.

The whole shebang has 125 compute racks and the thirteen Arch switch racks for a total of 138 racks, plus another 24 racks for storage (12.4PB in size) if you want to count that in the system. It runs the Kylin variant of Linux created by NUDT for the Chinese military as well as the H2FS file system. This beast consumes 17.4MW under load, and the closed-coupled, chilled water cooling system is designed to handle 24MW and - as it turns out - will warm up the water supply in Guangzhou. ®