How to network at a supercomputing show
SC09 There was plenty of noise coming out of the networking vendors, who glue supercomputing nodes and their storage to each other, at last week's SC09 supercomputing trade show in Portland, Oregon.
El Reg has already told you about QLogic getting reseller deals for its quad data rate InfiniBand host channel adapter from Dell, Hewlett-Packard, IBM, and Silicon Graphics. We've also mentioned Mellanox pushing out 120 Gb/sec InfiniBand switches and animating latent InfiniBand offloading capabilities in its ConnectX-2 host channel adapters. Asustek and Micro Star also announced ahead of SC09 that they were tapping Mellanox IB chips to put QDR InfiniBand right on their motherboards, something designed to appeal to the HPC set.
Making the switch to MPI work
Voltaire, the other big supplier of InfiniBand and now 10 Gigabit Ethernet switches, had some news last week, too. First, HP has inked a deal to resell Voltaire's Vantage 8500 10 Gigabit Ethernet switches, which was a perfectly predictable event. The Vantage 8500 switch, which offers 288 ports, started shipping in June. IBM signed up to resell the layer 2 core switches in October. HP also said at the show that it would resell Voltaire's Grid Director 4700 QDR InfiniBand director switch, which sports 324 ports and the ability to double them up. HP is also selling the Unified Fabric Manager software that Voltaire has created for its IB and 10 GE switches.
Having HP peddling its gear is important to Voltaire's top and bottom line. While HP doesn't get much glory on the Top 500 supercomputer list - and this is something that HP has to fix, perhaps by buying SGI or Cray or both, because IT is political and supercomputing is extremely political, and hence a lever to move the military-industrial complex - it certainly does have a pretty broad presence in HPC thanks to the acquisition of Compaq nearly a decade ago. (Convex helped make the Superdomes, but HP really didn't chase HPC deals much after it killed off the DEC Alpha chips and their systems.)
According to Asaf Somekh, vice president of marketing at Voltaire, HP has sold hundreds of thousands of InfiniBand ports (which can run to over $1,000 a pop on the switch side alone). So having HP on board - which wants to do its own networking with its ProCurve and soon-to-be 3Com switches - will be important. Perhaps it will even be important enough for HP to nab either Mellanox or Voltaire, in fact, once it is done digesting 3Com.
Voltaire can't worry about that right now, but is committed to expanding its sales channel for both InfiniBand and 10 Gigabit Ethernet switches. "Others will follow," says Somekh. "It won't be just an HP and IBM play."
In addition to the HP reseller deal, Voltaire said that it has made enhancements to its Unified Fabric Manager software. First, Voltaire has announced an add-on (meaning, not free) software module called UFM Fabric Collective Accelerator that optimizes the collective operations of the Message Passing Interface protocol.
Sometimes, during a parallel calculation, a result being calculated in one node in a cluster is dependent on many or all of the other nodes in the cluster. Collective operations are what bring the data back to that needy node through MPI. The Grid Director 4700 switch can run this add-on code, which means that CPUs on the cluster do not have to run these MPI functions and therefore can get more work done themselves.
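For those who have not had the pleasure, here is a rough idea of what a collective looks like. This is a pure-Python simulation, not real MPI and certainly not Voltaire's code: the function name `tree_reduce` and the binary-tree pattern are assumptions picked for illustration, and each list element simply stands in for the partial result sitting on one node.

```python
# Pure-Python sketch of an MPI-style "reduce" collective.
# Assumption: a binary-tree combining pattern, chosen for illustration only.
# No MPI library is used; each list element plays the part of one node.

def tree_reduce(values, op):
    """Combine one value per simulated node into a single result at node 0.

    Returns (result, rounds), where rounds is the number of communication
    steps the binary-tree pattern needs.
    """
    if not values:
        raise ValueError("need at least one node")
    vals = list(values)
    n = len(vals)
    rounds = 0
    stride = 1
    while stride < n:
        # In each round, node i absorbs the partial result of node i + stride.
        for i in range(0, n - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
        rounds += 1
    return vals[0], rounds


# Eight simulated nodes, each holding one partial sum.
partials = [float(rank + 1) for rank in range(8)]
total, rounds = tree_reduce(partials, lambda a, b: a + b)
print(total, rounds)  # 36.0 3
```

Every one of those combining steps is work that somebody has to do. The pitch for offloading is that it need not be the CPUs on the nodes.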
Mellanox, as we pointed out above, is offloading some MPI functions to its combined IB-Ethernet host channel adapters. Voltaire is putting them in the switch. Maybe if you mixed Voltaire switches and Mellanox adapters, you could eliminate the servers altogether?
Voltaire's Vantage 8500 switches will get the Collective Accelerator feature, too, alongside the InfiniBand director switches (both 20 Gb/sec and 40 Gb/sec versions) when UFM 3.0 is released in the first quarter of 2010. The current release of UFM is 2.2, and it does not have this MPI offloading capability.
The performance enhancement that HPC customers will see from UFM will vary by workload, but Somekh says the gains from UFM tuning in early customer trials range "from dozens to hundreds of percent improvement" in network performance. As for the Collective Accelerator module, Somekh says that the offloading to the switches can reduce MPI collective operations by 90 per cent, cutting total MPI runtime by as much as 40 per cent.
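Those two Collective Accelerator figures can be sanity-checked with a little arithmetic. The sketch below makes two assumptions that the vendor did not spell out: that the 90 per cent refers to time spent in collectives, and that the rest of the MPI runtime is untouched. The function names are made up for the occasion.

```python
# Back-of-envelope check on the quoted offload figures.
# Assumptions (not from Voltaire): "90 per cent" means time spent in
# collectives, and all other MPI time stays exactly the same.

def total_runtime_saving(collective_share, collective_reduction):
    """Fraction of total MPI runtime saved when collective time shrinks."""
    return collective_share * collective_reduction


def implied_collective_share(total_saving, collective_reduction):
    """Share of MPI runtime that collectives must occupy to yield the saving."""
    return total_saving / collective_reduction


share = implied_collective_share(0.40, 0.90)
print(f"{share:.1%}")  # 44.4%
```

On those assumptions, the 40 per cent best case applies to jobs that spend a bit under half of their MPI time in collectives. Jobs that spend less time there would see proportionally less.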
Another new feature of UFM is called Adaptive Suite, which is a bundling of Adaptive Computing's Moab cluster management tool. This can orchestrate the provisioning of cluster resources through individual server, storage, network, operating system, and application provisioning tools. You could think of Moab as air traffic control, and the other tools as pilots that listen to ATC. The integrated UFM-Moab product will also come with UFM 3.0.
Fujitsu does switches, too
Server maker Fujitsu has a division called Frontech that makes ATMs, point of sale terminals, other display devices and, believe it or not, network switches. So Fujitsu was also on hand at SC09, not just to talk about its future eight-core Sparc64-VIIIfx processors and the supers that will use them, but also layer 2 switches.
Specifically, Fujitsu announced the XG2600, a 10 GE layer 2 switch that puts 26 ports into a 1U chassis. It uses SFP+ optical modules and can use SFP+ twinax copper cables. The unit's spec sheet says it can deliver up to 520 Gb/sec of aggregate bandwidth with switching latency as low as 300 nanoseconds. Fujitsu is also claiming it can deliver this kind of performance at under 5 watts of power consumption per port.
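The spec sheet numbers hang together if you assume the aggregate figure counts traffic in both directions on every port. That full-duplex reading is our assumption, not something Fujitsu stated here, and the helper names below are invented for the sketch.

```python
# Rough check of the XG2600 figures quoted above.
# Assumption: "aggregate bandwidth" counts both directions of each port.

PORTS = 26
LINE_RATE_GBPS = 10
WATTS_PER_PORT_CEILING = 5  # "under 5 watts" per port, so this is an upper bound


def aggregate_bandwidth_gbps(ports, line_rate_gbps, full_duplex=True):
    """Total switching bandwidth, optionally counting both directions."""
    directions = 2 if full_duplex else 1
    return ports * line_rate_gbps * directions


def power_ceiling_watts(ports, watts_per_port):
    """Upper bound on power draw implied by a per-port figure."""
    return ports * watts_per_port


print(aggregate_bandwidth_gbps(PORTS, LINE_RATE_GBPS))     # 520
print(power_ceiling_watts(PORTS, WATTS_PER_PORT_CEILING))  # 130
```

So a fully loaded box should come in under 130 watts if the per-port claim holds across all 26 ports.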
Those are pretty good numbers when you consider that Arista Networks was on the show floor bragging that in actual benchmark tests, its 7148SX 48-port 10 GE SFP+ layer 2/3 switch was able to demonstrate "extraordinarily low latency" of 600 nanoseconds. (You can see the benchmark tests validating this performance here.) When you read the report, you see that 48 ports is a lot to cram into a 1U form factor, and that average latency is more like 1,273 nanoseconds.
Arista, like many switch makers, is using silicon from Fulcrum Microsystems. Fulcrum was also at the show peddling a whitebox switch along with partner Teranetics that uses Fulcrum's FocalPoint FM4224 10 GE switch chip - the same one used by Arista. This has been paired with Teranetics' dual-port, triple-rate TN20225 10GBase-T physical device to make a 1U switch that has 20 10GBase-T ports and four SFP+ ports.
This whitebox - code-named "Monte Carlo" - is available on an OEM basis for $900 per port. So, if you want to try to take on Andy Bechtolsheim, one of the founders of Sun Microsystems and the brains behind Arista Networks, here's your chance.
InfiniBand, 10 GE, and Gigabit Ethernet in HPC
One last interesting bit of networking news coming out of SC09 last week: the distribution of interconnects among the Top 500 supers ranking. Obviously, the fastest 500 machines are not indicative of the current state of cluster interconnects, but they are a kind of leading indicator of what will be normal sometime down the road and what is fading from the market.
For all the talk about 10 GE switches, there is only one machine using that technology on the current Top 500 list. There are, by contrast, 13 machines using QDR InfiniBand, another 31 using DDR InfiniBand, and another 137 using regular old 10 Gb/sec InfiniBand. Mellanox says it has a 37 per cent share of the InfiniBand switches (by machine count, not ports) on the Top 500 list, and Voltaire says that it has just north of 50 per cent IB share on the list.
But there are plenty of cheapskates, even in the upper echelon of supercomputing. Another 258 machines are based on - we can say it - unimpressive Gigabit Ethernet switching between supercomputing nodes. Just remember, it isn't how big your switch is, but what you do with it that counts. The New York Stock Exchange has Gigabit Ethernet guts.
There are another 15 machines using Cray's "SeaStar" XT family interconnect, three using Quadrics interconnects (but Quadrics is dead, so that will change soon enough), seven using Myrinet interconnects, three using SGI's NUMAlink, and the rest using a variety of federation, fat tree network, 3D torus or proprietary interconnects. ®
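For the spreadsheet-averse, the interconnect counts quoted above can be tallied in a few lines. The counts are the ones given in this story. The size of "the rest" is derived on the assumption that every machine appears in exactly one category and the list holds 500 systems.

```python
# Tally of the Top 500 interconnect counts quoted in the text.
# Assumption: categories are mutually exclusive and the list has 500 entries,
# so "the rest" is whatever remains after the named categories.

TOTAL_MACHINES = 500

counts = {
    "10 Gigabit Ethernet": 1,
    "QDR InfiniBand": 13,
    "DDR InfiniBand": 31,
    "10 Gb/sec InfiniBand": 137,
    "Gigabit Ethernet": 258,
    "Cray SeaStar": 15,
    "Quadrics": 3,
    "Myrinet": 7,
    "SGI NUMAlink": 3,
}


def share(count, total=TOTAL_MACHINES):
    """Percentage of the list represented by a machine count."""
    return 100 * count / total


infiniband_total = sum(n for name, n in counts.items() if "InfiniBand" in name)
unlisted = TOTAL_MACHINES - sum(counts.values())

print(infiniband_total, f"{share(infiniband_total):.1f}%")                  # 181 36.2%
print(counts["Gigabit Ethernet"], f"{share(counts['Gigabit Ethernet']):.1f}%")  # 258 51.6%
print(unlisted)                                                             # 32
```

On that reckoning, InfiniBand of all speeds accounts for a bit over a third of the list, Gigabit Ethernet for just over half, and 32 machines fall into the catch-all at the end.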