Mellanox pushes InfiniBand to 120Gb/s
Offloading IB processing from CPUs to ConnectX-2s
SC09 If 40 Gb/sec InfiniBand is not enough for you, then you'll be happy to hear that InfiniBand switch maker Mellanox Technologies is going to crank its switches past 11 to deliver 120 Gb/sec ports in its MTS and IS switch families.
As it turns out, the InfiniScale IV chips that Mellanox created for its 40 Gb/sec IS5000 switches, which were announced in June, are able to support 120 Gb/sec ports. The company said at the SC09 supercomputing trade show in Portland, Oregon, that in the first quarter of 2010 it will ship a variant of the MTS3600 fixed-port switch that employs CXP connections. These connections, which are already used with 40 Gb/sec InfiniBand by a number of vendors, gang up 12 lanes of 10 Gb/sec InfiniBand traffic into a single port.
The MTS3600 comes in a 1U box that supports 36 ports running at 40 Gb/sec, and a variant of this box supporting CXP links will come out with 12 ports running at 120 Gb/sec. The leaf modules in the IS5000 series of modular switches, which came out in June as well, have their port counts cut to a third and their bandwidth per port tripled.
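The port arithmetic is easy to check. The short Python sketch below uses only the figures quoted in this story (12 lanes of 10 Gb/sec per CXP port, 36 ports at 40 Gb/sec, 12 ports at 120 Gb/sec). The function and variable names are made up for illustration and come from no Mellanox tool, and the numbers are raw link rates as quoted, not measured throughput.

```python
# Back-of-the-envelope check of the port figures quoted in the article.
# All names here are illustrative; nothing below is a Mellanox API.

LANE_GBPS = 10   # per-lane rate quoted in the article
CXP_LANES = 12   # lanes ganged into one CXP port, per the article


def port_gbps(lanes, lane_gbps=LANE_GBPS):
    """Raw rate of one port built from `lanes` lanes."""
    return lanes * lane_gbps


def aggregate_gbps(ports, gbps_per_port):
    """Total raw port bandwidth of a fixed-port switch."""
    return ports * gbps_per_port


cxp_port = port_gbps(CXP_LANES)               # one 12-lane CXP port
standard_box = aggregate_gbps(36, 40)         # MTS3600 with 36 ports at 40 Gb/sec
cxp_box = aggregate_gbps(12, cxp_port)        # CXP variant with 12 fast ports

print(cxp_port, standard_box, cxp_box)
```

On these figures, both boxes expose the same total port bandwidth. The CXP variant simply carves it into a third as many ports, each three times as fat.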
Mellanox was demonstrating the new 120 Gb/sec switches at the show, and the high-speed InfiniBand switches were the backbone of a 400 Gb/sec network supporting the show. SCInet, the network provider for the SC trade shows, slapped that network together, and it would be worth $20m if you had to buy it. (The SC event does not have to buy its network, since vendors are thrilled to donate equipment and experts to be part of the high-speed backbone.)
Eventually, the 120 Gb/sec product line will include fixed-port switches with 12, 36, and 72 ports and hybrid 120 Gb/sec and 40 Gb/sec switches that have six of the fast ports and 18 of the slower ones.
According to John Monson, vice president of marketing at Mellanox, the price per bit on the 120 Gb/sec variants of the switches will be the same as on the 40 Gb/sec, but the port costs will obviously triple. The reason why HPC shops would want to move to 120 Gb/sec is simple: by ganging up the InfiniBand pipes, users can cut the InfiniBand transport overhead by a factor of three and squeeze more performance out of their clusters. The switches will also allow for more bandwidth to be allocated in 3D torus systems between the cubes that make up the torus.
Enabling the upgrade to 120 Gb/sec is a "golden screwdriver" patch to the firmware, since the InfiniScale IV chips already supported the higher bandwidth, but obviously the CXP ports and their optical cables are different and you have to buy a new box to get them.
Mellanox also said at the SC09 show that the ConnectX-2 host channel adapters it has been shipping had another golden screwdriver upgrade. The prior generation of ConnectX IB cards allowed for some of the processing related to the InfiniBand protocol to be offloaded from CPUs in the server nodes to the IB cards. (Much as TCP/IP has had offloading for Ethernet cards for a number of years now.)
With the ConnectX-2 cards, Monson says that Mellanox has gone one step further and put electronics in the HCA that can take over some of the Message Passing Interface (MPI) collectives operations - those that broadcast data, gather data, or otherwise synchronize the nodes in a cluster.
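For readers who have not met MPI collectives, the toy Python sketch below shows what the operations mean. It is a single-process simulation in which each list element stands in for one node's data. It is not MPI code and says nothing about how the ConnectX-2 offload is implemented. The function names are hypothetical, and the all-reduce is a common collective added for illustration, not one Mellanox named.

```python
# Toy, single-process model of what MPI collectives do logically.
# Each list index plays the role of one rank (node) in the cluster.
from functools import reduce
import operator


def broadcast(values, root=0):
    """Every rank ends up holding the root rank's value."""
    return [values[root]] * len(values)


def gather(values, root=0):
    """The root collects one value from every rank; other ranks get None."""
    return [list(values) if rank == root else None
            for rank in range(len(values))]


def allreduce(values, op=operator.add):
    """Every rank ends up holding the reduction over all ranks."""
    total = reduce(op, values)
    return [total] * len(values)


per_rank = [3, 1, 4, 1]          # one value held by each of four ranks
print(broadcast(per_rank))       # all ranks now hold rank 0's value
print(gather(per_rank)[0])       # rank 0 holds everyone's values
print(allreduce(per_rank))       # all ranks hold the sum
```

In a real cluster every one of these steps means messages crossing the fabric and CPUs waiting on them. That waiting is the work Mellanox says it is moving onto the adapter.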
Mellanox has done benchmark tests that show that clusters lose 20 to 30 per cent of their scalability to MPI collectives communications, and that means cycles on the CPUs that could be doing other work are stuck doing this MPI work. Hence the offload, which is turned on with a firmware change on the existing cards and which cuts down on jitter and noise in the cluster and lets it get more work done.
Interestingly, the ConnectX-2 IB cards also have an embedded floating point co-processor, which can take over some of the calculation jobs that the MPI stack sends to a server node in the cluster, provided it has a Mellanox IB card.
Mellanox has been working with Oak Ridge and Los Alamos, two of the national laboratories funded by the US Department of Energy, to tweak the MPI stack to see this embedded floating point unit. Monson is mum about when this capability will be activated. ®