Original URL: https://www.theregister.co.uk/2013/02/19/intel_qlogic_network_update/

Intel pits QDR-80 InfiniBand against Mellanox FDR

And hints at integrated networking on future CPUs

By Timothy Prickett Morgan

Posted in Networks, 19th February 2013 01:58 GMT

As Joe Yaworski, fabric product marketing manager at Intel, put it to El Reg, the next frontier of system innovation will be in fabrics. As we see it, that frontier may also be the front line of the InfiniBand war between Mellanox and Intel – with one upcoming battle being the former's 56Gb/sec interconnects versus Intel's new QDR-80.

Unfortunately, of all the interesting details that you may want to ask Intel about what it plans to do in the networking business, it's not really prepared to provide them. Well, most of them.

But fortunately, Intel does have to keep hinting. It needs to fend off competition from other suppliers of networking technology – and that's true both at the high end, where you'll find InfiniBand rival Mellanox, and at the low end, where an army of ARM suppliers are marrying low-powered processors and interconnects to disrupt the hegemony of Xeon servers and top-of-rack switches in the hyperscale data center.

Yaworski hails from the QLogic InfiniBand business that Intel acquired a year ago for $125m. That True Scale networking business, as the brand is now known, includes InfiniBand server host adapters as well as InfiniBand ASICs and whole switches.

But True Scale is only a piece of the networking future at Intel, and the company is looking to do more than supply raw components to switch and server makers.

Intel aims to double the revenue of its Data Center and Connected Systems Group to $20bn by 2016. It took ten years to double it to $10bn, a level the company hit in 2011, and that was done by expanding from server chipsets, processors, and motherboards out to storage components and then networking components.

With ambitions to be the supplier of components for future exascale class machines – capable of supporting maybe 100,000 server nodes with an exaflop of floating point performance and an exabyte of permanent storage of some kind – Intel needed to beef up its networking intellectual property, engineers, and marketeers.

With exascale systems being expensive and requiring tight integration of processing, networking, and memory components to lower power consumption, Intel has no choice but to become an expert in networks. And, as El Reg has contended for a number of years, Chipzilla may also have no choice but to return to the memory business whence it sprang so many years ago, or else partner so tightly with a company such as Micron Technology for 3D memory stacking that Micron would effectively become a division of Intel – much like HP and Dell are more or less informal divisions of Microsoft and Intel when it comes to PCs and servers.

To build up its networking portfolio, Intel started out by snapping up Fulcrum Microsystems, a maker of ASICs for Ethernet switches, in July 2011 for an undisclosed sum. (Probably a healthy premium over the $102m in venture capital that Fulcrum had raised, so for fun let's call it $160m.)

Then came the QLogic buy seven months later, followed by the acquisition of Cray's "Aries" dragonfly supercomputer interconnect for $140m three months after that in April 2012. Intel already had a respectable business making chips for network interface cards, and if it didn't want to get into the switch racket with gusto, it could have stayed there.

Now Yaworski is being trotted out by Intel to talk about how True Scale 40Gb/sec QDR (Quad Data Rate) InfiniBand products are competing against rival products from Mellanox running at 56Gb/sec FDR (Fourteen Data Rate) speeds based on benchmark tests in some of the big supercomputing labs run by the US Department of Energy.

But Yaworski concedes that the topic of interest is what Intel will do with all of these networking assets – and we will get to that in a moment.

Back when the QLogic deal was done, Yaworski reminded El Reg of something that QLogic has been saying for years: InfiniBand, which was championed first and foremost by IBM and Intel, was not designed to be an interconnect for supercomputer or cloud clusters, but rather as a switched fabric for linking more modest numbers of servers to each other and to their peripherals.

IBM, by the way, uses InfiniBand in this manner today in its Power Systems machines, and has for more than a decade to link server processors directly to remote I/O drawers that house controllers, disk drives, and flash units.

The InfiniBand protocol was largely derived from an I/O paradigm, which made sense when it was designed to replace Ethernet and Fibre Channel. But when InfiniBand first came out in 2000, the mainstream Ethernet switches of the time ran at 10Mb/sec and had port-to-port hop latencies of 150 microseconds; InfiniBand could handle 10Gb/sec of bandwidth and deliver it with a 15 microsecond latency.

And so the message passing interface (MPI) protocol for supercomputer clusters was grafted onto InfiniBand, and a supercomputing niche was born.

QLogic spent years stripping out the verbs architecture of InfiniBand and replacing it with its own "Performance Scaled Messaging" (PSM) reworking of MPI support, which has a more lightweight interface and is very efficient with the kind of small messages that are typical of HPC applications and not typical of I/O workloads.
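To see why that small-message focus matters, consider a crude back-of-the-envelope model – our sketch, with illustrative numbers, not QLogic's actual figures – in which transfer time is wire latency plus payload size over bandwidth. For tiny MPI messages, latency swamps bandwidth entirely:

```python
# Back-of-the-envelope message-time model: transfer time is
# wire latency plus payload size divided by bandwidth.
# Numbers here are illustrative, not measured PSM or verbs figures.

def transfer_time_us(size_bytes, latency_us, bandwidth_gbps):
    """Time to move one message, in microseconds."""
    wire_seconds = size_bytes * 8 / (bandwidth_gbps * 1e9)
    return latency_us + wire_seconds * 1e6

# An 8-byte MPI message versus a 1MB bulk transfer on a
# hypothetical 1.5us / 40Gb/sec QDR link.
small = transfer_time_us(8, 1.5, 40)           # latency-dominated
bulk = transfer_time_us(1_048_576, 1.5, 40)    # bandwidth-dominated

print(f"8B message:  {small:.4f} us")   # ~1.5016 us, nearly all latency
print(f"1MB message: {bulk:.1f} us")    # ~211.2 us, nearly all wire time
```

On those assumed figures, shaving bandwidth does almost nothing for the 8-byte case, while shaving latency helps directly – which is the whole pitch behind a lightweight messaging layer.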

You have to be careful whenever any vendor whips out benchmark test results, and particularly in a realm where very few organizations have the kind of system scale that DOE supercomputing centers bring to bear. So get out your salt shaker and take a few grains.

Over at Sandia National Laboratories, where the US government is doing simulations to model the degradation of its nuclear weapons stockpile (to better maintain it for readiness) as part of the National Nuclear Security Administration, the boffins have been putting the new Chama system through its paces. Chama is one chunk of a 6 petaflops super that Appro International (now part of Cray) was commissioned to build back in June 2011. The other two bits of this Tri-Labs Compute Cluster 2 (TLCC2) system, by the way, are going into Lawrence Livermore and Los Alamos National Laboratories, and they are running tests as well.

Appro picked Intel Xeon E5 server nodes for the TLCC2 machine, and QLogic adapters and switches to lash them together. It was probably around here that Intel started to see the networking handwriting on the wall, and started thinking about buying the QLogic InfiniBand biz.

Eventually, Tri-Labs will share a machine with more than 20,000 nodes across the three sites, and this was the biggest deal that Appro ever did – and the wake-up call to Cray to buy Appro and get into the InfiniBand cluster racket proper.

Here's what Sandia found in its benchmark tests on three of its machines, which admittedly are using older Mellanox InfiniBand and Cray "Gemini" 3D torus interconnects compared to QDR InfiniBand links from Intel, and mixing and matching processor types and vintages, too:

Sandia stacks up its Chama, Red Sky, and Cielo supers

There are many things that make these unfair comparisons, but the biggest one is the AVX floating point instructions in the Chama machine, which are absent from the two others. In fact, there are so many things that are different about these three systems that it would be hard to isolate the impact of choosing one interconnect over another. A proper test would pit identically configured nodes against each other with the different interconnects, but there is not enough time or money in the world to do that at DOE labs. (Hard to believe, but true.)

Here is another set of performance specs that came out of the Tri-Labs Appro machines, this one pitting Opteron nodes using Mellanox InfiniBand against Intel nodes using QLogic InfiniBand. This comparison is somewhat closer – but again, there are such substantial differences between the server nodes (in terms of configurations and count) that it is hard to call this an apples-to-apples comparison. Maybe quince-to-apple, or apple-to-pear would be closer:

Los Alamos pits the Typhoon Opteron/SwitchX cluster against the Luna Xeon/True Scale cluster

The intent of this chart is simple enough. Both machines are hooked together with QDR InfiniBand, and the Luna machine using Intel processors and QLogic adapters and switches has 4.2 times the performance on a Tri-Labs workload. If you normalize that for the number of cores, the number of nodes, and the clock speed of the cores – that's an aggregate of 26,624GHz for the Typhoon box compared to an aggregate of 64,064GHz for Luna – then you would expect to see somewhere around a factor of 2.4 times higher performance in the Luna machine. Intel contends that its superior implementation of InfiniBand is what makes the difference.
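That normalization is easy to check with a few lines of arithmetic. The figures are the ones above; the attribution of the leftover factor to the interconnect is Intel's claim, not ours:

```python
# Check of the aggregate-clock normalization described above.
typhoon_ghz = 26_624.0   # aggregate GHz, Typhoon (Opteron/Mellanox)
luna_ghz = 64_064.0      # aggregate GHz, Luna (Xeon/True Scale)

expected_from_clocks = luna_ghz / typhoon_ghz   # ~2.4x from raw clocks
observed = 4.2                                  # measured speedup

# The residual is what Intel attributes to its InfiniBand stack
# (though compiler, memory, and AVX differences muddy the water).
residual = observed / expected_from_clocks      # ~1.75x unexplained

print(f"expected from clocks: {expected_from_clocks:.2f}x")
print(f"residual factor: {residual:.2f}x")
```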

Over at Lawrence Livermore, applications are seeing a 1.8X to 3X performance speedup that can be attributed to a move to Xeon E5 processors, but with the pairing of Xeon E5s with True Scale QDR InfiniBand, workloads are seeing a 4X to 4.7X speedup, says Yaworski, calling it a "better-together story."

El Reg isn't presenting this data to settle any arguments, but it will no doubt start a few – and rightly so. It takes a lot of data to make a case for superior performance of any technology.

Intel doubles up QDR for two-socket boxes

So, with Mellanox pushing 56Gb/sec InfiniBand on adapters that are well suited to the on-chip PCI-Express 3.0 ports on the Intel Xeon E5-2600 processors (you need the 2X increase in bandwidth over PCI-Express 2.0 to push the FDR InfiniBand card hard), what is Intel countering with? Something it calls QDR-80.

By the way, Yaworski tells El Reg that 75 per cent of the HPC market can be addressed by QDR InfiniBand. And if you really want to be honest, many clusters where cost is more important than bandwidth or latency are still being built with Gigabit Ethernet switches, although 10GE switches are catching on as they come down in price and offer very low latency.

Intel's QDR-80 gives each socket its own QDR InfiniBand adapter

When Intel was doing the QLogic deal, Yaworski was perfectly upfront with El Reg, saying that QLogic was not sure before the acquisition if it would do FDR InfiniBand – and in the wake of it, Intel remained unsure. The QDR-80 approach splits the difference by giving each socket in a two-socket server node its own QDR InfiniBand card rather than trying to push them both to talk over a single FDR InfiniBand card.

The important thing about QDR-80, says Yaworski, is that one socket does not have to send its traffic over the pair of QuickPath Interconnect (QPI) links that glue the two processors to each other to make a single system image. The Intel compilers know about QDR-80 and how to arrange code so it doesn't try to go over the QPI link.
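The idea can be sketched in a few lines – a toy illustration with hypothetical device names and a hypothetical rank-to-socket mapping, not Intel's actual tooling; in practice the MPI library's own process-placement and adapter-selection options do this work:

```python
# Illustrative sketch of the QDR-80 idea: route each MPI rank's
# traffic through the InfiniBand adapter on its own socket, so no
# message has to cross the QPI links between the two processors.
# Device names and the rank->socket rule are hypothetical.

CORES_PER_SOCKET = 8
HCA_FOR_SOCKET = {0: "qib0", 1: "qib1"}  # one QDR adapter per socket

def hca_for_rank(local_rank):
    """Pick the adapter local to the socket this rank is bound to."""
    socket = (local_rank // CORES_PER_SOCKET) % 2
    return HCA_FOR_SOCKET[socket]

# Ranks 0-7 land on socket 0's adapter, ranks 8-15 on socket 1's.
assignments = {rank: hca_for_rank(rank) for rank in range(16)}
print(assignments[0], assignments[8])  # qib0 qib1
```

The payoff is that each socket gets a full QDR pipe of its own instead of two sockets arbitrating for one FDR pipe across QPI.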

It is hard to argue with this logic, particularly when Yaworski says that the regular QLogic QDR adapters are cheap enough that you can have two of them for the price of a single FDR InfiniBand adapter.

That is, it may be hard to argue, but not impossible.

For one thing, twice as many adapters take up twice as many slots and twice as many cables. That extra cabling is a real cost, but doubling up components does buy a certain amount of redundancy.

To El Reg's thinking, this QDR-80 approach might argue for a single-socket Xeon E3-1200 v2 server node with a QDR or FDR InfiniBand adapter welded right onto the motherboard. It is not like HPC customers put a lot of memory on their nodes, anyway – look at the data above. QDR-80 can be thought of as a kind of networking SMP or turning a two-socket box into a shared microserver pair.

So with InfiniBand and Ethernet both pushing up toward 100Gb/sec speeds soon and more beyond that, what is Intel's plan for the future of InfiniBand?

Yaworski is not making any commitments. "Obviously, going above 80Gb/sec is a goal," says Yaworski, hedging with a laugh. "It will definitely be greater than 80Gb/sec and less than 200Gb/sec."

The real issue, of course, is not bandwidth. It's latency. And that is the stickler that is going to make exascale systems a challenge.

"We think we can drive it down lower," says Yaworski, and part of that is done by eliminating hops across the QPI links and the PCI-Express bus. "Every foot of cabling adds latency," so getting server nodes closer to each other is as important as getting network interfaces down onto the processors. But because of legacy peripheral needs, Intel does not expect to ditch the PCI bus any time soon, so don't get excited and have flashbacks to the original InfiniBand plan.

"We know that fabrics are the next bottleneck," says Yaworski. "We know we need a new initiative to address these needs. We know we need to drive the fabric down closer and closer to the processor, which drives up bandwidth and drives up scalability by reducing latency."

Intel is not making any promises about how and when it will add networking to Xeon processors, although we surmise that "Haswell" Xeons could have Ethernet ports on the die, and maybe, just maybe, "Ivy Bridge" Xeons not in the X3 family will, too.

Diane Bryant, general manager of Intel's Data Center and Connected Systems Group, admitted to El Reg last fall that the future "Avoton" 22-nanometer Atom S-series chip, due later this year, would have an Ethernet controller on the die.

This is in contrast to the distributed Layer 2 switch and virtual Ethernet ports that multicore ECX-1000 ARM server chips from Calxeda have on the die, which can lash up to 4,096 nodes into a network (with various topology options) and in future years will grow to do 100,000 nodes and more, if all the plans work out. All this without a single top-of-rack switch.

Think about that for a minute and see how this may upset Intel's switching-market dominance plans. Think about the lock-in that gives a vendor like Calxeda. Ask yourself why Calxeda is not just saying to hell with Dell and HP and building its own servers on Open Compute enclosures.

Intel, being professionally paranoid, is keenly aware of this and has spent what is probably on the order of $425m to buy those three networking companies and get their engineers focused on this and other problems in the hyperscale data centers of the future. The way Yaworski talks about it, InfiniBand is being positioned for the volume HPC space, and the Cray Aries interconnect (and, more importantly, its follow-ons) are being positioned for the real exascale systems.

What Intel almost certainly does not want to do is make custom versions of Xeon processors with hooks into the Cray interconnect that are distinct from Xeon processors that have controllers that speak InfiniBand or Ethernet (as the ConnectX adapters and SwitchX ASICs from Mellanox do).

In this thought experiment, what you want is for a Xeon chip to have network controllers that can speak Ethernet, InfiniBand, or Aries, and that link through optical fiber ports out to an InfiniBand or Ethernet switch or an Aries router outside of the motherboard. Now you only have one Intel chip, but you can deploy it in many ways and not be bottlenecked by the PCI bus on the server. Yaworski would not confirm or deny that this is the plan.

"All I can say is that we will drive the fabric closer and closer to the CPU," he said, "and eventually achieve integration into the CPU that allows us to cover the key market segments that we are targeting, which is HPC, data center, and cloud." ®