Intel doubles up QDR for two-socket boxes
So, with Mellanox pushing 56Gb/sec InfiniBand on adapters that are well suited to the on-chip PCI-Express 3.0 ports on the Intel Xeon E5-2600 processors (you need the 2X increase in bandwidth over PCI-Express 2.0 to push the FDR InfiniBand card hard), what is Intel countering with? Something it calls QDR-80.
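The bandwidth arithmetic behind that parenthetical can be sketched roughly as follows. This is a back-of-the-envelope illustration, not anything supplied by Intel or Mellanox: it assumes an x8 PCI-Express slot and 4x InfiniBand links, uses the standard per-lane signalling rates and line encodings, and ignores protocol overhead. The function name `usable_gbps` is made up for the example.

```python
# Rough usable-bandwidth estimate: lanes x per-lane rate x encoding efficiency.
# Assumptions (not from the article): x8 PCIe slot, 4x InfiniBand link,
# protocol overhead (headers, flow control) ignored.

def usable_gbps(lanes, gbps_per_lane, payload_bits, coded_bits):
    """Return data bandwidth in Gb/s after line-encoding overhead."""
    return lanes * gbps_per_lane * payload_bits / coded_bits

pcie2_x8 = usable_gbps(8, 5.0, 8, 10)        # PCIe 2.0: 5 GT/s, 8b/10b
pcie3_x8 = usable_gbps(8, 8.0, 128, 130)     # PCIe 3.0: 8 GT/s, 128b/130b
qdr_4x = usable_gbps(4, 10.0, 8, 10)         # QDR: 10 Gb/s per lane, 8b/10b
fdr_4x = usable_gbps(4, 14.0625, 64, 66)     # FDR: 14.0625 Gb/s per lane, 64b/66b

for name, value in [("PCIe 2.0 x8", pcie2_x8), ("PCIe 3.0 x8", pcie3_x8),
                    ("QDR 4x", qdr_4x), ("FDR 4x", fdr_4x)]:
    print(f"{name}: {value:.1f} Gb/s")
```

On these numbers a PCI-Express 2.0 x8 slot roughly matches one QDR port but falls well short of what an FDR port can carry, while a PCI-Express 3.0 x8 slot has headroom for it.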
By the way, Yaworski tells El Reg that 75 per cent of the HPC market can be addressed by QDR InfiniBand. And if you really want to be honest, many clusters where latency or cost is more important than bandwidth are still being built with Gigabit Ethernet switches, although 10GE switches are catching on as they come down in price and offer very low latency.
Intel's QDR-80 gives each socket its own QDR InfiniBand adapter
When Intel was doing the QLogic deal, Yaworski was perfectly upfront with El Reg, saying that QLogic was not sure before the acquisition if it would do FDR InfiniBand – and in the wake of it, Intel remained unsure. The QDR-80 approach splits the difference by giving each socket in a two-socket server node its own QDR InfiniBand card rather than trying to push them both to talk over a single FDR InfiniBand card.
The important thing about QDR-80, says Yaworski, is that one socket does not have to send its traffic over the pair of QuickPath Interconnect (QPI) links that glue the two processors to each other to make a single system image. The Intel compilers know about QDR-80 and how to arrange code so it doesn't try to go over the QPI link.
It is hard to argue with this logic, particularly when Yaworski says that the regular QLogic QDR adapters are cheap enough that you can have two of them for the price of a single FDR InfiniBand adapter.
That is, it may be hard to argue, but not impossible.
For one thing, twice as many adapters take twice as many slots and twice as many cables. The extra cabling is an issue, although doubling up components does provide a certain amount of redundancy.
To El Reg's thinking, this QDR-80 approach might argue for a single-socket Xeon E3-1200 v2 server node with a QDR or FDR InfiniBand adapter welded right onto the motherboard. It is not like HPC customers put a lot of memory on their nodes, anyway – look at the data above. QDR-80 can be thought of as a kind of networking SMP or turning a two-socket box into a shared microserver pair.
So with InfiniBand and Ethernet both pushing up to 100GE speeds soon and more beyond that, what is Intel's plan for the future of InfiniBand?
Yaworski is not making any commitments. "Obviously, going above 80Gb/sec is a goal," says Yaworski, hedging with a laugh. "It will definitely be greater than 80Gb/sec and less than 200Gb/sec."
The real issue, of course, is not bandwidth. It's latency. And that is the stickler that is going to make exascale systems a challenge.
"We think we can drive it down lower," says Yaworski, and part of that is done by eliminating hops across the QPI links and the PCI-Express bus. "Every foot of cabling adds latency," so getting server nodes closer to each other is as important as getting network interfaces down onto the processors. But because of legacy peripheral needs, Intel does not expect to ditch the PCI bus any time soon, so don't get excited and have flashbacks to the original InfiniBand plan.
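To put a rough number on the cabling remark, here is a minimal sketch of propagation delay against cable length. The velocity factor of 0.7 is an assumption for illustration (the real figure depends on the cable), and `cable_delay_ns` is a hypothetical helper, not anything Intel has published.

```python
# Propagation delay per unit of cable length.
# Assumption (not from the article): signals travel at about 70 per cent of
# the speed of light in the cable; the real figure depends on the cable type.

SPEED_OF_LIGHT_M_PER_S = 299_792_458
METRES_PER_FOOT = 0.3048

def cable_delay_ns(length_feet, velocity_factor=0.7):
    """Return one-way propagation delay in nanoseconds."""
    metres = length_feet * METRES_PER_FOOT
    seconds = metres / (velocity_factor * SPEED_OF_LIGHT_M_PER_S)
    return seconds * 1e9

for feet in (1, 10, 100):
    print(f"{feet:>3} ft: {cable_delay_ns(feet):.1f} ns")
```

Under that assumption each foot costs somewhere around a nanosecond and a half one way, which is small per hop but adds up across a large machine room.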
"We know that fabrics are the next bottleneck," says Yaworski. "We know we need a new initiative to address these needs. We know we need to drive the fabric down closer and closer to the processor, which drives up bandwidth and drives up scalability by reducing latency."
Intel is not making any promises about how and when it will add networking to Xeon processors, although we surmise that "Haswell" Xeons could have Ethernet ports on the die, and maybe, just maybe, "Ivy Bridge" Xeons not in the E3 family will, too.
Diane Bryant, general manager of Intel's Data Center and Connected Systems Group, admitted to El Reg last fall that the future "Avoton" 22-nanometer Atom S-series chip, due later this year, would have an Ethernet controller on the die.
This is in contrast to the distributed Layer 2 switch and virtual Ethernet ports that multicore ECX-1000 ARM server chips from Calxeda have on the die, which can lash up to 4,096 nodes into a network (with various topology options) and in future years will grow to do 100,000 nodes and more, if all the plans work out. All this without a single top-of-rack switch.
Think about that for a minute and see how this may upset Intel's switching-market dominance plans. Think about the lock-in that gives a vendor like Calxeda. Ask yourself why Calxeda is not just saying to hell with Dell and HP and building its own servers on Open Compute enclosures.
Intel, being professionally paranoid, is keenly aware of this and has spent what is probably on the order of $425m to buy those three networking companies and get their engineers focused on this and other problems in the hyperscale data centers of the future. The way Yaworski talks about it, InfiniBand is being positioned for the volume HPC space, and the Cray Aries interconnect (and, more importantly, its follow-ons) are being positioned for the real exascale systems.
What Intel almost certainly does not want to do is make custom versions of Xeon processors with hooks into the Cray interconnect that are distinct from Xeon processors that have controllers that speak InfiniBand or Ethernet (as the ConnectX adapters and SwitchX ASICs from Mellanox do).
In this thought experiment, what you want is for a Xeon chip to have network controllers that can speak Ethernet, InfiniBand, or Aries, and that link through optical fiber ports to an InfiniBand or Ethernet switch or an Aries router that is outside of the motherboard. Now you only have one Intel chip, but you can deploy it in many ways and not be bottlenecked by the PCI bus on the server. Yaworski would not confirm or deny that this is the plan.
"All I can say is that we will drive the fabric closer and closer to the CPU," he said, "and eventually achieve integration into the CPU that allows us to cover the key market segments that we are targeting, which is HPC, data center, and cloud." ®
Re: This doesn't make sense
I don't think it doesn't make sense. :-) Looking at this from a sysadmin POV (I'm not an applied maths whiz, but I've worked for and with some):
- Unless your job is embarrassingly parallel, your cluster nodes will need to communicate with each other, not just with the filesystem.
- The pattern and amount of that communication depends on the type and scale of the job.
- As more cores end up inside each compute node, the interconnect has to scale up in speed for some sorts of jobs (definitely for all-to-all patterns) to get the same throughput per core as used to occur when each node had fewer cores. There is also more RAM in each node, and hence more checkpoint data to be saved in the I/O phase - but the I/O phase is likely limited more by the filesystem, unless you're using some fancy two-stage checkpoint setup (i.e. quick dump to dedicated checkpointing system that can then stage it out to the filesystem).
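A rough illustration of the per-core scaling point in the last bullet, with numbers invented purely for the example (the 8-core and 16-core nodes and the 40Gb/sec link are assumptions, not measurements from any real cluster):

```python
# Interconnect bandwidth available per core as core counts rise.
# All figures are illustrative assumptions, not measurements.

def gbps_per_core(link_gbps, cores_per_node):
    """Return the share of one node's link bandwidth each core gets."""
    return link_gbps / cores_per_node

def link_needed_gbps(target_gbps_per_core, cores_per_node):
    """Return the link speed needed to hold per-core bandwidth constant."""
    return target_gbps_per_core * cores_per_node

baseline = gbps_per_core(40, 8)          # hypothetical 8-core node, 40Gb/sec link
print(baseline)                          # 5.0
print(gbps_per_core(40, 16))             # same link, twice the cores: 2.5
print(link_needed_gbps(baseline, 16))    # 80.0 to keep the 8-core ratio
```

Doubling the cores on the same link halves each core's share, so the link has to double as well just to stand still.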
Re: This doesn't make sense
Why would the IO phase need to write out the sum of all data transmitted during the compute phase?
Not all of the traffic during the compute phase is data, some is control/synchronisation from MPI. Also any data transmitted during the compute phase may have been updated, discarded or expired during compute - the end state or result isn't necessarily the sum of all of the data.
Additionally, the MPI traffic can be quite bursty - so while you want the interconnect to be capable of high performance, its capacity (bandwidth) isn't usually the limiting factor; the speed (latency or message rate) is typically more important.
It is perfectly acceptable to have MPI (compute) and filesystem (IO) traffic on the same interconnect. IB has QoS features to ensure certain traffic types can be prioritised e.g. MPI packets, to ensure latency in the compute phase doesn't skyrocket if there is also IO traffic on the network.
In fact if you have a multi-user, multi-job cluster with jobs starting and stopping asynchronously it's pretty much inevitable that you will have a mix of traffic at any given time on the cluster.
Re: This doesn't make sense
Let's try a very rough back-of-the-envelope estimation. Assuming that the network is fully utilised during the compute phase, the IO phase would be at least as long as the compute phase, because the total state of computation must be dumped and it is more data than exchanged during the compute phase. Worse yet, the IO phase cannot be overlapped with the compute phase of another task, because they compete for the fully utilised network. Which means that even in the ideal state, when the storage system is so blazingly fast that it is the network which is the bottleneck of the IO phase, the duty cycle of the system is less than 50%. A reason for a big lab administrator to have a heart attack. The system seems to be misconfigured.
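The estimate can be written down as a tiny sketch. It takes the same assumptions as the argument above (phases cannot overlap, and the IO phase lasts at least as long as the compute phase); the time units are arbitrary and `duty_cycle` is just a name chosen for the example.

```python
# Duty cycle = fraction of wall-clock time spent computing.
# Assumption from the estimate above: compute and IO phases cannot overlap
# because both saturate the same network.

def duty_cycle(compute_time, io_time):
    """Return compute time as a fraction of the total cycle time."""
    return compute_time / (compute_time + io_time)

print(duty_cycle(1.0, 1.0))   # IO exactly as long as compute: 0.5
print(duty_cycle(1.0, 2.0))   # IO twice as long
```

With the IO phase exactly as long as the compute phase the duty cycle is 50%, and any longer IO phase pushes it below that.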