Original URL: https://www.theregister.com/2010/04/19/rdma_over_ethernet/

Ethernet borgs RDMA from InfiniBand

Rocky protocol means 10 GE is gonna fly now

By Timothy Prickett Morgan

Posted in Networks, 19th April 2010 19:18 GMT

The InfiniBand Trade Association, the champion of the InfiniBand protocol, has announced that after a year and a half of development, it's releasing the spec for its technological crown jewels - for use in its most notable rival, Ethernet.

There are a number of things that keep InfiniBand networks relevant in a world increasingly dominated by Ethernet. One of them is the extra bandwidth that InfiniBand switches and adapters can bring to bear as they link servers and storage. Another is extremely low latency, made possible in large part by a memory access technique that cuts out a lot of the network stack - and the spec for that, oddly enough, is what the IBTA is releasing.

The key InfiniBand technology is called Remote Direct Memory Access, and as the name suggests, it allows servers talking to each other through InfiniBand fabrics to reach out across the wires and talk directly to each other's main and cache memories. With Ethernet networks, if you want something from a peer server in the network, you have to be all polite and talk through the TCP/IP stack and the I/O peripherals, which adds heaps of latency to the request for data.
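
For the curious, here is a rough sketch of what that looks like to software: a one-sided RDMA write posted through the OpenFabrics verbs API (libibverbs). It assumes the queue pair, the registered memory, and the peer's buffer address and key have already been exchanged during setup, and the function and variable names are purely illustrative:

    #include <stddef.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Post a one-sided RDMA write: the local adapter pushes len bytes from a
     * registered local buffer straight into the peer's registered memory.
     * The remote CPU - and the remote TCP/IP stack - never gets involved.
     * qp, mr, remote_addr and rkey are assumed to have been exchanged out of
     * band during connection setup (illustrative names). */
    static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                               void *local_buf, size_t len,
                               uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,  /* registered local buffer */
            .length = (uint32_t)len,
            .lkey   = mr->lkey,              /* local protection key    */
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE, /* one-sided write         */
            .send_flags = IBV_SEND_SIGNALED, /* ask for a completion    */
            .wr.rdma = {
                .remote_addr = remote_addr,  /* peer's virtual address  */
                .rkey        = rkey,         /* peer's remote key       */
            },
        };
        struct ibv_send_wr *bad_wr = NULL;

        return ibv_post_send(qp, &wr, &bad_wr); /* 0 on success */
    }

The point to notice is that nothing has to run on the remote host at all: the adapter at the far end places the data into memory on its own, which is where the latency savings come from.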

But not for much longer. With RDMA over Converged Ethernet, or RoCE for short (pronounced "Rocky," as in the Italian Stallion), Ethernet is, to put it whimsically, gonna fly now.

The specification that the IBTA has cooked up applies to current 10 Gigabit Ethernet devices as well as future 40 GE and 100 GE products. The RDMA features grafted onto Ethernet follow the borging of other InfiniBand technologies - multilane priority scheduling, I/O virtualization, lossless data transfer, Layer 2 network multipathing, and hardware congestion management - to create what is called Converged Enhanced Ethernet. Now RDMA, and the low latency it engenders in InfiniBand switches, can be added to that list.

These CEE features are what have allowed the convergence of server and storage switching thanks to the Fibre Channel over Ethernet (FCoE) protocol, where servers and storage arrays think they are talking to Fibre Channel switches but are actually running the Fibre Channel protocol atop an Ethernet backbone. Similarly, with RoCE, the servers will think they are talking RDMA over InfiniBand when the traffic is really being encapsulated and sent over an Ethernet backbone.

The software side of RoCE is being handled by the OpenFabrics Alliance, whose OpenFabrics Enterprise Distribution (OFED) stack supports RoCE running on Ethernet adapters as of its 1.5.1 release. According to Brian Sparks, senior director of marketing at InfiniBand switch-maker Mellanox and co-chair of the IBTA's marketing working group, some of the Ethernet adapter card makers already have support for RoCE in beta, and announcements will be rolling out shortly.
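
That driver support is most of the story on the software side because the OFED interfaces are fabric-agnostic: an application written against the librdmacm connection manager should, in principle, run over a RoCE-capable Ethernet NIC without source changes. As a hedged sketch (the address and port are placeholders, and the event loop and error cleanup are trimmed for space), client-side setup looks the same whichever fabric sits underneath:

    #include <rdma/rdma_cma.h>
    #include <netdb.h>

    /* Minimal client-side connection setup through librdmacm. The same calls
     * work whether the underlying port is InfiniBand or a RoCE-capable
     * Ethernet NIC; the fabric is hidden behind the connection manager.
     * "192.0.2.10" and port 7471 are illustrative placeholders. */
    int connect_rdma(struct rdma_cm_id **id_out)
    {
        struct rdma_event_channel *ec = rdma_create_event_channel();
        struct rdma_cm_id *id = NULL;
        struct addrinfo *ai = NULL;

        if (!ec || rdma_create_id(ec, &id, NULL, RDMA_PS_TCP))
            return -1;
        if (getaddrinfo("192.0.2.10", "7471", NULL, &ai))
            return -1;

        /* Map the IP address onto an RDMA device and work out a route;
         * whether the packets then ride an InfiniBand link layer or an
         * Ethernet one is the CM's problem, not the application's. */
        if (rdma_resolve_addr(id, NULL, ai->ai_addr, 2000))
            return -1;
        /* ...wait for RDMA_CM_EVENT_ADDR_RESOLVED on ec, then: */
        if (rdma_resolve_route(id, 2000))
            return -1;
        /* ...wait for RDMA_CM_EVENT_ROUTE_RESOLVED, create a queue pair,
         * and call rdma_connect() to finish the handshake. */

        freeaddrinfo(ai);
        *id_out = id;
        return 0;
    }

Which adapter the queue pair ends up riding is decided when the address resolves to a device, not in the application code - which is why, in theory, a driver update plus a RoCE-capable NIC is all an existing verbs application needs.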

So what latency can customers expect to see using RoCE? We're talking around 1.3 microseconds for RDMA over Ethernet, compared to sub-microsecond performance on actual InfiniBand networks using RDMA, according to Paul Grun, the chief scientist at System Fabric Works and a member of the IBTA steering committee. That's still a lot better than the 4.5 microseconds a very fast 10 Gigabit Ethernet switch delivers - a difference that matters to customers for whom every fraction of a microsecond counts.

So does this sound the death knell for InfiniBand? Probably not. But it's sure looking like an octogenarian's birthday party.

Sparks does not agree. "We're trying to provide the best-in-class performance, regardless of fabric," he explains, adding that InfiniBand is already at 40 Gb/sec and will be pushing on to 100 Gb/sec soon, while Ethernet is still at 10 Gb/sec speeds. "InfiniBand will continue to do performance jumps ahead of Ethernet, as it has always done and that will matter to a lot of customers. But for the folks that have invested a lot in Ethernet, just upgrading NICs and getting RDMA over Ethernet will be very attractive."

By the way, RoCE is not just a gussied-up version of the Internet Wide Area RDMA Protocol (iWARP), which encapsulates RDMA traffic high up in the stack, uses TCP/IP drivers running on the servers to move data from machine to machine, and relies on TCP/IP's error-correction methods. With RoCE, servers keep the InfiniBand transport and network layers but swap out the InfiniBand link layer for an Ethernet link layer. This is a much cleaner and presumably higher-performing implementation of RDMA, and it will likely see much broader adoption.
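
To make the difference concrete, here is a rough, simplified sketch of how the two stacks compare, with the common verbs API at the top of both:

    Layer       iWARP                          RoCE
    API         RDMA verbs                     RDMA verbs
    Transport   RDMAP / DDP / MPA over TCP     InfiniBand transport
    Network     IP                             InfiniBand network headers
    Link        Ethernet                       Ethernet

Cutting the TCP/IP layers out of the middle is a big part of what keeps the latency down. ®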