Oi! You noisy servers! Talk among yourselves and stop bothering that poor router!
RDMA-over-Ethernet steps up to v 2.0, promises less chatter so servers can get on with it
The group behind the RDMA over converged Ethernet standard – RoCE to its friends – is tweaking the spec to support UDP and IP in the stack.
RDMA - remote direct memory access - has become increasingly important in large-scale data centres, since it lets data move between different servers' user space without having to drop down through the stack and get put into TCP/IP packets.
The resulting efficiency and low node-to-node latency are particularly important in highly virtualised environments, which is why (for example) Microsoft has made much of its support in Azure.
However, there are reasons to want some routing capability in the RoCE world. Mellanox's Bill Lee, chairman of the Infiniband Trade Association, told The Register that as the scale of racks increases, with each rack acting as its own layer 2 domain, a measure of isolation becomes desirable.
“You need routing between the layer 2 networks,” Lee said. Hence RoCE v2, which slips UDP and IP into the part of the RoCE stack that formerly supported only Infiniband. The new version establishes “east-west” communication in the RoCE world.
RDMA over Converged Ethernet gets a slice of IP in the stack
“We have established the layer 2 routing by adding a UDP header into the RoCE v2 specification,” Lee explained. Because the change only affects layer 3, he said, RoCE remains transparent at layer 2.
The change also maintains backwards compatibility, Lee noted: “applications that work above the Infiniband transport layer are still working; the existing Ethernet fabric works, and it's agnostic to the frame that comes from the IP layer”.
Emulex's Mike Jochimsen, a member of the Infiniband Trade Association working group, explained that the change is “designed for a demand from data centres who are isolating server clusters, and bridging them with their own subnets.
“They wanted RoCE to span across layer 3. Ethernet with priority flow control is the most optimal way of developing a fabric that will support RoCE and RoCE 2, and RoCE is suitable for multiple subnets within a fabric”.
The UDP/IP addition is simple: the frame carries the IP header, including the IP protocol number (ie, the next header will be UDP), and the UDP header, including the UDP port number. The “EtherType” field in the Ethernet header now indicates whether what follows will be Infiniband or UDP/IP. The Ethernet header, Infiniband header, payload, ICRC and FCS fields are unchanged. ®
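For readers who want to see the header checks spelled out, here is a minimal Python sketch of how a receiver might tell the two flavours apart. It is an illustration, not the spec's wording. The numeric values are assumptions the article does not state: EtherType 0x8915 for the original RoCE, EtherType 0x0800 for IPv4, IP protocol 17 for UDP, and UDP destination port 4791 for RoCE v2. The function names are hypothetical, and the sketch handles only untagged IPv4 frames.

```python
import struct

# Assumed values, not given in the article.
ETHERTYPE_IPV4 = 0x0800      # IPv4 follows the Ethernet header
ETHERTYPE_ROCE_V1 = 0x8915   # Infiniband headers follow directly (original RoCE)
IPPROTO_UDP = 17             # IP protocol number meaning "UDP is next"
ROCE_V2_UDP_PORT = 4791      # UDP destination port assumed for RoCE v2

ETH_HEADER_LEN = 14          # dst MAC (6) + src MAC (6) + EtherType (2)


def classify_frame(frame: bytes) -> str:
    """Return 'RoCE v1', 'RoCE v2' or 'other' for an untagged Ethernet frame."""
    if len(frame) < ETH_HEADER_LEN:
        return "other"
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype == ETHERTYPE_ROCE_V1:
        return "RoCE v1"
    if ethertype != ETHERTYPE_IPV4 or len(frame) < ETH_HEADER_LEN + 20:
        return "other"

    # The low nibble of the first IPv4 byte is the header length in 32-bit words.
    ip_header_len = (frame[ETH_HEADER_LEN] & 0x0F) * 4
    protocol = frame[ETH_HEADER_LEN + 9]
    udp_offset = ETH_HEADER_LEN + ip_header_len
    if protocol != IPPROTO_UDP or ip_header_len < 20 or len(frame) < udp_offset + 8:
        return "other"

    # The UDP destination port sits two bytes into the UDP header.
    (dst_port,) = struct.unpack_from("!H", frame, udp_offset + 2)
    return "RoCE v2" if dst_port == ROCE_V2_UDP_PORT else "other"


def build_test_frame(ethertype: int, protocol: int = IPPROTO_UDP,
                     dst_port: int = ROCE_V2_UDP_PORT) -> bytes:
    """Build a zero-filled dummy frame carrying only the fields checked above."""
    eth = bytes(12) + struct.pack("!H", ethertype)
    ip = bytes([0x45]) + bytes(8) + bytes([protocol]) + bytes(10)  # 20-byte IPv4 header
    udp = struct.pack("!HHHH", 49152, dst_port, 8, 0)              # src, dst, length, checksum
    return eth + ip + udp


if __name__ == "__main__":
    print(classify_frame(build_test_frame(ETHERTYPE_ROCE_V1)))
    print(classify_frame(build_test_frame(ETHERTYPE_IPV4)))
    print(classify_frame(build_test_frame(ETHERTYPE_IPV4, dst_port=53)))
```

The dummy frames leave out the Infiniband transport header, payload, ICRC and FCS, since those are the parts the article says stay the same between versions.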