Oracle wants to improve Linux load balancing and failover
Failover and load balancing are native to ordinary interfaces; Big Red reckons RDMA needs bonded channels to match
Oracle reckons Linux remote direct memory access (RDMA) implementations need features like high availability and load balancing, and hopes to sling code into the kernel to do exactly that.
The problem, as Oracle Linux kernel developer Sudhakar Dindukurti explained in this post, is that performance and security considerations mean RDMA adapters tie hardware to a “specific port and path”.
A standard network interface card, on the other hand, can choose which netdev (network device) to use to send a packet, so failover and load balancing are native.
Dindukurti's work aims to bring that capability to both InfiniBand and RoCE (RDMA over Converged Ethernet) NICs – and to move it upstream from Oracle's Unbreakable Enterprise Kernel (UEK) to the Linux source code.
Oracle's Resilient RDMA over IP (RDMAIP) module creates a high availability connection, using active-active bonding to create a bonding group among an adapter's ports. If a port is lost, traffic moves to the other ports in the group. This is done using Oracle's Reliable Datagram Sockets (RDS), which has been in the Linux kernel since 2009.
Extending this to Resilient RDMAIP involves a new process that lets a system send packets to remote nodes, as Oracle's post detailed:
- "1) Client application registers the memory with the RDMA adapter and the RDMA adapter returns an R_Key for the registered memory region to the client. Note that the registration information is saved on the RDMA adapter;
- "2) Client sends this R_Key to the remote server;
- "3) Server includes this R_Key while requesting RDMA_READ/RDMA_WRITE to client"; and
- "4) RDMA adapter on the client side uses the R_Key to find the memory region and proceed with the transaction. Since the R_Key is bound to a particular RDMA adapter, the same R_Key cannot be used to send the data over another RDMA adapter. Also, since RDMA applications can directly talk to the hardware, bypassing the kernel, traditional bonding (which lies in kernel) cannot provide HA."
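The steps above can be sketched as a toy model. This Python snippet is purely illustrative (the class and method names are invented, not the real verbs API or Oracle's code); it shows why an R_Key minted by one adapter is useless on another, since each adapter keeps its own registration table:

```python
# Toy model of RDMA memory registration -- illustrative only, not the
# real libibverbs API. Each adapter holds its own registration table,
# so an R_Key is meaningful only to the adapter that issued it.
import itertools


class RdmaAdapter:
    _rkeys = itertools.count(0x1000)   # pretend-unique R_Key generator

    def __init__(self, name):
        self.name = name
        self.regions = {}              # R_Key -> registered buffer

    def register_memory(self, buf):
        """Step 1: register memory; the R_Key is bound to this adapter."""
        rkey = next(RdmaAdapter._rkeys)
        self.regions[rkey] = buf
        return rkey

    def rdma_read(self, rkey):
        """Step 4: look up the region; fails if the R_Key came from elsewhere."""
        if rkey not in self.regions:
            raise KeyError(f"{self.name}: unknown R_Key {rkey:#x}")
        return self.regions[rkey]


nic0, nic1 = RdmaAdapter("mlx5_0"), RdmaAdapter("mlx5_1")
buf = bytearray(b"payload")
rkey = nic0.register_memory(buf)      # steps 1-2: client registers, shares the R_Key
assert nic0.rdma_read(rkey) is buf    # steps 3-4: works on the owning adapter
try:
    nic1.rdma_read(rkey)              # same R_Key on another adapter: fails,
except KeyError as e:                 # which is why plain bonding can't give HA here
    print("no failover for free:", e)
```

Traditional kernel bonding never sees this problem because it sits below the IP layer; here the hardware itself holds per-adapter state.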
In a load balancing scenario, all the bonding group's interfaces have their own IP addresses, and the "consumer" – that is, an application or an OS process – decides which interface to use.
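One way a consumer might spread load across a group's per-interface addresses is simple round-robin. This is a hypothetical sketch, not RDMAIP's actual policy:

```python
# Hypothetical sketch: with active-active bonding, every interface in the
# group keeps its own IP, and the consumer picks one per connection.
# Round-robin across the group's addresses is the simplest policy.
import itertools


class BondingGroup:
    def __init__(self, addrs):
        self.addrs = list(addrs)                 # one IP per member port
        self._rr = itertools.cycle(self.addrs)   # round-robin iterator

    def next_addr(self):
        """Give the consumer the next address for a new connection."""
        return next(self._rr)


group = BondingGroup(["192.0.2.10", "192.0.2.11"])
print([group.next_addr() for _ in range(4)])
# -> ['192.0.2.10', '192.0.2.11', '192.0.2.10', '192.0.2.11']
```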
Failover is easier, since RDMAIP spots an interface going down. The module moves the failed interface's IP address to another in the group, and an RDMA Communication Manager (RDMA CM) event notifies the relevant kernel processes to change the addresses they use.
Failback is handled the same way: the RDMAIP module moves the traffic back to the address that's recovered, and sends another RDMA CM message.
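The failover and failback flow can be illustrated with another toy model (again, invented names standing in for the RDMAIP module and the RDMA CM event): when a port dies, its IP migrates to a surviving port in the group and consumers are notified; on recovery the address moves back.

```python
# Illustrative failover/failback sketch -- not the actual RDMAIP module.
# A callback stands in for the RDMA CM address-change event.
class RdmaipGroup:
    def __init__(self, ports, notify):
        self.ports = dict(ports)   # port name -> list of IPs it carries
        self.notify = notify       # stand-in for the RDMA CM event

    def port_down(self, failed):
        """Failover: move the failed port's IPs to a surviving port."""
        moved = self.ports[failed]
        self.ports[failed] = []
        survivor = next(p for p in self.ports if p != failed)
        self.ports[survivor] += moved
        for ip in moved:
            self.notify("ADDR_CHANGE", ip, survivor)

    def port_up(self, recovered, ip):
        """Failback: return the IP to the recovered port, notify again."""
        for addrs in self.ports.values():
            if ip in addrs:
                addrs.remove(ip)
        self.ports[recovered].append(ip)
        self.notify("ADDR_CHANGE", ip, recovered)


events = []
group = RdmaipGroup({"port0": ["192.0.2.10"], "port1": ["192.0.2.11"]},
                    notify=lambda *e: events.append(e))
group.port_down("port0")                 # failover: .10 moves to port1
group.port_up("port0", "192.0.2.10")     # failback: .10 returns to port0
print(events)
# -> [('ADDR_CHANGE', '192.0.2.10', 'port1'), ('ADDR_CHANGE', '192.0.2.10', 'port0')]
```

The notifications are the key piece: kernel consumers only learn they must switch addresses because an event fires, which is what the RDMA CM provides in the real design.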
To get this Linux kernel-ready, Dindukurti wrote, the Resilient RDMAIP module needs to be more tightly coupled with the network stack implementation. That would allow RDMA kernel consumers to create active bonding groups, and provide APIs to expose bonding groups and their interfaces. ®