We have a larger Infiniband deployment and want to gradually move servers to a new fabric that uses RDMA over Converged Ethernet (ROCE). The IB fabric is mission critical and complex due to the need for various failover mechanisms, one of them being a secondary IB fabric. It is not possible, resource-wise, to simply rebuild the system using ROCE and then switch over. Both fabrics use RDMA, and we need some way for nodes on each cluster to communicate with each other.

We do not really need memory-to-memory transfers. What we use from the RDMA stack is basically messaging and multicast, so the core services of the RDMA stack are not really needed. UD/UDP is also sufficient; the other protocols may not need to be supported.

Any ideas on how to do this would be appreciated. I have not found anything that could help us here, so we are interested in creating a new piece of Open Source RDMA software that allows the bridging of native IB traffic to ROCE.

Basic Design
------------

Let's say we have a single system that functions as a *bridge* and has one interface going to Infiniband and one going to Ethernet with ROCE. In our use case we do not need any "true" RDMA in the sense of memory-to-memory transfers. We only need UD/UDP messaging and support for multicast. In order to simplify the multicast aspects, the bridge will simply subscribe to the multicast groups of interest when the bridge software starts up.

PROXYARP for regular IP / IPoIB traffic
---------------------------------------

It is possible to do proxyarp on both sides of a Linux system that is connected to both Infiniband and ROCE. Thus the two fabrics can already be seen as a single IP subnet from the kernel stack's perspective, and communication is not a problem for non-RDMA traffic.

PROXYARP means that the MAC address of the bridge is used for all IPoIB addresses on the IB side of the bridge. Similarly, the GID of the bridge is used in all IPoIB packets on the IB side of the bridge that come from the ROCE side. The kernel already removes and adds IPoIB and IP headers as needed.

So this works for regular IP traffic but not for native IB / RDMA packets. IP traffic is only used for non-performance-critical aspects of our application, so performance at this level is not a concern.

Each of the hosts in the bridged IP subnet has 3 addresses: an IPv4 address, a MAC address and a GID.

RDMA Packets (Native IB and ROCE)
=================================

ROCE v2 packets are basically IB packets with another header on top, so the simplistic, idealistic version of how this is going to work is by stripping and adding the UDP ROCE v2 headers around the IB packet.

ROCE packets
------------

UDP ROCE packets sent to the IP addresses on the ROCE side carry the MAC address of the bridge. They already contain the IP address of the other side, which can be used to look up the GID in order to convert the packet and forward it to the Infiniband node.

UD packets
----------

Routing capabilities are limited on the Infiniband side, but one could construct a way to map the GIDs of the hosts on the ROCE side to the LID of the bridge by using the ACM daemon. There will be complications with regard to RDMA_CM support and the details of mapping characteristics between packets, but hopefully this will be fairly manageable.

Multicast packets
-----------------

Multicast packets can be converted easily since a direct mapping is possible between the MAC address used for a multicast group and the MGID in the Infiniband fabric. Otherwise this process is similar to UD/UDP traffic.
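Both the Ethernet multicast MAC and the IPoIB MGID can be derived from the same IP group address, so the translation table can be built whenever the bridge joins a group. Below is a minimal sketch of that derivation, assuming IPv4 groups, the RFC 1112 Ethernet mapping and the RFC 4391 IPoIB mapping. The function names are mine, and the P_Key (0xffff) and link-local scope are placeholders that would have to match the IPoIB broadcast group of the real fabric.

/*
 * Sketch: derive the Ethernet multicast MAC (RFC 1112) and the IPoIB
 * MGID (RFC 4391) for an IPv4 multicast group so the bridge can join
 * the corresponding groups on both fabrics.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

static void ipv4_mcast_to_eth_mac(struct in_addr grp, uint8_t mac[6])
{
	uint32_t ip = ntohl(grp.s_addr);

	/* 01:00:5e plus the lower 23 bits of the group address */
	mac[0] = 0x01; mac[1] = 0x00; mac[2] = 0x5e;
	mac[3] = (ip >> 16) & 0x7f;
	mac[4] = (ip >> 8) & 0xff;
	mac[5] = ip & 0xff;
}

static void ipv4_mcast_to_ib_mgid(struct in_addr grp, uint16_t pkey,
				  uint8_t mgid[16])
{
	uint32_t ip = ntohl(grp.s_addr);

	memset(mgid, 0, 16);
	mgid[0] = 0xff;		/* multicast */
	mgid[1] = 0x12;		/* transient flag, link-local scope (assumed) */
	mgid[2] = 0x40;		/* IPv4-over-IB signature 0x401b */
	mgid[3] = 0x1b;
	mgid[4] = pkey >> 8;	/* P_Key of the IPoIB link */
	mgid[5] = pkey & 0xff;
	/* lower 28 bits of the group address end up in the low bytes */
	mgid[12] = (ip >> 24) & 0x0f;
	mgid[13] = (ip >> 16) & 0xff;
	mgid[14] = (ip >> 8) & 0xff;
	mgid[15] = ip & 0xff;
}

int main(void)
{
	struct in_addr grp;
	uint8_t mac[6], mgid[16];
	int i;

	inet_pton(AF_INET, "239.1.2.3", &grp);		/* example group */
	ipv4_mcast_to_eth_mac(grp, mac);
	ipv4_mcast_to_ib_mgid(grp, 0xffff, mgid);	/* assumed default P_Key */

	printf("MAC  ");
	for (i = 0; i < 6; i++)
		printf("%02x%s", mac[i], i < 5 ? ":" : "\n");
	printf("MGID ");
	for (i = 0; i < 16; i++)
		printf("%02x%s", mgid[i], i < 15 ? ":" : "\n");
	return 0;
}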
Implementation
==============

There are basically three ways to implement this:

A) As an add-on to the RDMA stack in the Linux kernel.

B) As a user space process that operates on RAW sockets and receives traffic filtered by the NIC or by the kernel to process. It would use the same raw sockets to send these packets to the other side.

C) In firmware / logic of the NIC. This is out of reach for us here, I think.

Inbound from ROCE
-----------------

This can be done like in the RXE driver. Simply listening on the UDP port used for ROCE will get us the traffic we need to operate on (a minimal user space sketch is appended at the end of this mail).

Outbound to ROCE
----------------

An Ethernet RAW socket will allow us to create arbitrary datagrams as needed. This has been widely done before in numerous settings, and open source code that does things like this already exists. So this is fairly straightforward.

Inbound from Infiniband
-----------------------

The challenge here is to isolate the traffic that is destined for the bridge itself from the traffic that needs to be forwarded. One way would be to force the inclusion of a Global Route Header in each packet so that the GID can be matched. When the GID does not match the bridge, the traffic would be forwarded to the code which could then do the necessary packet conversion.

I do not know of any way that something like this has been done before. Potentially this means working with flow steering and dealing with firmware issues in the NIC.

Outbound to Infiniband
----------------------

I saw in a recent changelog for the Mellanox NICs that the ability has been added to send raw IB datagrams. If that can be used to construct a packet that comes from one of the GIDs associated with the ROCE IP addresses, then this will work. Otherwise we need some way to set the GID for outbound packets to make this work. The logic needed on the Infiniband side is similar to that required for an Infiniband router.

The biggest risk here seems to be the Infiniband side of things. Is there a way to create a filter for the traffic we need?

Any tips and suggestions on how to approach this problem would be appreciated.

Christoph Lameter, 12. November 2021
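Appendix: Inbound from ROCE sketch
----------------------------------

As referenced under "Inbound from ROCE" above, here is a minimal user space sketch for variant B. It assumes that a plain UDP socket can be bound to the IANA-assigned ROCE v2 port 4791 on the bridge, i.e. that neither a hardware ROCE NIC nor the rdma_rxe driver already owns that port; otherwise flow steering or a raw socket with a filter would be needed. The struct and field names are my own.

/*
 * Sketch: receive ROCE v2 traffic on UDP port 4791 and peel off the
 * IB Base Transport Header so it can be re-emitted as native IB.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define ROCE_V2_UDP_PORT 4791

struct ib_bth {			/* 12-byte IB Base Transport Header */
	uint8_t  opcode;
	uint8_t  se_m_pad_tver;
	uint16_t pkey;
	uint32_t dest_qp;	/* upper byte reserved, lower 24 bits QPN */
	uint32_t psn;		/* ack request bit plus 24-bit PSN */
} __attribute__((packed));

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(ROCE_V2_UDP_PORT),
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	uint8_t buf[4096];

	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("bind ROCE v2 port");
		return 1;
	}

	for (;;) {
		struct sockaddr_in src;
		socklen_t slen = sizeof(src);
		ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
				     (struct sockaddr *)&src, &slen);
		struct ib_bth bth;

		if (n < (ssize_t)sizeof(bth))
			continue;
		memcpy(&bth, buf, sizeof(bth));
		/* The source IP identifies the ROCE-side host; a real bridge
		 * would look up the matching IB address here, prepend a fresh
		 * LRH/GRH, and forward the IB transport headers and payload
		 * (everything up to the trailing 4-byte ICRC) to the
		 * Infiniband side. */
		printf("ROCE v2 packet from %s: opcode 0x%02x, dest QP 0x%06x\n",
		       inet_ntoa(src.sin_addr), bth.opcode,
		       (unsigned)(ntohl(bth.dest_qp) & 0xffffff));
	}
	close(fd);
	return 0;
}

Note that the ICRC would presumably have to be recomputed on the way out, since its coverage differs between ROCE v2 and native IB packets.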