On Fri, Nov 12, 2021 at 01:04:13PM +0100, Christoph Lameter wrote:
> We have a larger Infiniband deployment and want to gradually move servers
> to a new fabric that is using RDMA over Ethernet (ROCE). The IB fabric is
> mission critical and complex due to the need to have various ways to
> prevent failures, one of them being a secondary IB fabric.
>
> It is not possible, resource-wise, to simply rebuild the system using ROCE
> and then switch over.
>
> Both fabrics use RDMA and we need some way for nodes on each cluster to
> communicate with each other. We do not really need memory-to-memory
> transfers. What we use from the RDMA stack is basically messaging and
> multicast, so the core services of the RDMA stack are not needed. Also,
> UD/UDP is sufficient; the other protocols may not be necessary to support.
>
> Any ideas on how to do this would be appreciated. I have not found
> anything that could help us here, so we are interested in creating a new
> piece of open source RDMA software that allows the bridging of native IB
> traffic to ROCE.
>
>
> Basic Design
> ------------
> Let's say we have a single system that functions as a *bridge* and has one
> interface going to Infiniband and one going to Ethernet with ROCE.
>
> In our use case we do not need any "true" RDMA in the sense of
> memory-to-memory transfers. We only need UDP and UD messaging and support
> for multicast.
>
> In order to simplify the multicast aspects, the bridge will simply
> subscribe to the multicast groups of interest when the bridge software
> starts up.
>
> PROXYARP for regular IP / IPoIB traffic
> ---------------------------------------
> It is possible to do proxy ARP on both sides of a Linux system that is
> connected to both Infiniband and ROCE. Thus this can already be seen as a
> single IP subnet from the kernel stack perspective, and communication is
> not a problem for non-RDMA traffic.
>
> PROXYARP means that the MAC address of the bridge is used for all IPoIB
> addresses on the IB side of the bridge. Similarly, the GID of the bridge
> is used in all IPoIB packets on the IB side of the bridge that come from
> the ROCE side.
>
> The kernel already removes and adds IPoIB and IP headers as needed. So
> this works for regular IP traffic but not for native IB / RDMA packets.
> IP traffic is only used for non-performance-critical aspects of our
> application, so performance on this level is not a concern.
>
> Each of the hosts in the bridged IP subnet has 3 addresses: an IPv4
> address, a MAC address and a GID.
>
>
> RDMA Packets (Native IB and ROCE)
> =================================
> ROCE v2 packets are basically IB packets with another header on top, so
> the simplistic, idealistic version of how this is going to work is by
> stripping and adding the UDP ROCE v2 headers to the IB packet.
>
> ROCE packets
> ------------
> UDP ROCE packets sent to the IP addresses on the ROCE side carry the MAC
> address of the bridge. So these already contain the IP address of the
> other side, which can be used to look up the GID in order to convert the
> packet and forward it to the Infiniband node.
>
> UD packets
> ----------
> Routing capabilities are limited on the Infiniband side, but one could
> construct a way to map GIDs for the hosts on the ROCE side to the LID
> of the bridge by using the ACM daemon.
>
> There will be complications with regard to RDMA_CM support and the details
> of mapping characteristics between packets, but hopefully this will be
> fairly manageable.
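
To make the strip-and-re-add idea above concrete, here is a rough C sketch
of the RoCE v2 -> native IB direction. It assumes RoCE v2 over IPv4 without
VLAN tags; the LRH/GRH contents (LIDs, packet length, GIDs derived from the
IP addresses) and the CRC handling are exactly the lookups the bridge would
have to supply, so they are only placeholders here:

    /*
     * Hedged sketch of the RoCE v2 -> IB header translation described
     * above. Header sizes follow the RoCE v2 / IBTA wire formats; all
     * routing fields are placeholder values the bridge must look up.
     */
    #include <stdint.h>
    #include <string.h>

    #define ETH_HLEN   14   /* Ethernet header (no VLAN tag) */
    #define IPV4_HLEN  20   /* IPv4 header without options */
    #define UDP_HLEN    8   /* UDP header, dst port 4791 for RoCE v2 */
    #define GRH_HLEN   40   /* IB Global Route Header (IPv6-like layout) */
    #define LRH_HLEN    8   /* IB Local Route Header */

    /*
     * in:  RoCE v2 frame from the Ethernet side
     *      (Eth | IPv4 | UDP | BTH | payload | ICRC)
     * out: buffer for the native IB packet
     *      (LRH | GRH | BTH | payload) -- ICRC/VCRC left to the sender
     * Returns the IB packet length, or -1 on a short frame.
     */
    static int roce2_to_ib(const uint8_t *in, int in_len,
                           uint8_t *out, uint16_t dlid, uint16_t slid)
    {
        int l4_off = ETH_HLEN + IPV4_HLEN + UDP_HLEN;  /* start of BTH */
        int bth_and_payload;

        if (in_len <= l4_off + 4)          /* +4 for the trailing ICRC */
            return -1;
        bth_and_payload = in_len - l4_off - 4;

        /* Minimal LRH: VL/LVer, SL/LNH (LNH=3: GRH follows), DLID, SLID. */
        memset(out, 0, LRH_HLEN);
        out[1] = 0x03;                     /* LNH = IBA global, GRH present */
        out[2] = dlid >> 8;  out[3] = dlid & 0xff;
        out[6] = slid >> 8;  out[7] = slid & 0xff;
        /* 11-bit packet length field (in 4-byte words) omitted for brevity */

        /* GRH: would be rebuilt from the IPv4 addresses (e.g. the
         * ::ffff:a.b.c.d mapped GIDs); left zeroed in this sketch. */
        memset(out + LRH_HLEN, 0, GRH_HLEN);

        /* BTH and payload are carried over unchanged. */
        memcpy(out + LRH_HLEN + GRH_HLEN, in + l4_off, bth_and_payload);

        return LRH_HLEN + GRH_HLEN + bth_and_payload;
    }

The reverse direction would strip the LRH/GRH and prepend Ethernet, IPv4
and UDP headers built from the same address mappings.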
>
> Multicast packets
> -----------------
>
> Multicast packets can be converted easily since there is a direct mapping
> possible between the MAC address used for a multicast group and the MGID
> in the Infiniband fabric. Otherwise this process is similar to UD/UDP
> traffic.
>
>
> Implementation
> ==============
>
> There are basically three ways to implement this:
>
> A) As an add-on to the RDMA stack in the Linux kernel.
> B) As a user space process that operates on raw sockets and receives
>    traffic filtered by the NIC or by the kernel. It would use the same
>    raw sockets to send the converted packets to the other side.
> C) In the firmware / logic of the NIC. This is out of reach for us here,
>    I think.
>
>
> Inbound from ROCE
> -----------------
> This can be done like in the RXE driver. Simply listening on the UDP port
> for ROCE will get us the traffic we need to operate on.
>
> Outbound to ROCE
> ----------------
> An Ethernet raw socket will allow us to create arbitrary datagrams as
> needed. This has been done widely in numerous settings and open source
> code that does this already exists, so this is fairly straightforward.
>
> Inbound from Infiniband
> -----------------------
> The challenge here is to isolate the traffic that is destined for the
> bridge itself from traffic that needs to be forwarded. One way would be
> to force the inclusion of a Global Route Header in each packet so that
> the GID can be matched. When the GID does not match the bridge, the
> traffic would be forwarded to the code which could then do the necessary
> packet conversion.
>
> I do not know of any way that something like this has been done before.
> Potentially this means working with flow steering and dealing with
> firmware issues in the NIC.
>
> Outbound to Infiniband
> ----------------------
> I saw in a recent changelog for the Mellanox NICs that the ability has
> been added to send raw IB datagrams. If that can be used to construct a
> packet that comes from one of the GIDs associated with the ROCE IP
> addresses, then this will work.
>
> Otherwise we need some way to set the GID for outbound packets to make
> this work.
>
> The logic needed on the Infiniband side is similar to that required for
> an Infiniband router.
>
>
> The biggest risk here seems to be the Infiniband side of things. Is there
> a way to create a filter for the traffic we need?
>
> Any tips and suggestions on how to approach this problem would be
> appreciated.

Mellanox has the Skyway product, which is an IB-to-Ethernet gateway:
https://www.nvidia.com/en-us/networking/infiniband/skyway/

I imagine that it can be extended to perform IB to RoCE too, because it
uses steering to perform the IB to ETH translation.

Thanks

> Christoph Lameter, 12. November 2021
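
Coming back to the implementation options in the quoted mail: for the
inbound-from-ROCE path under option (B), a minimal userspace sketch could
be a plain UDP socket bound to the RoCE v2 port 4791. This assumes nothing
else (e.g. the rxe driver or NIC steering) already owns that port on the
bridge host; the actual IB-side forwarding is not shown:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define ROCE_V2_UDP_PORT 4791   /* IANA-assigned UDP port for RoCE v2 */

    int main(void)
    {
        uint8_t buf[9216];           /* enough for a 4K-MTU RoCE payload */
        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port   = htons(ROCE_V2_UDP_PORT),
            .sin_addr   = { .s_addr = htonl(INADDR_ANY) },
        };

        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("socket/bind");
            return EXIT_FAILURE;
        }

        for (;;) {
            struct sockaddr_in src;
            socklen_t slen = sizeof(src);
            ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                                 (struct sockaddr *)&src, &slen);
            if (n < 0)
                break;
            /* buf now starts with the IB BTH; hand it to the IB-side
             * conversion/forwarding code (not shown in this sketch). */
            printf("RoCE v2 datagram: %zd bytes from %s\n",
                   n, inet_ntoa(src.sin_addr));
        }
        close(fd);
        return 0;
    }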