Re: [RFC] RDMA bridge for ROCE and Infiniband

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Nov 12, 2021 at 01:04:13PM +0100, Christoph Lameter wrote:
> We have a larger Infiniband deployment and want to gradually move servers
> to a new fabric that is using RDMA over Ethernet using ROCE. The IB
> Fabric is mission critical and complex due to the need to have various
> ways to prevent failover with one of them being a secondary IB fabric.
> 
> It is not possible resourcewise to simply rebuild the system using ROCE
> and then switch over.
> 
> Both Fabrics use RDMA and we need some way for nodes on each cluster to
> communicate with each other. We do not really need memory to memory
> transfers. What we use from the RDMA
> stack is basically messaging and multicast. So the really
> core services of the RDMA stack are not needed. Also UD/UDP is sufficient,
> the other protocols may not be necessary to support.
> 
> Any ideas on how to do this would be appreciated. I have not found anything
> that could help us here, so we are interested in creating a new piece of
> Open Source RDMA software that allows the briding of native IB traffic to ROCE.
> 
> 
> Basic Design
> ------------
> Lets say we have a single system that functions as a *bridge* and has one
> interface going to Infiniband and one going to Ethernet with ROCE.
> 
> In our use case we do not need any "true" RDMA in the sense of memory to
> memory transfers. We only need UDP and UD messsaging and support for
> multicast.
> 
> In order to simplify the multicast aspects, the bridge will simply
> subscribe to the multicast groups of interest when the bridge software
> starts up.
> 
> PROXYARP for regular IP / IPoIB traffic
> --------------------------------------
> It is possible to do proxyarp on both sides of a Linux system that is
> connected both to Infiniband and ROCE. And thus this can already be seen
> as a single IP subnet from the Kernel stack perspective and communication
> is not a problem for non RDMA traffic.
> 
> PROXYARP means that the MAC address of the bridge is used for all IPoIB
> addresses on the IB side of the bridge. Similar the GID of the bridge is
> used in all IPoIB Packets on the IB side of the bridge that come from the
> ROCE side.
> 
> The kernel already removes and adds IPoIB and IP headers as needed. So
> this works for regular IP traffic but not for native IB / RDMA packets.
> IP traffic is only used for non performance critical aspects of
> our application and so the performance on this level is not a concern.
> 
> Each of the host in the bridge IP subnet has 3 addresses: An IPv4
> address, a MAC address and a GID.
> 
> 
> RDMA Packets (Native IB and ROCE)
> =================================
> ROCE v2 packets are basically IB packets with another header on top so
> the simplistic idealistic version of how this is going to
> work is by stripping and adding the UDP ROCE v2 headers to the IB packet.
> 
> ROCE packets
> ------------
> UDP roce packets send to the IP addresses on the ROCE side have the MAC
> address of the bridge. So these already contain the IP address for the
> other side that can be used to lookup the GID in order to convert the
> packet and forward it to the Infiniband node.
> 
> UD packets
> ----------
> Routing capabilities are limited on the Infiniband side but one could
> construct a way to map GIDs for the hosts on the ROCE side to the LID
> of the bridge by using the ACM daemon.
> 
> There will be complications regards to RDMA_CM support and the details
> of mapping characteristics between packets but hopefully this will
> be fairly manageable.
> 
> Multicast packets
> -----------------
> 
> Multicast packets can be converted easily since there is a direct
> mapping possible between the MAC address used for a Multicast group and
> the MGID in the Infiniband fabric. Otherwise this process is similar
> to UD/UDP traffic.
> 
> 
> Implementation
> ==============
> 
> There are basically three ways to implement this:
> 
> A) As add on to the RDMA stack in the Linux Kernel
> B) As a user space process that operates on RAW sockets and receives traffic
>    filtered by the NIC or by the kernel to process. It wouild use the same
>    raw sockets to send these packets to the other side.
> C) In firmware / logic of the NIC. This is out of reach of us here
>    I think.
> 
> 
> Inbound from ROCE
> -----------------
> This can be done like in the RXE driver. Simply listening to the UDP port
> for ROCE will get us the traffic we need to operate on.
> 
> Outbound to ROCE
> ----------------
> A Ethernet RAW socket will allow us to create arbitrary datagrams as needed.
> This has been widely done before in numerous settings and code is already
> open sources that does stuff like this. So this is fairly straightforward.
> 
> Inbound from Infiniband
> -----------------------
> The challenge here is to isolate the traffic that is destined for the bridge
> itself from traffic that needs to be forwarded. One way would be to force
> the inclusion of a Global Header in each packet so that the GID can be
> matched. When the GID does not match the bridge then the traffic would be
> forwarded to the code which could then do the necessary packet conversion.
> 
> I do not know of any way that something like this has been done before.
> Potentially this means working with flow steering and dealing with firmware
> issues in the NIC.
> 
> Outbound to Infiniband
> ----------------------
> I saw in a recent changelog for the Mellanox NICs that the ability has
> been added to send raw IB datagrams. If that can be used to construct
> a packet that is coming from one of the GIDs associated with the ROCE IP
> addresses then this will work.
> 
> Otherwise we need to have some way to set the GID for outbound packets
> to make this work.
> 
> The logic needed on Infiniband is similar to that required for an
> Infiniband router.
> 
> 
> 
> 
> The biggest risk here seems to be the Infiniband side of things. Is there
> a way to create a filter for the traffic we need?
> 
> Any tips and suggestions on how to approach this problem would be appreciated.

Mellanox has Skyway product, which is IB to ETH gateway.
https://www.nvidia.com/en-us/networking/infiniband/skyway/

I imagine that it can be extended to perform IB to RoCE too,
because it uses steering to perform IB to ETH translation.

Thanks

> 
> 
> Christoph Lameter, 12. November 2021
> 



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux