> On Sep 27, 2021, at 8:09 AM, Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> 
> On Sun, Sep 26, 2021 at 05:36:01PM +0000, Chuck Lever III wrote:
>> Hi Leon-
>> 
>> Thanks for the suggestion! More below.
>> 
>>> On Sep 26, 2021, at 4:02 AM, Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>>> 
>>> On Fri, Sep 24, 2021 at 03:34:32PM +0000, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=214523
>>>> 
>>>>            Bug ID: 214523
>>>>           Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP
>>>>                    updates during a reconnect
>>>>           Product: Drivers
>>>>           Version: 2.5
>>>>    Kernel Version: 5.14
>>>>          Hardware: All
>>>>                OS: Linux
>>>>              Tree: Mainline
>>>>            Status: NEW
>>>>          Severity: normal
>>>>          Priority: P1
>>>>         Component: Infiniband/RDMA
>>>>          Assignee: drivers_infiniband-rdma@xxxxxxxxxxxxxxxxxxxx
>>>>          Reporter: kolga@xxxxxxxxxx
>>>>        Regression: No
>>>> 
>>>> A RoCE RDMA connection uses the CMA protocol to establish an RDMA
>>>> connection. During setup the code uses hard-coded timeout/retry
>>>> values, which govern how the Connect Request is retried when it is
>>>> not answered. During the retry attempts, ARP updates for the
>>>> destination server are ignored. The current timeout values lead to a
>>>> 4+ minute attempt to connect to a server that no longer owns the IP
>>>> address after the ARP update.
>>>> 
>>>> The ask is to make the timeout/retry values configurable via procfs
>>>> or sysfs. This would allow environments that use RoCE to reduce the
>>>> timeouts to more reasonable values and react to ARP updates faster.
>>>> Other CMA users (e.g. IB) can continue to use the existing values.
>> 
>> I would rather not add a user-facing tunable. The fabric should
>> be better at detecting addressing changes within a reasonable
>> time. It would be helpful to provide a history of why the ARP
>> timeout is so lax -- do certain ULPs rely on it being long?
> 
> I don't know about ULPs and ARPs, but how to calculate TimeWait is
> described in the spec.
> 
> Regarding the tunable, I agree. Because it needs to be per-connection,
> most likely not many people in the world will succeed in configuring
> it properly.

Exactly.

>>>> The problem exists in all kernel versions, but this bugzilla is
>>>> filed against the 5.14 kernel.
>>>> 
>>>> The use case is (RoCE-based) NFSoRDMA, where a server went down and
>>>> another server was brought up in its place. The RDMA layer
>>>> introduces 4+ minutes before an RDMA connection can be
>>>> re-established and I/O can resume, due to its inability to react to
>>>> the ARP update.
>>> 
>>> RDMA-CM has many different timeouts, so I hope that my answer is for
>>> the right timeout.
>>> 
>>> We probably need to extend rdma_connect() to receive a
>>> remote_cm_response_timeout value, so NFSoRDMA can set it to whatever
>>> value is appropriate.
>>> 
>>> The timewait will be calculated based on it in ib_send_cm_req().
>> 
>> I hope a mechanism can be found that behaves the same or nearly the
>> same way for all RDMA fabrics.
> 
> It depends on the fabric itself; remote_cm_response_timeout can be
> different in every network.

What I mean is that I hope a way can be found so that RDMA consumers do
not have to be aware of the fabric differences.
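
For concreteness: if I'm reading drivers/infiniband/core/cma.c correctly,
the values being discussed are the hard-coded CMA_CM_RESPONSE_TIMEOUT
(I believe 20) and CMA_MAX_CM_RETRIES (I believe 15), and the CM encodes
a timeout N as 4.096 usec * 2^N. A quick back-of-the-envelope userspace
sketch (mine, not kernel code) of what that implies for the REQ retry
window:

#include <stdio.h>

/* Rough estimate of the CM REQ retry window using the values I believe
 * are hard-coded in drivers/infiniband/core/cma.c as of 5.14
 * (CMA_CM_RESPONSE_TIMEOUT = 20, CMA_MAX_CM_RETRIES = 15).  The CM
 * encodes a timeout N as 4.096 usec * 2^N.
 */
int main(void)
{
	const unsigned int cm_response_timeout = 20; /* CMA_CM_RESPONSE_TIMEOUT */
	const unsigned int max_cm_retries = 15;      /* CMA_MAX_CM_RETRIES */

	/* Per-attempt wait for a response to the REQ. */
	double per_attempt = 4.096e-6 * (double)(1UL << cm_response_timeout);

	/* Initial send plus up to max_cm_retries retransmissions. */
	double req_window = per_attempt * (max_cm_retries + 1);

	printf("per-attempt timeout: %.1f sec\n", per_attempt);  /* ~4.3 sec */
	printf("REQ retries exhausted after: %.1f sec\n", req_window); /* ~69 sec */
	return 0;
}

That is already more than a minute during which a stale ARP entry is
never revisited; presumably the timewait period and upper-layer
reconnect attempts account for the rest of the 4+ minutes the report
describes.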
>> For those who are not NFS-savvy:
>> 
>> Simple NFS server failover is typically implemented with a heartbeat
>> between two similar platforms that both access the same backend
>> storage. When one platform fails, the other detects it and takes over
>> the failing platform's IP address. Clients detect connection loss
>> with the failing platform, and upon reconnection to that IP address
>> are transparently directed to the other platform.
>> 
>> NFS server vendors have tried to extend this behavior to RDMA fabrics,
>> with varying degrees of success.
>> 
>> In addition to enforcing availability SLAs, the time it takes to
>> re-establish a working connection is critical for NFSv4, because each
>> client maintains a lease to prevent the server from purging open and
>> lock state. If the reconnect takes too long, the client's lease is
>> jeopardized, because other clients can then access files that client
>> might still have locked or open.
>> 
>> 
>> --
>> Chuck Lever

--
Chuck Lever