On Fri, Sep 24, 2021 at 03:34:32PM +0000, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=214523
>
>             Bug ID: 214523
>            Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP
>                     updates during a reconnect
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 5.14
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Infiniband/RDMA
>           Assignee: drivers_infiniband-rdma@xxxxxxxxxxxxxxxxxxxx
>           Reporter: kolga@xxxxxxxxxx
>         Regression: No
>
> RoCE RDMA connection uses CMA protocol to establish an RDMA connection. During
> the setup the code uses hard coded timeout/retry values. These values are used
> to retry the request when a Connect Request is not answered. During
> the retry attempts the ARP updates of the destination server are ignored.
> Current timeout values lead to a 4+ minute long attempt at connecting to a
> server that no longer owns the IP after the ARP update has happened.
>
> The ask is to make the timeout/retry values configurable via procfs or sysfs.
> This will allow environments that use RoCE to reduce the timeouts to more
> reasonable values and be able to react to the ARP updates faster. Other CMA
> users (e.g. IB or others) can continue to use existing values.
>
> The problem exists in all kernel versions but the bugzilla is filed for the
> 5.14 kernel.
>
> The use case is (RoCE-based) NFSoRDMA where a server went down and another
> server was brought up in its place. The RDMA layer introduces a 4+ minute delay
> in being able to re-establish an RDMA connection and let IO resume, due to
> its inability to react to the ARP update.

RDMA-CM has many different timeouts, so I hope that my answer addresses the
right one.

We probably need to extend rdma_connect() to accept a
remote_cm_response_timeout value, so that NFSoRDMA can set it to whatever
value is appropriate. The timewait will then be calculated based on it in
ib_send_cm_req().
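
For a sense of scale, here is a rough sketch of how an IB-style CM timeout
exponent translates into wall-clock time. It is standalone userspace C, not
kernel code. The helper names cm_timeout_us() and worst_case_wait_us() are
made up for illustration, and the exponent and retry values in the comments
are only example inputs; check cma.c for the constants actually in use. It
also ignores packet lifetime and any other delays, so it will not reproduce
the exact figure from the report.

```c
#include <assert.h>
#include <stdint.h>

/*
 * IB CM timeouts are encoded as an exponent n, meaning 4.096 us * 2^n.
 * Returns the timeout in microseconds, rounded down.
 * Hypothetical helper, not a kernel function.
 */
static uint64_t cm_timeout_us(unsigned int exponent)
{
	assert(exponent <= 31);		/* the encoded field is 5 bits wide */
	/* 4.096 us == 4096 ns: shift in ns, then convert to us */
	return (4096ULL << exponent) / 1000;
}

/*
 * Worst-case wait if the initial REQ and every retry go unanswered:
 * one timeout per send, (retries + 1) sends in total.
 * Hypothetical helper, not a kernel function.
 */
static uint64_t worst_case_wait_us(unsigned int exponent, unsigned int retries)
{
	return cm_timeout_us(exponent) * ((uint64_t)retries + 1);
}

/*
 * Example inputs (assumed, for illustration only):
 *   worst_case_wait_us(20, 15) -> about 68.7 s
 *   worst_case_wait_us(14, 15) -> about 1.07 s
 */
```

Each step down in the exponent halves the per-attempt timeout, so a caller
that could pass its own value would have a fairly wide range to pick from.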

Thanks