https://bugzilla.kernel.org/show_bug.cgi?id=214523

Bug ID: 214523
Summary: RDMA Mellanox RoCE drivers are unresponsive to ARP updates during a reconnect
Product: Drivers
Version: 2.5
Kernel Version: 5.14
Hardware: All
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Infiniband/RDMA
Assignee: drivers_infiniband-rdma@xxxxxxxxxxxxxxxxxxxx
Reporter: kolga@xxxxxxxxxx
Regression: No

A RoCE RDMA connection is established using the CMA protocol. During setup, the code uses hard-coded timeout and retry values. These values control how a Connect Request is retried when it goes unanswered. During the retry attempts, ARP updates for the destination server are ignored. With the current timeout values, the client spends 4+ minutes trying to connect to a server that, as of the ARP update, no longer owns the IP address.

The ask is to make the timeout/retry values configurable via procfs or sysfs. This would allow environments that use RoCE to reduce the timeouts to more reasonable values and react to ARP updates faster. Other CMA users (e.g. IB) can continue to use the existing values.

The problem exists in all kernel versions, but this bugzilla is filed against the 5.14 kernel.

The use case is (RoCE-based) NFSoRDMA where a server went down and another server was brought up in its place. Because the RDMA layer cannot react to the ARP update, it takes 4+ minutes to re-establish the RDMA connection and let I/O resume.
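To illustrate why a fixed timeout/retry pair can add up to a long stall, here is a rough sketch of the arithmetic. It assumes the InfiniBand-style timeout encoding (4.096 us times 2 to the power of an exponent) and that every attempt waits out a full timeout before the next retry. The function names and the example values (exponent 20, 15 retries) are assumptions chosen for illustration and are not taken from the report. The real kernel computation may include further terms, so this is not meant to reproduce the 4+ minute figure exactly.

```python
def cm_timeout_seconds(exponent: int) -> float:
    """Decode an IB-style timeout exponent: 4.096 us * 2**exponent."""
    return 4.096e-6 * (2 ** exponent)


def worst_case_connect_seconds(exponent: int, max_retries: int) -> float:
    """Upper bound on time spent before giving up on a connect.

    Assumes one initial request plus max_retries retransmissions,
    each of which waits for the full timeout (hypothetical model).
    """
    attempts = 1 + max_retries
    return attempts * cm_timeout_seconds(exponent)


if __name__ == "__main__":
    # Illustrative values only; substitute the constants from your kernel.
    for exponent, retries in [(20, 15), (18, 5), (16, 3)]:
        total = worst_case_connect_seconds(exponent, retries)
        print(f"exponent={exponent} retries={retries} -> {total:.2f} s")
```

Because the per-attempt timeout grows exponentially with the exponent, lowering it by a few steps shortens the stall far more than trimming the retry count, which is one reason a tunable for both values would be useful.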