On 2024-05-05 10:35:46, Chuck Lever wrote: > On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote: > > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount > > underflow. > > > > The specific scenario that caused this to happen is IB device bonding, > > when bringing down one of the ports, or all ports. The situation is not > > just a print - it also causes a non-recoverable state it is not even > > possible to complete the disconnect and shut it down the mount, > > requiring a reboot, suggesting that tear-down is also incomplete in this > > state. > > > > The trivial fix seems to work as such - if we did not receive a > > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise > > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of > > `rpcrdma_xprt_disconnect`. > > > > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed') > > Hi Dan, thanks for the bug report! > [..] > Second, I wonder if, when bonding interfaces, there's an opportunity > for the verbs consumer to take an additional transport reference. > Cc'ing linux-rdma for input on that issue. Can you show a brief > diagram of when the ep reference count changes when bonding? Not sure we need an additional reference here. I understand regarding rpcrdma managing its internal EP refcount, that it is having three in total: one for init, another one for ESTABLISHED mode, and a third for 'outstanding receives', so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me, and I found some more about it, please read on. > Also, I note that the WARNING below happened on a RHEL9 kernel: > > 5.14.0-284.11.1.el9_2.x86_64 > > Can you confirm that this issue reproduces on v6.8 and newer ? Some context: I originally tested with a version that also has implemented a timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat 'culprit' in the negative decref, but there is the other issue and second order effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes the unkillable transport, so we are not reaching teardown at all. With both of these changes applied both problems are gone. As for upstream version testing, with the elrepo build of 6.8.9 which matches vanilla, I don't see the negative kref but I do see the other effect of `XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below), which does not release even after ports are back up online. Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);` for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery on bonding, I'll post in a reply. ``` INFO: task kworker/u128:6:1688 blocked for more than 122 seconds. Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u128:6 state:D stack:0 pid:1688 tgid:1688 ppid:2 flags:0x00004000 Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core] Call Trace: <TASK> __schedule+0x21c/0x680 schedule+0x31/0xd0 schedule_timeout+0x148/0x160 ? mutex_lock+0xe/0x30 ? cma_process_remove+0x218/0x260 [rdma_cm] ? preempt_count_add+0x70/0xa0 __wait_for_common+0x90/0x1e0 ? __pfx_schedule_timeout+0x10/0x10 cma_remove_one+0x5c/0xd0 [rdma_cm] remove_client_context+0x88/0xd0 [ib_core] disable_device+0x8a/0x160 [ib_core] ? _raw_spin_unlock+0x15/0x30 __ib_unregister_device+0x42/0xb0 [ib_core] ib_unregister_device+0x22/0x30 [ib_core] mlx5r_remove+0x36/0x60 [mlx5_ib] auxiliary_bus_remove+0x18/0x30 device_release_driver_internal+0x193/0x200 bus_remove_device+0xbf/0x130 device_del+0x157/0x3e0 ? devl_param_driverinit_value_get+0x29/0xa0 mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core] mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core] mlx5_do_bond+0x2cb/0x390 [mlx5_core] mlx5_do_bond_work+0x96/0xf0 [mlx5_core] process_one_work+0x174/0x340 worker_thread+0x27e/0x390 ? __pfx_worker_thread+0x10/0x10 kthread+0xee/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2d/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> INFO: task fio:10584 blocked for more than 122 seconds. Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1 ``` -- Dan Aloni