On Sun, May 05, 2024 at 09:36:28PM +0300, Dan Aloni wrote: > On 2024-05-05 10:35:46, Chuck Lever wrote: > > On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote: > > > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount > > > underflow. > > > > > > The specific scenario that caused this to happen is IB device bonding, > > > when bringing down one of the ports, or all ports. The situation is not > > > just a print - it also causes a non-recoverable state it is not even > > > possible to complete the disconnect and shut it down the mount, > > > requiring a reboot, suggesting that tear-down is also incomplete in this > > > state. > > > > > > The trivial fix seems to work as such - if we did not receive a > > > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise > > > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of > > > `rpcrdma_xprt_disconnect`. > > > > > > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed') > > > > Hi Dan, thanks for the bug report! > > > [..] > > Second, I wonder if, when bonding interfaces, there's an opportunity > > for the verbs consumer to take an additional transport reference. > > Cc'ing linux-rdma for input on that issue. Can you show a brief > > diagram of when the ep reference count changes when bonding? > > Not sure we need an additional reference here. We already have two mechanisms in play: - the ep reference count - the re_connect_status value I would prefer not to add a boolean here. Seems like re_connect_status could do that job if we need it to. I seem to recall that when a device is removed, the verbs consumer doesn't get a DISCONNECT first. Or does it? Having a sequence of CM events that you see in the bonded case would help a lot. > I understand regarding rpcrdma > managing its internal EP refcount, that it is having three in total: one for > init, another one for ESTABLISHED mode, and a third for 'outstanding receives', > so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me, > and I found some more about it, please read on. > > > Also, I note that the WARNING below happened on a RHEL9 kernel: > > > > 5.14.0-284.11.1.el9_2.x86_64 > > > > Can you confirm that this issue reproduces on v6.8 and newer ? > > Some context: I originally tested with a version that also has implemented a > timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat > 'culprit' in the negative decref, but there is the other issue and second order > effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes > the unkillable transport, so we are not reaching teardown at all. With both of > these changes applied both problems are gone. > > As for upstream version testing, with the elrepo build of 6.8.9 which matches > vanilla, I don't see the negative kref but I do see the other effect of > `XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below), > which does not release even after ports are back up online. > > Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);` > for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery > on bonding, I'll post in a reply. It looks like you are trying to fix several issues in one patch. So I need you to separate these issues and the fixes, and let's focus on the upstream kernel (v6.9-rc6) because there's nothing I can do about the RHEL9 kernel, which is clearly a different source base than the one I work on. If we need a "wake_up_all(&ep->re_connect_wait);" during DEVICE_REMOVAL, that should be a separate patch. And you need to figure out if ADDR_CHANGE needs the same treatment: the v2 you just sent changes the behavior of ADDR_CHANGE too but does not mention whether it intended that change. Without a reproducer or a detailed explanation of how the ep reference count changes in step with CM events, I can't properly assess your proposed fix. > ``` > INFO: task kworker/u128:6:1688 blocked for more than 122 seconds. > Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:kworker/u128:6 state:D stack:0 pid:1688 tgid:1688 ppid:2 flags:0x00004000 > Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core] > Call Trace: > <TASK> > __schedule+0x21c/0x680 > schedule+0x31/0xd0 > schedule_timeout+0x148/0x160 > ? mutex_lock+0xe/0x30 > ? cma_process_remove+0x218/0x260 [rdma_cm] > ? preempt_count_add+0x70/0xa0 > __wait_for_common+0x90/0x1e0 > ? __pfx_schedule_timeout+0x10/0x10 > cma_remove_one+0x5c/0xd0 [rdma_cm] > remove_client_context+0x88/0xd0 [ib_core] > disable_device+0x8a/0x160 [ib_core] > ? _raw_spin_unlock+0x15/0x30 > __ib_unregister_device+0x42/0xb0 [ib_core] > ib_unregister_device+0x22/0x30 [ib_core] > mlx5r_remove+0x36/0x60 [mlx5_ib] > auxiliary_bus_remove+0x18/0x30 > device_release_driver_internal+0x193/0x200 > bus_remove_device+0xbf/0x130 > device_del+0x157/0x3e0 > ? devl_param_driverinit_value_get+0x29/0xa0 > mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core] > mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core] > mlx5_do_bond+0x2cb/0x390 [mlx5_core] > mlx5_do_bond_work+0x96/0xf0 [mlx5_core] > process_one_work+0x174/0x340 > worker_thread+0x27e/0x390 > ? __pfx_worker_thread+0x10/0x10 > kthread+0xee/0x120 > ? __pfx_kthread+0x10/0x10 > ret_from_fork+0x2d/0x50 > ? __pfx_kthread+0x10/0x10 > ret_from_fork_asm+0x1b/0x30 > </TASK> > INFO: task fio:10584 blocked for more than 122 seconds. > Tainted: G E 6.8.9-1.el9.elrepo.x86_64 #1 > ``` > > -- > Dan Aloni -- Chuck Lever