Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen

Dan Aloni <dan.aloni@xxxxxxxxxxxx> · Sun, 5 May 2024 21:36:28 +0300

On 2024-05-05 10:35:46, Chuck Lever wrote:
> On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> > underflow.
> > 
> > The specific scenario that caused this to happen is IB device bonding,
> > when bringing down one of the ports, or all ports. The situation is not
> > just a print - it also causes a non-recoverable state it is not even
> > possible to complete the disconnect and shut it down the mount,
> > requiring a reboot, suggesting that tear-down is also incomplete in this
> > state.
> > 
> > The trivial fix seems to work as such - if we did not receive a
> > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> > `rpcrdma_xprt_disconnect`.
> > 
> > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
> 
> Hi Dan, thanks for the bug report!
> 
[..]
> Second, I wonder if, when bonding interfaces, there's an opportunity
> for the verbs consumer to take an additional transport reference.
> Cc'ing linux-rdma for input on that issue. Can you show a brief
> diagram of when the ep reference count changes when bonding?

Not sure we need an additional reference here. I understand regarding rpcrdma
managing its internal EP refcount, that it is having three in total: one for
init, another one for ESTABLISHED mode, and a third for 'outstanding receives',
so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me,
and I found some more about it, please read on.

> Also, I note that the WARNING below happened on a RHEL9 kernel:
> 
>    5.14.0-284.11.1.el9_2.x86_64
> 
> Can you confirm that this issue reproduces on v6.8 and newer ?

Some context: I originally tested with a version that also has implemented a
timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat
'culprit' in the negative decref, but there is the other issue and second order
effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes
the unkillable transport, so we are not reaching teardown at all. With both of
these changes applied both problems are gone.

As for upstream version testing, with the elrepo build of 6.8.9 which matches
vanilla, I don't see the negative kref but I do see the other effect of
`XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below),
which does not release even after ports are back up online.

Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
on bonding, I'll post in a reply.

```
INFO: task kworker/u128:6:1688 blocked for more than 122 seconds.
      Tainted: G            E      6.8.9-1.el9.elrepo.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u128:6  state:D stack:0     pid:1688  tgid:1688  ppid:2      flags:0x00004000
Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
Call Trace:
 <TASK>
 __schedule+0x21c/0x680
 schedule+0x31/0xd0
 schedule_timeout+0x148/0x160
 ? mutex_lock+0xe/0x30
 ? cma_process_remove+0x218/0x260 [rdma_cm]
 ? preempt_count_add+0x70/0xa0
 __wait_for_common+0x90/0x1e0
 ? __pfx_schedule_timeout+0x10/0x10
 cma_remove_one+0x5c/0xd0 [rdma_cm]
 remove_client_context+0x88/0xd0 [ib_core]
 disable_device+0x8a/0x160 [ib_core]
 ? _raw_spin_unlock+0x15/0x30
 __ib_unregister_device+0x42/0xb0 [ib_core]
 ib_unregister_device+0x22/0x30 [ib_core]
 mlx5r_remove+0x36/0x60 [mlx5_ib]
 auxiliary_bus_remove+0x18/0x30
 device_release_driver_internal+0x193/0x200
 bus_remove_device+0xbf/0x130
 device_del+0x157/0x3e0
 ? devl_param_driverinit_value_get+0x29/0xa0
 mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core]
 mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core]
 mlx5_do_bond+0x2cb/0x390 [mlx5_core]
 mlx5_do_bond_work+0x96/0xf0 [mlx5_core]
 process_one_work+0x174/0x340
 worker_thread+0x27e/0x390
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xee/0x120
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2d/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
INFO: task fio:10584 blocked for more than 122 seconds.
      Tainted: G            E      6.8.9-1.el9.elrepo.x86_64 #1
```

-- 
Dan Aloni