Re: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen

Chuck Lever <chuck.lever@xxxxxxxxxx> · Sun, 5 May 2024 15:00:10 -0400

On Sun, May 05, 2024 at 09:36:28PM +0300, Dan Aloni wrote:
> On 2024-05-05 10:35:46, Chuck Lever wrote:
> > On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> > > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> > > underflow.
> > > 
> > > The specific scenario that caused this to happen is IB device bonding,
> > > when bringing down one of the ports, or all ports. The situation is not
> > > just a print - it also causes a non-recoverable state it is not even
> > > possible to complete the disconnect and shut it down the mount,
> > > requiring a reboot, suggesting that tear-down is also incomplete in this
> > > state.
> > > 
> > > The trivial fix seems to work as such - if we did not receive a
> > > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> > > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> > > `rpcrdma_xprt_disconnect`.
> > > 
> > > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
> > 
> > Hi Dan, thanks for the bug report!
> > 
> [..]
> > Second, I wonder if, when bonding interfaces, there's an opportunity
> > for the verbs consumer to take an additional transport reference.
> > Cc'ing linux-rdma for input on that issue. Can you show a brief
> > diagram of when the ep reference count changes when bonding?
> 
> Not sure we need an additional reference here.

We already have two mechanisms in play:

- the ep reference count
- the re_connect_status value

I would prefer not to add a boolean here. Seems like
re_connect_status could do that job if we need it to.

I seem to recall that when a device is removed, the verbs consumer
doesn't get a DISCONNECT first. Or does it? Having a sequence of
CM events that you see in the bonded case would help a lot.

> I understand regarding rpcrdma
> managing its internal EP refcount, that it is having three in total: one for
> init, another one for ESTABLISHED mode, and a third for 'outstanding receives',
> so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me,
> and I found some more about it, please read on.
> 
> > Also, I note that the WARNING below happened on a RHEL9 kernel:
> > 
> >    5.14.0-284.11.1.el9_2.x86_64
> > 
> > Can you confirm that this issue reproduces on v6.8 and newer ?
> 
> Some context: I originally tested with a version that also has implemented a
> timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat
> 'culprit' in the negative decref, but there is the other issue and second order
> effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes
> the unkillable transport, so we are not reaching teardown at all. With both of
> these changes applied both problems are gone.
> 
> As for upstream version testing, with the elrepo build of 6.8.9 which matches
> vanilla, I don't see the negative kref but I do see the other effect of
> `XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below),
> which does not release even after ports are back up online.
> 
> Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
> for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
> on bonding, I'll post in a reply.

It looks like you are trying to fix several issues in one patch. So
I need you to separate these issues and the fixes, and let's focus
on the upstream kernel (v6.9-rc6) because there's nothing I can do
about the RHEL9 kernel, which is clearly a different source base
than the one I work on.

If we need a "wake_up_all(&ep->re_connect_wait);" during
DEVICE_REMOVAL, that should be a separate patch. And you need to
figure out if ADDR_CHANGE needs the same treatment: the v2 you just
sent changes the behavior of ADDR_CHANGE too but does not mention
whether it intended that change.

Without a reproducer or a detailed explanation of how the ep
reference count changes in step with CM events, I can't properly
assess your proposed fix.

> ```
> INFO: task kworker/u128:6:1688 blocked for more than 122 seconds.
>       Tainted: G            E      6.8.9-1.el9.elrepo.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u128:6  state:D stack:0     pid:1688  tgid:1688  ppid:2      flags:0x00004000
> Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
> Call Trace:
>  <TASK>
>  __schedule+0x21c/0x680
>  schedule+0x31/0xd0
>  schedule_timeout+0x148/0x160
>  ? mutex_lock+0xe/0x30
>  ? cma_process_remove+0x218/0x260 [rdma_cm]
>  ? preempt_count_add+0x70/0xa0
>  __wait_for_common+0x90/0x1e0
>  ? __pfx_schedule_timeout+0x10/0x10
>  cma_remove_one+0x5c/0xd0 [rdma_cm]
>  remove_client_context+0x88/0xd0 [ib_core]
>  disable_device+0x8a/0x160 [ib_core]
>  ? _raw_spin_unlock+0x15/0x30
>  __ib_unregister_device+0x42/0xb0 [ib_core]
>  ib_unregister_device+0x22/0x30 [ib_core]
>  mlx5r_remove+0x36/0x60 [mlx5_ib]
>  auxiliary_bus_remove+0x18/0x30
>  device_release_driver_internal+0x193/0x200
>  bus_remove_device+0xbf/0x130
>  device_del+0x157/0x3e0
>  ? devl_param_driverinit_value_get+0x29/0xa0
>  mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core]
>  mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core]
>  mlx5_do_bond+0x2cb/0x390 [mlx5_core]
>  mlx5_do_bond_work+0x96/0xf0 [mlx5_core]
>  process_one_work+0x174/0x340
>  worker_thread+0x27e/0x390
>  ? __pfx_worker_thread+0x10/0x10
>  kthread+0xee/0x120
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork+0x2d/0x50
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork_asm+0x1b/0x30
>  </TASK>
> INFO: task fio:10584 blocked for more than 122 seconds.
>       Tainted: G            E      6.8.9-1.el9.elrepo.x86_64 #1
> ```
> 
> -- 
> Dan Aloni

-- 
Chuck Lever