On Fri, Jun 26, 2020 at 08:56:41AM -0400, Chuck Lever wrote: > > On Jun 26, 2020, at 3:10 AM, Dan Aloni <dan@xxxxxxxxxxxx> wrote: [..] > > - Add a mutex in `rpcrdma_ep_destroy` to guard against concurrent calls > > to `rpcrdma_xprt_disconnect` coming from either `rpcrdma_xprt_connect` > > or `xprt_rdma_close`. > > NAK. The RPC client provides appropriate exclusion, please let's not > add more serialization that can introduce further deadlocks. It appeared to me that this exclusion does not works well. As for my considerations, if I am not mistaken from analyzing crashes I've seen: -> xprt_autoclose (running on xprtiod) -> xprt->ops->close -> xprt_rdma_close -> rpcrdma_xprt_disconnect and: -> xprt_rdma_connect_worker (running on xprtiod) -> rpcrdma_xprt_connect -> rpcrdma_xprt_disconnect I understand the rationale or at least the aim that `close` and `connect` ops should not be concurrent on the same `xprt`, however: * `xprt_force_disconnect`, is called from various places, queues a call to `xprt_autoclose` to the background on `xprtiod` workqueue item, conditioned that `!XPRT_LOCKED` which is the case for connect that went to the background. * `xprt_rdma_connect` also sends `xprt_rdma_connect_worker` as an `xprtiod` workqueue item, unconditionally. So we have two work items that can run in parallel, and I don't see clear gating on this from the code. Maybe there's a simpler fix for this. Perhaps a `cancel_delayed_work_sync(&r_xprt->rx_connect_worker);` is appropriate in `xprt_rdma_close`? -- Dan Aloni