On Mon, 2024-05-06 at 16:06 -0400, Chuck Lever wrote:
> On Mon, May 06, 2024 at 12:37:59PM +0300, Dan Aloni wrote:
> > Under the scenario of IB device bonding, when bringing down one of
> > the ports, or all ports, we saw xprtrdma entering a non-recoverable
> > state where it is not even possible to complete the disconnect and
> > shut down the mount, requiring a reboot. Following debug, we saw
> > that transport connect never ended after receiving the
> > RDMA_CM_EVENT_DEVICE_REMOVAL callback.
> >
> > The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is
> > connected, and ESTABLISHED may not have happened. So we need to
> > handle each of these states accordingly.
> >
> > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
> > Cc: Sagi Grimberg <sagi.grimberg@xxxxxxxxxxxx>
> > Signed-off-by: Dan Aloni <dan.aloni@xxxxxxxxxxxx>
> > ---
> >  net/sunrpc/xprtrdma/verbs.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> > index 4f8d7efa469f..432557a553e7 100644
> > --- a/net/sunrpc/xprtrdma/verbs.c
> > +++ b/net/sunrpc/xprtrdma/verbs.c
> > @@ -244,7 +244,11 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
> >          case RDMA_CM_EVENT_DEVICE_REMOVAL:
> >                  pr_info("rpcrdma: removing device %s for %pISpc\n",
> >                          ep->re_id->device->name, sap);
> > -                fallthrough;
> > +                switch (xchg(&ep->re_connect_status, -ENODEV)) {
> > +                case 0: goto wake_connect_worker;
> > +                case 1: goto disconnected;
> > +                }
> > +                return 0;
> >          case RDMA_CM_EVENT_ADDR_CHANGE:
> >                  ep->re_connect_status = -ENODEV;
> >                  goto disconnected;
> > --
> > 2.39.3
> >
>
> Hi Anna,
>
> Please apply this patch with:
>
> Reviewed-by: Sagi Grimberg <sagi@xxxxxxxxxxx>
> Reviewed-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>

Anna is back on leave for a few weeks, so I'll take it.
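
For anyone following the thread, the dispatch in the quoted hunk can be
modelled in userspace. This is only a rough sketch, not the kernel code:
C11 atomic_exchange() stands in for the kernel's xchg(), and the enum,
the function name on_device_removal() and the reading of the status
values (0 = connect still pending, 1 = established, anything else =
already failed) are assumptions made for illustration.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical names, not taken from verbs.c. */
enum removal_action {
    ACT_WAKE_CONNECT_WORKER, /* connect was still pending: wake the waiter */
    ACT_DISCONNECTED,        /* connection was established: tear it down */
    ACT_NONE                 /* status already held an error: nothing to do */
};

/*
 * Atomically replace the connect status with -ENODEV and choose an
 * action based on the value that was there before the exchange.
 */
static enum removal_action on_device_removal(atomic_int *connect_status)
{
    switch (atomic_exchange(connect_status, -ENODEV)) {
    case 0:
        return ACT_WAKE_CONNECT_WORKER;
    case 1:
        return ACT_DISCONNECTED;
    default:
        return ACT_NONE;
    }
}
```

Because the old value is read and the new one stored in a single atomic
step, a second removal event in this model sees -ENODEV and falls
through to ACT_NONE, which would correspond to the "return 0" in the
hunk.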
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx