> -----Original Message----- > From: Steve Wise [mailto:swise@xxxxxxxxxxxxxxxxxxxxx] > Sent: Monday, July 21, 2014 8:53 PM > To: 'Chuck Lever' > Cc: Devesh Sharma; 'Shirley Ma'; 'Hefty, Sean'; 'Roland Dreier'; linux- > rdma@xxxxxxxxxxxxxxx > Subject: RE: [for-next 1/2] xprtrdma: take reference of rdma provider > module > > > > > -----Original Message----- > > From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx] > > Sent: Monday, July 21, 2014 10:21 AM > > To: Steve Wise > > Cc: Devesh Sharma; Shirley Ma; Hefty, Sean; Roland Dreier; > > linux-rdma@xxxxxxxxxxxxxxx > > Subject: Re: [for-next 1/2] xprtrdma: take reference of rdma provider > > module > > > > > > On Jul 21, 2014, at 11:03 AM, Steve Wise <swise@xxxxxxxxxxxxxxxxxxxxx> > wrote: > > > > > > > > > > >> -----Original Message----- > > >> From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx] > > >> Sent: Monday, July 21, 2014 9:54 AM > > >> To: Devesh Sharma > > >> Cc: Shirley Ma; Steve Wise; Hefty, Sean; Roland Dreier; > > >> linux-rdma@xxxxxxxxxxxxxxx > > >> Subject: Re: [for-next 1/2] xprtrdma: take reference of rdma > > >> provider module > > >> > > >> Hi Devesh- > > >> > > >> Thanks for drilling into this further. > > >> > > >> On Jul 21, 2014, at 7:48 AM, Devesh Sharma > <Devesh.Sharma@xxxxxxxxxx> wrote: > > >> > > >>> In rpcrdma_ep_connect(): > > >>> > > >>> write_lock(&ia->ri_qplock); > > >>> old = ia->ri_id; > > >>> ia->ri_id = id; > > >>> write_unlock(&ia->ri_qplock); > > >>> > > >>> rdma_destroy_qp(old); > > >>> rdma_destroy_id(old); =============> Cm -id is destroyed > here. > > >>> > > >>> > > >>> If following code fails in rpcrdma_ep_connect(): > > >>> id = rpcrdma_create_id(xprt, ia, > > >>> (struct sockaddr *)&xprt->rx_data.addr); > > >>> if (IS_ERR(id)) { > > >>> rc = -EHOSTUNREACH; > > >>> goto out; > > >>> } > > >>> > > >>> it leaves old cm-id still alive. This will always fail if Device > > >>> is removed > abruptly. > > >> > > >> For CM_EVENT_DEVICE_REMOVAL, rpcrdma_conn_upcall() sets > > >> ep->rep_connected to -ENODEV. > > >> > > >> Then: > > >> > > >> 929 int > > >> 930 rpcrdma_ep_connect(struct rpcrdma_ep *ep, struct rpcrdma_ia > > >> *ia) > > >> 931 { > > >> 932 struct rdma_cm_id *id, *old; > > >> 933 int rc = 0; > > >> 934 int retry_count = 0; > > >> 935 > > >> 936 if (ep->rep_connected != 0) { > > >> 937 struct rpcrdma_xprt *xprt; > > >> 938 retry: > > >> 939 dprintk("RPC: %s: reconnecting...\n", __func__); > > >> > > >> ep->rep_connected is probably -ENODEV after a device removal. It > > >> ep->would be > > >> possible for the connect worker to destroy everything associated > > >> with this connection in that case to ensure the underlying object > > >> reference counts are cleared. > > >> > > >> The immediate danger is that if there are pending RPCs, they could > > >> exit while qp/cm_id are NULL, triggering a panic in > rpcrdma_deregister_frmr_external(). > > >> Checking for NULL pointers inside the ri_qplock would prevent that. > > >> > > >> However, NFS mounts via this adapter will hang indefinitely after > > >> all transports are torn down and the adapter is gone. The only > > >> thing that can be done is something drastic like "echo b > > /proc/sysrq_trigger" on the client. > > >> > > >> Thus, IMO hot-plugging or passive fail-over are the only scenarios > > >> where this makes sense. If we have an immediate problem here, is it > > >> a problem with system shutdown ordering that can be addressed in > some other way? > > >> > > >> Until that support is in place, obviously I would prefer that the > > >> removal of the underlying driver be prevented while there are NFS > > >> mounts in place. I think that's what NFS users have come to expect. > > >> > > >> In other words, don't allow device removal until we have support > > >> for device insertion :-) This needs a fresh series of patches? > > >> > > >> > > > > > > > > > If we fix the above problems on provider unload, shouldn't the mount > > > recover if the provider module is subsequently loaded? Or another > > > provider configured such that > > > rdma_resolve_addr/route() then picks an active device? > > > > On device removal, we have to destroy everything. > > > > On insertion, we'll have to create a fresh PD and MRs, which isn't > > done now during reconnect. So, insertion needs work too. Makes Sense to me too. Thanks Chuck. -Regards Devesh > > > > Got it, thanks! > > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html