RE: [PATCH v1] xprtrdma: Fix corner cases when handling device removal

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



> From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
> Sent: Tuesday, March 20, 2018 2:40 AM
> > On Mar 19, 2018, at 2:06 PM, Chuck Lever <chuck.lever@xxxxxxxxxx>
> wrote:
> >> On Mar 19, 2018, at 1:26 PM, Kalderon, Michal
> <Michal.Kalderon@xxxxxxxxxx> wrote:
> >>
> >>> From: Kalderon, Michal
> >>> Sent: Monday, March 19, 2018 7:22 PM
> >>> To: 'Chuck Lever' <chuck.lever@xxxxxxxxxx>
> >>> Cc: linux-rdma@xxxxxxxxxxxxxxx
> >>> Subject: RE: [PATCH v1] xprtrdma: Fix corner cases when handling
> >>> device removal
> >>>
> >>>> From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
> >>>> Sent: Monday, March 19, 2018 4:59 PM
> >>>> To: Kalderon, Michal <Michal.Kalderon@xxxxxxxxxx>
> >>>> Cc: linux-rdma@xxxxxxxxxxxxxxx
> >>>> Subject: Re: [PATCH v1] xprtrdma: Fix corner cases when handling
> >>>> device removal
> >>>>
> >>>> Hi Michal-
> >>>>
> >>>> Have you tried this one? If we can nail this down now, I can easily
> >>>> get it into the v4.17 merge window.
> >>>
> >>> Yes, looks good thanks.
> >>
> >> Chuck, while testing changes, I ran into a different issue though.
> >> If I perform mount to a server that does not have the nfs rdma port
> >> in portlist using mount -o rdma, And then  perform rmmod qedr, it hangs
> with following stack :
> >>
> >> root@lb-tlvb-pcie37 ~]# cat /proc/5981/stack [<0>]
> >> rpcrdma_conn_upcall+0x26a/0x310 [rpcrdma] [<0>]
> >> cma_remove_one+0x26d/0x2a0 [rdma_cm] [<0>]
> >> ib_unregister_device+0xcc/0x180 [ib_core] [<0>]
> qedr_remove+0x37/0x80
> >> [qedr] [<0>] qede_rdma_unregister_driver+0x4b/0x90 [qede] [<0>]
> >> SyS_delete_module+0x1c2/0x240 [<0>] do_syscall_64+0x6e/0x190 [<0>]
> >> entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> >>
> >> (wait_for_completion(&ia->ri_remove_done)
> >
> > I will try to reproduce it here.
> 
> I can't reproduce with my IB or my RoCE rig, but I might not have understood
> how you are getting into this state. I'm trying this:
> 
> # mount.nfs -v -o vers=4.0,proto=rdma,port=12345,sec=sys nfs-
> roce:/export/tmp /mnt

I killed the mount attempt with "Ctrl-C" and performed the rmmod from the same
window

> 
> and while it is looping trying to connect, I do this in another
> window:
> 
> # rmmod mlx4_en
> 
> On it's next attempt to connect, the connect worker fails with
> RDMA_CM_EVENT_ADDR_ERROR. I don't ever see
> RDMA_CM_EVENT_DEVICE_REMOVAL.
> 
> 
> >> Thanks,
> >> Michal
> >>>>
> >>>>
> >>>>> On Mar 14, 2018, at 10:42 AM, Chuck Lever <chuck.lever@xxxxxxxxxx>
> >>>> wrote:
> >>>>>
> >>>>> Michal Kalderon has found some corner cases around device unload
> >>>>> with active NFS mounts that I didn't have the imagination to test
> >>>>> when xprtrdma device removal was added last year.
> >>>>>
> >>>>> - The ULP device removal handler is responsible for deallocating
> >>>>> the PD. That wasn't clear to me initially, and my own testing
> >>>>> suggested it was not necessary, but that is incorrect.
> >>>>>
> >>>>> - The transport destruction path can no longer assume that there
> >>>>> is a valid ID.
> >>>>>
> >>>>> - When destroying a transport, ensure that ib_free_cq() is not
> >>>>> invoked on a CQ that was already released.
> >>>>>
> >>>>> Reported-by: Michal Kalderon <Michal.Kalderon@xxxxxxxxxx>
> >>>>> Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA from
> >>>>> ...")
> >>>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> >>>>> Cc: stable@xxxxxxxxxxxxxxx
> >>>>> ---
> >>>>> net/sunrpc/xprtrdma/verbs.c |   13 +++++++++----
> >>>>> 1 file changed, 9 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/net/sunrpc/xprtrdma/verbs.c
> >>>>> b/net/sunrpc/xprtrdma/verbs.c index d19ea02..c284ee7 100644
> >>>>> --- a/net/sunrpc/xprtrdma/verbs.c
> >>>>> +++ b/net/sunrpc/xprtrdma/verbs.c
> >>>>> @@ -251,7 +251,6 @@
> >>>>> 		wait_for_completion(&ia->ri_remove_done);
> >>>>>
> >>>>> 		ia->ri_id = NULL;
> >>>>> -		ia->ri_pd = NULL;
> >>>>> 		ia->ri_device = NULL;
> >>>>> 		/* Return 1 to ensure the core destroys the id. */
> >>>>> 		return 1;
> >>>>> @@ -449,7 +448,9 @@
> >>>>> 		ia->ri_id->qp = NULL;
> >>>>> 	}
> >>>>> 	ib_free_cq(ep->rep_attr.recv_cq);
> >>>>> +	ep->rep_attr.recv_cq = NULL;
> >>>>> 	ib_free_cq(ep->rep_attr.send_cq);
> >>>>> +	ep->rep_attr.send_cq = NULL;
> >>>>>
> >>>>> 	/* The ULP is responsible for ensuring all DMA
> >>>>> 	 * mappings and MRs are gone.
> >>>>> @@ -462,6 +463,8 @@
> >>>>> 		rpcrdma_dma_unmap_regbuf(req->rl_recvbuf);
> >>>>> 	}
> >>>>> 	rpcrdma_mrs_destroy(buf);
> >>>>> +	ib_dealloc_pd(ia->ri_pd);
> >>>>> +	ia->ri_pd = NULL;
> >>>>>
> >>>>> 	/* Allow waiters to continue */
> >>>>> 	complete(&ia->ri_remove_done);
> >>>>> @@ -629,14 +632,16 @@
> >>>>> {
> >>>>> 	cancel_delayed_work_sync(&ep->rep_connect_worker);
> >>>>>
> >>>>> -	if (ia->ri_id->qp) {
> >>>>> +	if (ia->ri_id && ia->ri_id->qp) {
> >>>>> 		rpcrdma_ep_disconnect(ep, ia);
> >>>>> 		rdma_destroy_qp(ia->ri_id);
> >>>>> 		ia->ri_id->qp = NULL;
> >>>>> 	}
> >>>>>
> >>>>> -	ib_free_cq(ep->rep_attr.recv_cq);
> >>>>> -	ib_free_cq(ep->rep_attr.send_cq);
> >>>>> +	if (ep->rep_attr.recv_cq)
> >>>>> +		ib_free_cq(ep->rep_attr.recv_cq);
> >>>>> +	if (ep->rep_attr.send_cq)
> >>>>> +		ib_free_cq(ep->rep_attr.send_cq);
> >>>>> }
> >>>>>
> >>>>> /* Re-establish a connection after a device removal event.
> >>>>>
> >>>>> --
> >>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma"
> >>>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More
> >>>> majordomo
> >>>>> info at  http://vger.kernel.org/majordomo-info.html
> >>>>
> >>>> --
> >>>> Chuck Lever
> >>>>
> >>>>
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-rdma"
> >> in the body of a message to majordomo@xxxxxxxxxxxxxxx More
> majordomo
> >> info at  http://vger.kernel.org/majordomo-info.html
> >
> > --
> > Chuck Lever
> >
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma"
> > in the body of a message to majordomo@xxxxxxxxxxxxxxx More
> majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> Chuck Lever
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux