Re: [PATCH v1] xprtrdma: Fix corner cases when handling device removal

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> On Mar 19, 2018, at 1:26 PM, Kalderon, Michal <Michal.Kalderon@xxxxxxxxxx> wrote:
> 
>> From: Kalderon, Michal
>> Sent: Monday, March 19, 2018 7:22 PM
>> To: 'Chuck Lever' <chuck.lever@xxxxxxxxxx>
>> Cc: linux-rdma@xxxxxxxxxxxxxxx
>> Subject: RE: [PATCH v1] xprtrdma: Fix corner cases when handling device
>> removal
>> 
>>> From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
>>> Sent: Monday, March 19, 2018 4:59 PM
>>> To: Kalderon, Michal <Michal.Kalderon@xxxxxxxxxx>
>>> Cc: linux-rdma@xxxxxxxxxxxxxxx
>>> Subject: Re: [PATCH v1] xprtrdma: Fix corner cases when handling
>>> device removal
>>> 
>>> Hi Michal-
>>> 
>>> Have you tried this one? If we can nail this down now, I can easily
>>> get it into the v4.17 merge window.
>> 
>> Yes, looks good thanks.
> 
> Chuck, while testing changes, I ran into a different issue though. 
> If I perform mount to a server that does not have the nfs rdma port in portlist using mount -o rdma,
> And then  perform rmmod qedr, it hangs with following stack : 
> 
> root@lb-tlvb-pcie37 ~]# cat /proc/5981/stack 
> [<0>] rpcrdma_conn_upcall+0x26a/0x310 [rpcrdma]
> [<0>] cma_remove_one+0x26d/0x2a0 [rdma_cm]
> [<0>] ib_unregister_device+0xcc/0x180 [ib_core]
> [<0>] qedr_remove+0x37/0x80 [qedr]
> [<0>] qede_rdma_unregister_driver+0x4b/0x90 [qede]
> [<0>] SyS_delete_module+0x1c2/0x240
> [<0>] do_syscall_64+0x6e/0x190
> [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> 
> (wait_for_completion(&ia->ri_remove_done)

I will try to reproduce it here.


> Thanks,
> Michal
>>> 
>>> 
>>>> On Mar 14, 2018, at 10:42 AM, Chuck Lever <chuck.lever@xxxxxxxxxx>
>>> wrote:
>>>> 
>>>> Michal Kalderon has found some corner cases around device unload
>>>> with active NFS mounts that I didn't have the imagination to test
>>>> when xprtrdma device removal was added last year.
>>>> 
>>>> - The ULP device removal handler is responsible for deallocating
>>>> the PD. That wasn't clear to me initially, and my own testing
>>>> suggested it was not necessary, but that is incorrect.
>>>> 
>>>> - The transport destruction path can no longer assume that there  is
>>>> a valid ID.
>>>> 
>>>> - When destroying a transport, ensure that ib_free_cq() is not
>>>> invoked on a CQ that was already released.
>>>> 
>>>> Reported-by: Michal Kalderon <Michal.Kalderon@xxxxxxxxxx>
>>>> Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA from ...")
>>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>> Cc: stable@xxxxxxxxxxxxxxx
>>>> ---
>>>> net/sunrpc/xprtrdma/verbs.c |   13 +++++++++----
>>>> 1 file changed, 9 insertions(+), 4 deletions(-)
>>>> 
>>>> diff --git a/net/sunrpc/xprtrdma/verbs.c
>>>> b/net/sunrpc/xprtrdma/verbs.c index d19ea02..c284ee7 100644
>>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>>> @@ -251,7 +251,6 @@
>>>> 		wait_for_completion(&ia->ri_remove_done);
>>>> 
>>>> 		ia->ri_id = NULL;
>>>> -		ia->ri_pd = NULL;
>>>> 		ia->ri_device = NULL;
>>>> 		/* Return 1 to ensure the core destroys the id. */
>>>> 		return 1;
>>>> @@ -449,7 +448,9 @@
>>>> 		ia->ri_id->qp = NULL;
>>>> 	}
>>>> 	ib_free_cq(ep->rep_attr.recv_cq);
>>>> +	ep->rep_attr.recv_cq = NULL;
>>>> 	ib_free_cq(ep->rep_attr.send_cq);
>>>> +	ep->rep_attr.send_cq = NULL;
>>>> 
>>>> 	/* The ULP is responsible for ensuring all DMA
>>>> 	 * mappings and MRs are gone.
>>>> @@ -462,6 +463,8 @@
>>>> 		rpcrdma_dma_unmap_regbuf(req->rl_recvbuf);
>>>> 	}
>>>> 	rpcrdma_mrs_destroy(buf);
>>>> +	ib_dealloc_pd(ia->ri_pd);
>>>> +	ia->ri_pd = NULL;
>>>> 
>>>> 	/* Allow waiters to continue */
>>>> 	complete(&ia->ri_remove_done);
>>>> @@ -629,14 +632,16 @@
>>>> {
>>>> 	cancel_delayed_work_sync(&ep->rep_connect_worker);
>>>> 
>>>> -	if (ia->ri_id->qp) {
>>>> +	if (ia->ri_id && ia->ri_id->qp) {
>>>> 		rpcrdma_ep_disconnect(ep, ia);
>>>> 		rdma_destroy_qp(ia->ri_id);
>>>> 		ia->ri_id->qp = NULL;
>>>> 	}
>>>> 
>>>> -	ib_free_cq(ep->rep_attr.recv_cq);
>>>> -	ib_free_cq(ep->rep_attr.send_cq);
>>>> +	if (ep->rep_attr.recv_cq)
>>>> +		ib_free_cq(ep->rep_attr.recv_cq);
>>>> +	if (ep->rep_attr.send_cq)
>>>> +		ib_free_cq(ep->rep_attr.send_cq);
>>>> }
>>>> 
>>>> /* Re-establish a connection after a device removal event.
>>>> 
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma"
>>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More
>>> majordomo
>>>> info at  http://vger.kernel.org/majordomo-info.html
>>> 
>>> --
>>> Chuck Lever
>>> 
>>> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux