Re: [PATCH v1] xprtrdma: Fix corner cases when handling device removal

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> On Mar 22, 2018, at 10:04 AM, Kalderon, Michal <Michal.Kalderon@xxxxxxxxxx> wrote:
> 
>> From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
>> Sent: Tuesday, March 20, 2018 4:15 PM
>> To: Kalderon, Michal <Michal.Kalderon@xxxxxxxxxx>
>> Cc: linux-rdma@xxxxxxxxxxxxxxx
>> Subject: Re: [PATCH v1] xprtrdma: Fix corner cases when handling device
>> removal
>> 
>> 
>> 
>>> On Mar 20, 2018, at 3:39 AM, Kalderon, Michal
>> <Michal.Kalderon@xxxxxxxxxx> wrote:
>>> 
>>>> From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
>>>> Sent: Tuesday, March 20, 2018 2:40 AM
>>>>> On Mar 19, 2018, at 2:06 PM, Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>> wrote:
>>>>>> On Mar 19, 2018, at 1:26 PM, Kalderon, Michal
>>>> <Michal.Kalderon@xxxxxxxxxx> wrote:
>>>>>> 
>>>>>>> From: Kalderon, Michal
>>>>>>> Sent: Monday, March 19, 2018 7:22 PM
>>>>>>> To: 'Chuck Lever' <chuck.lever@xxxxxxxxxx>
>>>>>>> Cc: linux-rdma@xxxxxxxxxxxxxxx
>>>>>>> Subject: RE: [PATCH v1] xprtrdma: Fix corner cases when handling
>>>>>>> device removal
>>>>>>> 
>>>>>>>> From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
>>>>>>>> Sent: Monday, March 19, 2018 4:59 PM
>>>>>>>> To: Kalderon, Michal <Michal.Kalderon@xxxxxxxxxx>
>>>>>>>> Cc: linux-rdma@xxxxxxxxxxxxxxx
>>>>>>>> Subject: Re: [PATCH v1] xprtrdma: Fix corner cases when handling
>>>>>>>> device removal
>>>>>>>> 
>>>>>>>> Hi Michal-
>>>>>>>> 
>>>>>>>> Have you tried this one? If we can nail this down now, I can
>>>>>>>> easily get it into the v4.17 merge window.
>>>>>>> 
>>>>>>> Yes, looks good thanks.
>>>>>> 
>>>>>> Chuck, while testing changes, I ran into a different issue though.
>>>>>> If I perform mount to a server that does not have the nfs rdma port
>>>>>> in portlist using mount -o rdma, And then  perform rmmod qedr, it
>>>>>> hangs
>>>> with following stack :
>>>>>> 
>>>>>> root@lb-tlvb-pcie37 ~]# cat /proc/5981/stack [<0>]
>>>>>> rpcrdma_conn_upcall+0x26a/0x310 [rpcrdma] [<0>]
>>>>>> cma_remove_one+0x26d/0x2a0 [rdma_cm] [<0>]
>>>>>> ib_unregister_device+0xcc/0x180 [ib_core] [<0>]
>>>> qedr_remove+0x37/0x80
>>>>>> [qedr] [<0>] qede_rdma_unregister_driver+0x4b/0x90 [qede] [<0>]
>>>>>> SyS_delete_module+0x1c2/0x240 [<0>] do_syscall_64+0x6e/0x190
>> [<0>]
>>>>>> entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>>>>> 
>>>>>> (wait_for_completion(&ia->ri_remove_done)
>>>>> 
>>>>> I will try to reproduce it here.
>>>> 
>>>> I can't reproduce with my IB or my RoCE rig, but I might not have
>>>> understood how you are getting into this state. I'm trying this:
>>>> 
>>>> # mount.nfs -v -o vers=4.0,proto=rdma,port=12345,sec=sys nfs-
>>>> roce:/export/tmp /mnt
>>> 
>>> I killed the mount attempt with "Ctrl-C" and performed the rmmod from
>>> the same window
>> 
>> [root@manet ~]# mount.nfs -v -o vers=4.0,proto=rdma,port=12345,sec=sys
>> nfs-roce:/export/tmp /mnt
>> mount.nfs: timeout set for Tue Mar 20 10:12:03 2018
>> mount.nfs: trying text-based options
>> 'vers=4.0,proto=rdma,port=12345,sec=sys,addr=192.168.3.9,clientaddr=192.1
>> 68.3.2'
>> mount.nfs: mount(2): Connection refused
>> mount.nfs: trying text-based options
>> 'vers=4.0,proto=rdma,port=12345,sec=sys,addr=192.168.3.9,clientaddr=192.1
>> 68.3.2'
>> mount.nfs: mount(2): Connection refused
>> mount.nfs: trying text-based options
>> 'vers=4.0,proto=rdma,port=12345,sec=sys,addr=192.168.3.9,clientaddr=192.1
>> 68.3.2'
>> mount.nfs: mount(2): Connection refused
>> mount.nfs: trying text-based options
>> 'vers=4.0,proto=rdma,port=12345,sec=sys,addr=192.168.3.9,clientaddr=192.1
>> 68.3.2'
>> mount.nfs: mount(2): Connection refused
>> ^C
>> [root@manet ~]# rmmod mlx4_en
>> [root@manet ~]# modprobe mlx4_en; ifup ens6 [root@manet ~]#
>> 
>> No hang.
>> 
>> Can you try this:
>> 
>> # trace-cmd record -e sunrpc -e rpcrdma
> Haven't tried this yet, but just wanted to mention that I do ^C before getting connection refused

Fyi, I get the first "Connection refused" immediately.


> thanks
>> 
>> Let that run while you reproduce. Afterwards ^C this command and send me
>> the resulting trace.dat, off-list.
>> 
>> 
>>>> and while it is looping trying to connect, I do this in another
>>>> window:
>>>> 
>>>> # rmmod mlx4_en
>>>> 
>>>> On it's next attempt to connect, the connect worker fails with
>>>> RDMA_CM_EVENT_ADDR_ERROR. I don't ever see
>>>> RDMA_CM_EVENT_DEVICE_REMOVAL.
>>>> 
>>>> 
>>>>>> Thanks,
>>>>>> Michal
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Mar 14, 2018, at 10:42 AM, Chuck Lever
>>>>>>>>> <chuck.lever@xxxxxxxxxx>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Michal Kalderon has found some corner cases around device
>> unload
>>>>>>>>> with active NFS mounts that I didn't have the imagination to
>>>>>>>>> test when xprtrdma device removal was added last year.
>>>>>>>>> 
>>>>>>>>> - The ULP device removal handler is responsible for deallocating
>>>>>>>>> the PD. That wasn't clear to me initially, and my own testing
>>>>>>>>> suggested it was not necessary, but that is incorrect.
>>>>>>>>> 
>>>>>>>>> - The transport destruction path can no longer assume that there
>>>>>>>>> is a valid ID.
>>>>>>>>> 
>>>>>>>>> - When destroying a transport, ensure that ib_free_cq() is not
>>>>>>>>> invoked on a CQ that was already released.
>>>>>>>>> 
>>>>>>>>> Reported-by: Michal Kalderon <Michal.Kalderon@xxxxxxxxxx>
>>>>>>>>> Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA from
>>>>>>>>> ...")
>>>>>>>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>>>>>>> Cc: stable@xxxxxxxxxxxxxxx
>>>>>>>>> ---
>>>>>>>>> net/sunrpc/xprtrdma/verbs.c |   13 +++++++++----
>>>>>>>>> 1 file changed, 9 insertions(+), 4 deletions(-)
>>>>>>>>> 
>>>>>>>>> diff --git a/net/sunrpc/xprtrdma/verbs.c
>>>>>>>>> b/net/sunrpc/xprtrdma/verbs.c index d19ea02..c284ee7 100644
>>>>>>>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>>>>>>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>>>>>>>> @@ -251,7 +251,6 @@
>>>>>>>>> 		wait_for_completion(&ia->ri_remove_done);
>>>>>>>>> 
>>>>>>>>> 		ia->ri_id = NULL;
>>>>>>>>> -		ia->ri_pd = NULL;
>>>>>>>>> 		ia->ri_device = NULL;
>>>>>>>>> 		/* Return 1 to ensure the core destroys the id. */
>>>>>>>>> 		return 1;
>>>>>>>>> @@ -449,7 +448,9 @@
>>>>>>>>> 		ia->ri_id->qp = NULL;
>>>>>>>>> 	}
>>>>>>>>> 	ib_free_cq(ep->rep_attr.recv_cq);
>>>>>>>>> +	ep->rep_attr.recv_cq = NULL;
>>>>>>>>> 	ib_free_cq(ep->rep_attr.send_cq);
>>>>>>>>> +	ep->rep_attr.send_cq = NULL;
>>>>>>>>> 
>>>>>>>>> 	/* The ULP is responsible for ensuring all DMA
>>>>>>>>> 	 * mappings and MRs are gone.
>>>>>>>>> @@ -462,6 +463,8 @@
>>>>>>>>> 		rpcrdma_dma_unmap_regbuf(req->rl_recvbuf);
>>>>>>>>> 	}
>>>>>>>>> 	rpcrdma_mrs_destroy(buf);
>>>>>>>>> +	ib_dealloc_pd(ia->ri_pd);
>>>>>>>>> +	ia->ri_pd = NULL;
>>>>>>>>> 
>>>>>>>>> 	/* Allow waiters to continue */
>>>>>>>>> 	complete(&ia->ri_remove_done);
>>>>>>>>> @@ -629,14 +632,16 @@
>>>>>>>>> {
>>>>>>>>> 	cancel_delayed_work_sync(&ep->rep_connect_worker);
>>>>>>>>> 
>>>>>>>>> -	if (ia->ri_id->qp) {
>>>>>>>>> +	if (ia->ri_id && ia->ri_id->qp) {
>>>>>>>>> 		rpcrdma_ep_disconnect(ep, ia);
>>>>>>>>> 		rdma_destroy_qp(ia->ri_id);
>>>>>>>>> 		ia->ri_id->qp = NULL;
>>>>>>>>> 	}
>>>>>>>>> 
>>>>>>>>> -	ib_free_cq(ep->rep_attr.recv_cq);
>>>>>>>>> -	ib_free_cq(ep->rep_attr.send_cq);
>>>>>>>>> +	if (ep->rep_attr.recv_cq)
>>>>>>>>> +		ib_free_cq(ep->rep_attr.recv_cq);
>>>>>>>>> +	if (ep->rep_attr.send_cq)
>>>>>>>>> +		ib_free_cq(ep->rep_attr.send_cq);
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> /* Re-establish a connection after a device removal event.
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-
>> rdma"
>>>>>>>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More
>>>>>>>> majordomo
>>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Chuck Lever
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma"
>>>>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More
>>>> majordomo
>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>> 
>>>>> --
>>>>> Chuck Lever
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma"
>>>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More
>>>> majordomo
>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>> 
>>>> --
>>>> Chuck Lever
>> 
>> --
>> Chuck Lever
>> 
>> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux