Re: [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> On Dec 17, 2018, at 2:09 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Mon, 2018-12-17 at 14:00 -0500, Chuck Lever wrote:
>>> On Dec 17, 2018, at 1:55 PM, Trond Myklebust <
>>> trondmy@xxxxxxxxxxxxxxx> wrote:
>>> 
>>> On Mon, 2018-12-17 at 13:37 -0500, Chuck Lever wrote:
>>>>> On Dec 17, 2018, at 12:28 PM, Trond Myklebust <
>>>>> trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote:
>>>>>> Transport disconnect processing does a "wake pending tasks"
>>>>>> at
>>>>>> various points.
>>>>>> 
>>>>>> Suppose an RPC Reply is being processed. The RPC task that
>>>>>> Reply
>>>>>> goes with is waiting on the pending queue. If a disconnect
>>>>>> wake-
>>>>>> up
>>>>>> happens before reply processing is done, that reply, even if
>>>>>> it
>>>>>> is
>>>>>> good, is thrown away, and the RPC has to be sent again.
>>>>>> 
>>>>>> This window apparently does not exist for socket transports
>>>>>> because
>>>>>> there is a lock held while a reply is being received which
>>>>>> prevents
>>>>>> the wake-up call until after reply processing is done.
>>>>>> 
>>>>>> To resolve this, all RPC replies being processed on an RPC-
>>>>>> over-
>>>>>> RDMA
>>>>>> transport have to complete before pending tasks are awoken
>>>>>> due to
>>>>>> a
>>>>>> transport disconnect.
>>>>>> 
>>>>>> Callers that already hold the transport write lock may invoke
>>>>>> ->ops->close directly. Others use a generic helper that
>>>>>> schedules
>>>>>> a close when the write lock can be taken safely.
>>>>>> 
>>>>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>>>> ---
>>>>>> include/linux/sunrpc/xprt.h                |    1 +
>>>>>> net/sunrpc/xprt.c                          |   19
>>>>>> +++++++++++++++++++
>>>>>> net/sunrpc/xprtrdma/backchannel.c          |   13 +++++++--
>>>>>> ----
>>>>>> net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    8 +++++---
>>>>>> net/sunrpc/xprtrdma/transport.c            |   16 ++++++++++-
>>>>>> ----
>>>>>> -
>>>>>> net/sunrpc/xprtrdma/verbs.c                |    5 ++---
>>>>>> 6 files changed, 44 insertions(+), 18 deletions(-)
>>>>>> 
>>>>>> diff --git a/include/linux/sunrpc/xprt.h
>>>>>> b/include/linux/sunrpc/xprt.h
>>>>>> index a4ab4f8..ee94ed0 100644
>>>>>> --- a/include/linux/sunrpc/xprt.h
>>>>>> +++ b/include/linux/sunrpc/xprt.h
>>>>>> @@ -401,6 +401,7 @@ static inline __be32
>>>>>> *xprt_skip_transport_header(struct rpc_xprt *xprt, __be32 *
>>>>>> bool			xprt_request_get_cong(struct rpc_xprt
>>>>>> *xprt,
>>>>>> struct rpc_rqst *req);
>>>>>> void			xprt_disconnect_done(struct rpc_xprt
>>>>>> *xprt);
>>>>>> void			xprt_force_disconnect(struct rpc_xprt
>>>>>> *xprt);
>>>>>> +void			xprt_disconnect_nowake(struct rpc_xprt
>>>>>> *xprt);
>>>>>> void			xprt_conditional_disconnect(struct
>>>>>> rpc_xprt
>>>>>> *xprt, unsigned int cookie);
>>>>>> 
>>>>>> bool			xprt_lock_connect(struct rpc_xprt *,
>>>>>> struct
>>>>>> rpc_task *, void *);
>>>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>>>> index ce92700..afe412e 100644
>>>>>> --- a/net/sunrpc/xprt.c
>>>>>> +++ b/net/sunrpc/xprt.c
>>>>>> @@ -685,6 +685,25 @@ void xprt_force_disconnect(struct
>>>>>> rpc_xprt
>>>>>> *xprt)
>>>>>> }
>>>>>> EXPORT_SYMBOL_GPL(xprt_force_disconnect);
>>>>>> 
>>>>>> +/**
>>>>>> + * xprt_disconnect_nowake - force a call to xprt->ops->close
>>>>>> + * @xprt: transport to disconnect
>>>>>> + *
>>>>>> + * The caller must ensure that xprt_wake_pending_tasks() is
>>>>>> + * called later.
>>>>>> + */
>>>>>> +void xprt_disconnect_nowake(struct rpc_xprt *xprt)
>>>>>> +{
>>>>>> +       /* Don't race with the test_bit() in
>>>>>> xprt_clear_locked()
>>>>>> */
>>>>>> +       spin_lock_bh(&xprt->transport_lock);
>>>>>> +       set_bit(XPRT_CLOSE_WAIT, &xprt->state);
>>>>>> +       /* Try to schedule an autoclose RPC call */
>>>>>> +       if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
>>>>>> +               queue_work(xprtiod_workqueue, &xprt-
>>>>>>> task_cleanup);
>>>>>> +       spin_unlock_bh(&xprt->transport_lock);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(xprt_disconnect_nowake);
>>>>>> +
>>>>> 
>>>>> We shouldn't need both xprt_disconnect_nowake() and
>>>>> xprt_force_disconnect() to be exported given that you can build
>>>>> the
>>>>> latter from the former + xprt_wake_pending_tasks() (which is
>>>>> also
>>>>> already exported).
>>>> 
>>>> Thanks for your review!
>>>> 
>>>> I can get rid of xprt_disconnect_nowake. There are some
>>>> variations,
>>>> depending on why wake_pending_tasks is protected by xprt-
>>>>> transport_lock.
>>> 
>>> I'm having some second thoughts about the patch that Scott sent out
>>> last week to fix the issue that Dave and he were seeing. I think
>>> that
>>> what we really need to do to fix his issue is to call
>>> xprt_disconnect_done() after we've released the TCP socket.
>>> 
>>> Given that realisation, I think that we can fix up
>>> xprt_force_disconnect() to only wake up the task that holds the
>>> XPRT_LOCKED instead of doing a thundering herd wakeup like we do
>>> today.
>>> That should (I think) obviate the need for a separate
>>> xprt_disconnect_nowake().
>> 
>> For RPC-over-RDMA, there really is a dangerous race between the
>> waking
>> task(s) and work being done by the deferred RPC completion handler.
>> IMO
>> the only safe thing RPC-over-RDMA can do is not wake anything until
>> the
>> deferred queue is well and truly drained.
> 
> The deferred RPC completion handler (and hence the close) cannot
> execute if another task is holding XPRT_LOCKED,

Just to be certain we are speaking of the same thing,
rpcrdma_deferred_completion is queued by the Receive handler, and
can indeed run independently of an rpc_task. It is always running
outside the purview of XPRT_LOCKED.


> so we do need to wake up that task (and only that one).
> 
> Note that in the new code, the only reason why a task would be holding
> XPRT_LOCKED while sleeping is because
> 
>   1. It is waiting for a connection attempt to complete following a call
>      to xprt_connect().
>   2. It is waiting for a write_space event following an attempt to
>      transmit.

xprt_rdma_close can sleep in rpcrdma_ep_disconnect:

 -> ib_drain_{qp,sq,rq} can all sleep waiting for the last FLUSH

 -> drain_workqueue, added in this patch, can sleep waiting for the
    deferred RPC completion workqueue to drain


>> As you observed when we last spoke, socket transports are already
>> protected from this race by the socket lock.... RPC-over-RDMA is
>> going
>> to have to be more careful.
>> 
>> 
>>> A patch is forthcoming later today. I'll make sure you are Cced so
>>> you
>>> can comment.
>>> 
>>> -- 
>>> Trond Myklebust
>>> Linux NFS client maintainer, Hammerspace
>>> trond.myklebust@xxxxxxxxxxxxxxx
>> 
>> --
>> Chuck Lever
>> 
>> 
>> 
> -- 
> Trond Myklebust
> CTO, Hammerspace Inc
> 4300 El Camino Real, Suite 105
> Los Altos, CA 94022
> www.hammer.space

--
Chuck Lever







[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux