> On Dec 17, 2018, at 2:09 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Mon, 2018-12-17 at 14:00 -0500, Chuck Lever wrote:
>>> On Dec 17, 2018, at 1:55 PM, Trond Myklebust <
>>> trondmy@xxxxxxxxxxxxxxx> wrote:
>>> 
>>> On Mon, 2018-12-17 at 13:37 -0500, Chuck Lever wrote:
>>>>> On Dec 17, 2018, at 12:28 PM, Trond Myklebust <
>>>>> trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote:
>>>>>> Transport disconnect processing does a "wake pending tasks" at
>>>>>> various points.
>>>>>> 
>>>>>> Suppose an RPC Reply is being processed. The RPC task that Reply
>>>>>> goes with is waiting on the pending queue. If a disconnect
>>>>>> wake-up happens before reply processing is done, that reply,
>>>>>> even if it is good, is thrown away, and the RPC has to be sent
>>>>>> again.
>>>>>> 
>>>>>> This window apparently does not exist for socket transports
>>>>>> because there is a lock held while a reply is being received
>>>>>> which prevents the wake-up call until after reply processing is
>>>>>> done.
>>>>>> 
>>>>>> To resolve this, all RPC replies being processed on an
>>>>>> RPC-over-RDMA transport have to complete before pending tasks
>>>>>> are awoken due to a transport disconnect.
>>>>>> 
>>>>>> Callers that already hold the transport write lock may invoke
>>>>>> ->ops->close directly. Others use a generic helper that
>>>>>> schedules a close when the write lock can be taken safely.
>>>>>> 
>>>>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>>>> ---
>>>>>>  include/linux/sunrpc/xprt.h                |    1 +
>>>>>>  net/sunrpc/xprt.c                          |   19 +++++++++++++++++++
>>>>>>  net/sunrpc/xprtrdma/backchannel.c          |   13 +++++++------
>>>>>>  net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    8 +++++---
>>>>>>  net/sunrpc/xprtrdma/transport.c            |   16 ++++++++++------
>>>>>>  net/sunrpc/xprtrdma/verbs.c                |    5 ++---
>>>>>>  6 files changed, 44 insertions(+), 18 deletions(-)
>>>>>> 
>>>>>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>>>>>> index a4ab4f8..ee94ed0 100644
>>>>>> --- a/include/linux/sunrpc/xprt.h
>>>>>> +++ b/include/linux/sunrpc/xprt.h
>>>>>> @@ -401,6 +401,7 @@ static inline __be32 *xprt_skip_transport_header(struct rpc_xprt *xprt, __be32 *
>>>>>>  bool			xprt_request_get_cong(struct rpc_xprt *xprt, struct rpc_rqst *req);
>>>>>>  void			xprt_disconnect_done(struct rpc_xprt *xprt);
>>>>>>  void			xprt_force_disconnect(struct rpc_xprt *xprt);
>>>>>> +void			xprt_disconnect_nowake(struct rpc_xprt *xprt);
>>>>>>  void			xprt_conditional_disconnect(struct rpc_xprt *xprt, unsigned int cookie);
>>>>>> 
>>>>>>  bool			xprt_lock_connect(struct rpc_xprt *, struct rpc_task *, void *);
>>>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>>>> index ce92700..afe412e 100644
>>>>>> --- a/net/sunrpc/xprt.c
>>>>>> +++ b/net/sunrpc/xprt.c
>>>>>> @@ -685,6 +685,25 @@ void xprt_force_disconnect(struct rpc_xprt *xprt)
>>>>>>  }
>>>>>>  EXPORT_SYMBOL_GPL(xprt_force_disconnect);
>>>>>> 
>>>>>> +/**
>>>>>> + * xprt_disconnect_nowake - force a call to xprt->ops->close
>>>>>> + * @xprt: transport to disconnect
>>>>>> + *
>>>>>> + * The caller must ensure that xprt_wake_pending_tasks() is
>>>>>> + * called later.
>>>>>> + */
>>>>>> +void xprt_disconnect_nowake(struct rpc_xprt *xprt)
>>>>>> +{
>>>>>> +	/* Don't race with the test_bit() in xprt_clear_locked() */
>>>>>> +	spin_lock_bh(&xprt->transport_lock);
>>>>>> +	set_bit(XPRT_CLOSE_WAIT, &xprt->state);
>>>>>> +	/* Try to schedule an autoclose RPC call */
>>>>>> +	if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
>>>>>> +		queue_work(xprtiod_workqueue, &xprt->task_cleanup);
>>>>>> +	spin_unlock_bh(&xprt->transport_lock);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(xprt_disconnect_nowake);
>>>>>> +
>>>>> 
>>>>> We shouldn't need both xprt_disconnect_nowake() and
>>>>> xprt_force_disconnect() to be exported given that you can build
>>>>> the latter from the former + xprt_wake_pending_tasks() (which is
>>>>> also already exported).
>>>> 
>>>> Thanks for your review!
>>>> 
>>>> I can get rid of xprt_disconnect_nowake. There are some
>>>> variations, depending on why wake_pending_tasks is protected by
>>>> xprt->transport_lock.
>>> 
>>> I'm having some second thoughts about the patch that Scott sent
>>> out last week to fix the issue that Dave and he were seeing. I
>>> think that what we really need to do to fix his issue is to call
>>> xprt_disconnect_done() after we've released the TCP socket.
>>> 
>>> Given that realisation, I think that we can fix up
>>> xprt_force_disconnect() to only wake up the task that holds the
>>> XPRT_LOCKED instead of doing a thundering herd wakeup like we do
>>> today. That should (I think) obviate the need for a separate
>>> xprt_disconnect_nowake().
>> 
>> For RPC-over-RDMA, there really is a dangerous race between the
>> waking task(s) and work being done by the deferred RPC completion
>> handler. IMO the only safe thing RPC-over-RDMA can do is not wake
>> anything until the deferred queue is well and truly drained.
> 
> The deferred RPC completion handler (and hence the close) cannot
> execute if another task is holding XPRT_LOCKED,

Just to be certain we are speaking of the same thing:
rpcrdma_deferred_completion is queued by the Receive handler, and can
indeed run independently of an rpc_task. It is always running outside
the purview of XPRT_LOCKED.

> so we do need to wake up that task (and only that one).
> 
> Note that in the new code, the only reason why a task would be
> holding XPRT_LOCKED while sleeping is because
> 
> 1. It is waiting for a connection attempt to complete following a
>    call to xprt_connect().
> 2. It is waiting for a write_space event following an attempt to
>    transmit.

xprt_rdma_close can sleep in rpcrdma_ep_disconnect:

-> ib_drain_{qp,sq,rq} can all sleep waiting for the last FLUSH
-> drain_workqueue, added in this patch, can sleep waiting for the
   deferred RPC completion workqueue to drain
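To make the ordering concrete, the close path ends up shaped roughly
like this. This is only a sketch, not the actual transport.c hunk from
this patch; in particular, rpcrdma_ep_disconnect_sketch and
rpcrdma_receive_wq are illustrative stand-in names, not necessarily
the real xprtrdma identifiers.

/* Sketch of the ordering xprt_rdma_close needs; the function and
 * workqueue names below are stand-ins.
 */
static void rpcrdma_ep_disconnect_sketch(struct rpcrdma_xprt *r_xprt)
{
	struct rdma_cm_id *id = r_xprt->rx_ia.ri_id;

	rdma_disconnect(id);		/* begin connection teardown */

	/* Sleeps until the last FLUSH completion has been handled,
	 * so no further Receive can queue deferred completion work.
	 */
	ib_drain_qp(id->qp);

	/* Sleeps until every queued rpcrdma_deferred_completion has
	 * run, so every Reply being processed is truly complete.
	 */
	drain_workqueue(rpcrdma_receive_wq);	/* stand-in name */

	/* Only now is it safe for the caller to wake pending tasks. */
}

If a disconnect wake-up fires anywhere before that final drain
returns, we are back in the window described in the patch description:
a good Reply is thrown away and the RPC has to be sent again.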
>> As you observed when we last spoke, socket transports are already
>> protected from this race by the socket lock.... RPC-over-RDMA is
>> going to have to be more careful.
>> 
>> 
>>> A patch is forthcoming later today. I'll make sure you are Cced so
>>> you can comment.
>>> 
>>> -- 
>>> Trond Myklebust
>>> Linux NFS client maintainer, Hammerspace
>>> trond.myklebust@xxxxxxxxxxxxxxx
>> 
>> --
>> Chuck Lever
>> 
> -- 
> Trond Myklebust
> CTO, Hammerspace Inc
> 4300 El Camino Real, Suite 105
> Los Altos, CA 94022
> www.hammer.space

--
Chuck Lever