On Mon, 2018-12-17 at 14:19 -0500, Chuck Lever wrote:
> > On Dec 17, 2018, at 2:09 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > 
> > On Mon, 2018-12-17 at 14:00 -0500, Chuck Lever wrote:
> > > > On Dec 17, 2018, at 1:55 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > 
> > > > On Mon, 2018-12-17 at 13:37 -0500, Chuck Lever wrote:
> > > > > > On Dec 17, 2018, at 12:28 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > 
> > > > > > On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote:
> > > > > > > Transport disconnect processing does a "wake pending tasks" at
> > > > > > > various points.
> > > > > > > 
> > > > > > > Suppose an RPC Reply is being processed. The RPC task that Reply
> > > > > > > goes with is waiting on the pending queue. If a disconnect wake-up
> > > > > > > happens before reply processing is done, that reply, even if it is
> > > > > > > good, is thrown away, and the RPC has to be sent again.
> > > > > > > 
> > > > > > > This window apparently does not exist for socket transports because
> > > > > > > a lock is held while a reply is being received, which prevents the
> > > > > > > wake-up call until after reply processing is done.
> > > > > > > 
> > > > > > > To resolve this, all RPC replies being processed on an RPC-over-RDMA
> > > > > > > transport have to complete before pending tasks are awoken due to a
> > > > > > > transport disconnect.
> > > > > > > 
> > > > > > > Callers that already hold the transport write lock may invoke
> > > > > > > ->ops->close directly. Others use a generic helper that schedules
> > > > > > > a close when the write lock can be taken safely.
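For reference, xprt->task_cleanup runs xprt_autoclose(), which does
roughly the following (a simplified sketch, not verbatim kernel code):

static void xprt_autoclose(struct work_struct *work)
{
	struct rpc_xprt *xprt =
		container_of(work, struct rpc_xprt, task_cleanup);

	/* This work item was queued only after test_and_set_bit()
	 * acquired XPRT_LOCKED on its behalf, so it runs as the
	 * holder of the transport write lock.
	 */
	clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
	xprt->ops->close(xprt);
	/* Dropping XPRT_LOCKED here is what lets the next sender in. */
	xprt_release_write(xprt, NULL);
}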
> > > > > > > 
> > > > > > > Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> > > > > > > ---
> > > > > > >  include/linux/sunrpc/xprt.h                |    1 +
> > > > > > >  net/sunrpc/xprt.c                          |   19 +++++++++++++++++++
> > > > > > >  net/sunrpc/xprtrdma/backchannel.c          |   13 +++++++------
> > > > > > >  net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    8 +++++---
> > > > > > >  net/sunrpc/xprtrdma/transport.c            |   16 ++++++++++------
> > > > > > >  net/sunrpc/xprtrdma/verbs.c                |    5 ++---
> > > > > > >  6 files changed, 44 insertions(+), 18 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> > > > > > > index a4ab4f8..ee94ed0 100644
> > > > > > > --- a/include/linux/sunrpc/xprt.h
> > > > > > > +++ b/include/linux/sunrpc/xprt.h
> > > > > > > @@ -401,6 +401,7 @@ static inline __be32 *xprt_skip_transport_header(struct rpc_xprt *xprt, __be32 *
> > > > > > >  bool			xprt_request_get_cong(struct rpc_xprt *xprt,
> > > > > > > 					      struct rpc_rqst *req);
> > > > > > >  void			xprt_disconnect_done(struct rpc_xprt *xprt);
> > > > > > >  void			xprt_force_disconnect(struct rpc_xprt *xprt);
> > > > > > > +void			xprt_disconnect_nowake(struct rpc_xprt *xprt);
> > > > > > >  void			xprt_conditional_disconnect(struct rpc_xprt *xprt, unsigned int cookie);
> > > > > > > 
> > > > > > >  bool			xprt_lock_connect(struct rpc_xprt *, struct rpc_task *, void *);
> > > > > > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > > > > > > index ce92700..afe412e 100644
> > > > > > > --- a/net/sunrpc/xprt.c
> > > > > > > +++ b/net/sunrpc/xprt.c
> > > > > > > @@ -685,6 +685,25 @@ void xprt_force_disconnect(struct rpc_xprt *xprt)
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(xprt_force_disconnect);
> > > > > > > 
> > > > > > > +/**
> > > > > > > + * xprt_disconnect_nowake - force a call to xprt->ops->close
> > > > > > > + * @xprt: transport to disconnect
> > > > > > > + *
> > > > > > > + * The caller must ensure that xprt_wake_pending_tasks() is
> > > > > > > + * called later.
> > > > > > > + */
> > > > > > > +void xprt_disconnect_nowake(struct rpc_xprt *xprt)
> > > > > > > +{
> > > > > > > +	/* Don't race with the test_bit() in xprt_clear_locked() */
> > > > > > > +	spin_lock_bh(&xprt->transport_lock);
> > > > > > > +	set_bit(XPRT_CLOSE_WAIT, &xprt->state);
> > > > > > > +	/* Try to schedule an autoclose RPC call */
> > > > > > > +	if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
> > > > > > > +		queue_work(xprtiod_workqueue, &xprt->task_cleanup);
> > > > > > > +	spin_unlock_bh(&xprt->transport_lock);
> > > > > > > +}
> > > > > > > +EXPORT_SYMBOL_GPL(xprt_disconnect_nowake);
> > > > > > > +
> > > > > > 
> > > > > > We shouldn't need both xprt_disconnect_nowake() and
> > > > > > xprt_force_disconnect() to be exported, given that you can build
> > > > > > the latter from the former + xprt_wake_pending_tasks() (which is
> > > > > > also already exported).
> > > > > 
> > > > > Thanks for your review!
> > > > > 
> > > > > I can get rid of xprt_disconnect_nowake.
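(For concreteness, the composition being suggested would look something
like the sketch below. This is illustrative only: example_force_disconnect
is a made-up name, and note that the in-tree xprt_force_disconnect() does
its wakeup while still holding xprt->transport_lock, which this simple
version does not.)

/* Sketch: rebuild xprt_force_disconnect() from the new helper plus
 * the already-exported xprt_wake_pending_tasks().
 */
static void example_force_disconnect(struct rpc_xprt *xprt)
{
	/* Sets XPRT_CLOSE_WAIT and schedules the autoclose work: */
	xprt_disconnect_nowake(xprt);
	/* Then wake every task waiting for a reply: */
	xprt_wake_pending_tasks(xprt, -EAGAIN);
}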
> > > > > There are some variations, depending on why wake_pending_tasks
> > > > > is protected by xprt->transport_lock.
> > > > 
> > > > I'm having some second thoughts about the patch that Scott sent out
> > > > last week to fix the issue that Dave and he were seeing. I think that
> > > > what we really need to do to fix his issue is to call
> > > > xprt_disconnect_done() after we've released the TCP socket.
> > > > 
> > > > Given that realisation, I think that we can fix up
> > > > xprt_force_disconnect() to only wake up the task that holds
> > > > XPRT_LOCKED, instead of doing a thundering herd wakeup like we do
> > > > today. That should (I think) obviate the need for a separate
> > > > xprt_disconnect_nowake().
> > > 
> > > For RPC-over-RDMA, there really is a dangerous race between the waking
> > > task(s) and work being done by the deferred RPC completion handler. IMO
> > > the only safe thing RPC-over-RDMA can do is not wake anything until the
> > > deferred queue is well and truly drained.
> > 
> > The deferred RPC completion handler (and hence the close) cannot
> > execute if another task is holding XPRT_LOCKED,
> 
> Just to be certain we are speaking of the same thing:
> rpcrdma_deferred_completion is queued by the Receive handler, and
> can indeed run independently of an rpc_task. It is always running
> outside the purview of XPRT_LOCKED.

No. I was thinking of xprt->task_cleanup. It can't execute, and
complete your close, until it holds XPRT_LOCKED.

> > so we do need to wake up that task (and only that one).
> > 
> > Note that in the new code, the only reason why a task would be holding
> > XPRT_LOCKED while sleeping is because
> > 
> > 1. It is waiting for a connection attempt to complete following a call
> >    to xprt_connect().
> > 2. It is waiting for a write_space event following an attempt to
> >    transmit.
> 
> xprt_rdma_close can sleep in rpcrdma_ep_disconnect:
> 
> -> ib_drain_{qp,sq,rq} can all sleep waiting for the last FLUSH
> 
> -> drain_workqueue, added in this patch, can sleep waiting for the
>    deferred RPC completion workqueue to drain
> 
> > > As you observed when we last spoke, socket transports are already
> > > protected from this race by the socket lock... RPC-over-RDMA is going
> > > to have to be more careful.
> > > 
> > > > A patch is forthcoming later today. I'll make sure you are Cced so
> > > > you can comment.
> > > > 
> > > > -- 
> > > > Trond Myklebust
> > > > Linux NFS client maintainer, Hammerspace
> > > > trond.myklebust@xxxxxxxxxxxxxxx
> > > 
> > > -- 
> > > Chuck Lever
> > 
> > -- 
> > Trond Myklebust
> > CTO, Hammerspace Inc
> > 4300 El Camino Real, Suite 105
> > Los Altos, CA 94022
> > www.hammer.space
> 
> -- 
> Chuck Lever

-- 
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space