On Mon, 2018-12-17 at 14:00 -0500, Chuck Lever wrote: > > On Dec 17, 2018, at 1:55 PM, Trond Myklebust < > > trondmy@xxxxxxxxxxxxxxx> wrote: > > > > On Mon, 2018-12-17 at 13:37 -0500, Chuck Lever wrote: > > > > On Dec 17, 2018, at 12:28 PM, Trond Myklebust < > > > > trondmy@xxxxxxxxxxxxxxx> wrote: > > > > > > > > On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote: > > > > > Transport disconnect processing does a "wake pending tasks" > > > > > at > > > > > various points. > > > > > > > > > > Suppose an RPC Reply is being processed. The RPC task that > > > > > Reply > > > > > goes with is waiting on the pending queue. If a disconnect > > > > > wake- > > > > > up > > > > > happens before reply processing is done, that reply, even if > > > > > it > > > > > is > > > > > good, is thrown away, and the RPC has to be sent again. > > > > > > > > > > This window apparently does not exist for socket transports > > > > > because > > > > > there is a lock held while a reply is being received which > > > > > prevents > > > > > the wake-up call until after reply processing is done. > > > > > > > > > > To resolve this, all RPC replies being processed on an RPC- > > > > > over- > > > > > RDMA > > > > > transport have to complete before pending tasks are awoken > > > > > due to > > > > > a > > > > > transport disconnect. > > > > > > > > > > Callers that already hold the transport write lock may invoke > > > > > ->ops->close directly. Others use a generic helper that > > > > > schedules > > > > > a close when the write lock can be taken safely. > > > > > > > > > > Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx> > > > > > --- > > > > > include/linux/sunrpc/xprt.h | 1 + > > > > > net/sunrpc/xprt.c | 19 > > > > > +++++++++++++++++++ > > > > > net/sunrpc/xprtrdma/backchannel.c | 13 +++++++-- > > > > > ---- > > > > > net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 8 +++++--- > > > > > net/sunrpc/xprtrdma/transport.c | 16 ++++++++++- > > > > > ---- > > > > > - > > > > > net/sunrpc/xprtrdma/verbs.c | 5 ++--- > > > > > 6 files changed, 44 insertions(+), 18 deletions(-) > > > > > > > > > > diff --git a/include/linux/sunrpc/xprt.h > > > > > b/include/linux/sunrpc/xprt.h > > > > > index a4ab4f8..ee94ed0 100644 > > > > > --- a/include/linux/sunrpc/xprt.h > > > > > +++ b/include/linux/sunrpc/xprt.h > > > > > @@ -401,6 +401,7 @@ static inline __be32 > > > > > *xprt_skip_transport_header(struct rpc_xprt *xprt, __be32 * > > > > > bool xprt_request_get_cong(struct rpc_xprt > > > > > *xprt, > > > > > struct rpc_rqst *req); > > > > > void xprt_disconnect_done(struct rpc_xprt > > > > > *xprt); > > > > > void xprt_force_disconnect(struct rpc_xprt > > > > > *xprt); > > > > > +void xprt_disconnect_nowake(struct rpc_xprt > > > > > *xprt); > > > > > void xprt_conditional_disconnect(struct > > > > > rpc_xprt > > > > > *xprt, unsigned int cookie); > > > > > > > > > > bool xprt_lock_connect(struct rpc_xprt *, > > > > > struct > > > > > rpc_task *, void *); > > > > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c > > > > > index ce92700..afe412e 100644 > > > > > --- a/net/sunrpc/xprt.c > > > > > +++ b/net/sunrpc/xprt.c > > > > > @@ -685,6 +685,25 @@ void xprt_force_disconnect(struct > > > > > rpc_xprt > > > > > *xprt) > > > > > } > > > > > EXPORT_SYMBOL_GPL(xprt_force_disconnect); > > > > > > > > > > +/** > > > > > + * xprt_disconnect_nowake - force a call to xprt->ops->close > > > > > + * @xprt: transport to disconnect > > > > > + * > > > > > + * The caller must ensure that xprt_wake_pending_tasks() is > > > > > + * called later. > > > > > + */ > > > > > +void xprt_disconnect_nowake(struct rpc_xprt *xprt) > > > > > +{ > > > > > + /* Don't race with the test_bit() in > > > > > xprt_clear_locked() > > > > > */ > > > > > + spin_lock_bh(&xprt->transport_lock); > > > > > + set_bit(XPRT_CLOSE_WAIT, &xprt->state); > > > > > + /* Try to schedule an autoclose RPC call */ > > > > > + if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0) > > > > > + queue_work(xprtiod_workqueue, &xprt- > > > > > > task_cleanup); > > > > > + spin_unlock_bh(&xprt->transport_lock); > > > > > +} > > > > > +EXPORT_SYMBOL_GPL(xprt_disconnect_nowake); > > > > > + > > > > > > > > We shouldn't need both xprt_disconnect_nowake() and > > > > xprt_force_disconnect() to be exported given that you can build > > > > the > > > > latter from the former + xprt_wake_pending_tasks() (which is > > > > also > > > > already exported). > > > > > > Thanks for your review! > > > > > > I can get rid of xprt_disconnect_nowake. There are some > > > variations, > > > depending on why wake_pending_tasks is protected by xprt- > > > > transport_lock. > > > > I'm having some second thoughts about the patch that Scott sent out > > last week to fix the issue that Dave and he were seeing. I think > > that > > what we really need to do to fix his issue is to call > > xprt_disconnect_done() after we've released the TCP socket. > > > > Given that realisation, I think that we can fix up > > xprt_force_disconnect() to only wake up the task that holds the > > XPRT_LOCKED instead of doing a thundering herd wakeup like we do > > today. > > That should (I think) obviate the need for a separate > > xprt_disconnect_nowake(). > > For RPC-over-RDMA, there really is a dangerous race between the > waking > task(s) and work being done by the deferred RPC completion handler. > IMO > the only safe thing RPC-over-RDMA can do is not wake anything until > the > deferred queue is well and truly drained. The deferred RPC completion handler (and hence the close) cannot execute if another task is holding XPRT_LOCKED, so we do need to wake up that task (and only that one). Note that in the new code, the only reason why a task would be holding XPRT_LOCKED while sleeping is because 1. It is waiting for a connection attempt to complete following a call to xprt_connect(). 2. It is waiting for a write_space event following an attempt to transmit. > As you observed when we last spoke, socket transports are already > protected from this race by the socket lock.... RPC-over-RDMA is > going > to have to be more careful. > > > > A patch is forthcoming later today. I'll make sure you are Cced so > > you > > can comment. > > > > -- > > Trond Myklebust > > Linux NFS client maintainer, Hammerspace > > trond.myklebust@xxxxxxxxxxxxxxx > > -- > Chuck Lever > > > -- Trond Myklebust CTO, Hammerspace Inc 4300 El Camino Real, Suite 105 Los Altos, CA 94022 www.hammer.space