On Mon, 2017-08-14 at 16:23 -0400, Chuck Lever wrote:
> > On Aug 14, 2017, at 4:07 PM, Trond Myklebust <trondmy@primarydata.com> wrote:
> > 
> > On Mon, 2017-08-14 at 15:28 -0400, Chuck Lever wrote:
> > > > On Aug 14, 2017, at 3:16 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > > > 
> > > > Instead add a mechanism to ensure that the request doesn't disappear
> > > > from underneath us while copying from the socket. We do this by
> > > > preventing xprt_release() from freeing the XDR buffers until the
> > > > flag RPC_TASK_MSG_RECV has been cleared from the request.
> > > > 
> > > > Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > ---
> > > >  include/linux/sunrpc/sched.h |  2 ++
> > > >  include/linux/sunrpc/xprt.h  |  2 ++
> > > >  net/sunrpc/xprt.c            | 42 ++++++++++++++++++++++++++++++++++++++++++
> > > >  net/sunrpc/xprtsock.c        | 27 ++++++++++++++++++++++-----
> > > >  4 files changed, 68 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
> > > > index 15bc1cd6ee5c..e972d9e05426 100644
> > > > --- a/include/linux/sunrpc/sched.h
> > > > +++ b/include/linux/sunrpc/sched.h
> > > > @@ -141,6 +141,8 @@ struct rpc_task_setup {
> > > >  #define RPC_TASK_ACTIVE		2
> > > >  #define RPC_TASK_MSG_XMIT	3
> > > >  #define RPC_TASK_MSG_XMIT_WAIT	4
> > > > +#define RPC_TASK_MSG_RECV	5
> > > > +#define RPC_TASK_MSG_RECV_WAIT	6
> > > > 
> > > >  #define RPC_IS_RUNNING(t)	test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
> > > > 
> > > >  #define rpc_set_running(t)	set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
> > > > 
> > > > diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> > > > index eab1c749e192..65b9e0224753 100644
> > > > --- a/include/linux/sunrpc/xprt.h
> > > > +++ b/include/linux/sunrpc/xprt.h
> > > > @@ -372,6 +372,8 @@ void			xprt_write_space(struct rpc_xprt *xprt);
> > > >  void			xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
> > > >  struct rpc_rqst *	xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
> > > >  void			xprt_complete_rqst(struct rpc_task *task, int copied);
> > > > +void			xprt_pin_rqst(struct rpc_rqst *req);
> > > > +void			xprt_unpin_rqst(struct rpc_rqst *req);
> > > >  void			xprt_release_rqst_cong(struct rpc_task *task);
> > > >  void			xprt_disconnect_done(struct rpc_xprt *xprt);
> > > >  void			xprt_force_disconnect(struct rpc_xprt *xprt);
> > > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > > > index 788c1b6138c2..62c379865f7c 100644
> > > > --- a/net/sunrpc/xprt.c
> > > > +++ b/net/sunrpc/xprt.c
> > > > @@ -844,6 +844,47 @@ struct rpc_rqst *xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(xprt_lookup_rqst);
> > > > 
> > > > +/**
> > > > + * xprt_pin_rqst - Pin a request on the transport receive list
> > > > + * @req: Request to pin
> > > > + *
> > > > + * Caller must ensure this is atomic with the call to xprt_lookup_rqst()
> > > > + * so should be holding the xprt transport lock.
> > > > + */
> > > > +void xprt_pin_rqst(struct rpc_rqst *req)
> > > > +{
> > > > +	set_bit(RPC_TASK_MSG_RECV, &req->rq_task->tk_runstate);
> > > > +}
> > > > +
> > > > +/**
> > > > + * xprt_unpin_rqst - Unpin a request on the transport receive list
> > > > + * @req: Request to unpin
> > > > + *
> > > > + * Caller should be holding the xprt transport lock.
> > > > + */
> > > > +void xprt_unpin_rqst(struct rpc_rqst *req)
> > > > +{
> > > > +	struct rpc_task *task = req->rq_task;
> > > > +
> > > > +	clear_bit(RPC_TASK_MSG_RECV, &task->tk_runstate);
> > > > +	if (test_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate))
> > > > +		wake_up_bit(&task->tk_runstate, RPC_TASK_MSG_RECV);
> > > > +}
> > > > +
> > > > +static void xprt_wait_on_pinned_rqst(struct rpc_rqst *req)
> > > > +__must_hold(&req->rq_xprt->transport_lock)
> > > > +{
> > > > +	struct rpc_task *task = req->rq_task;
> > > > +
> > > > +	if (test_bit(RPC_TASK_MSG_RECV, &task->tk_runstate)) {
> > > > +		spin_unlock_bh(&req->rq_xprt->transport_lock);
> > > > +		set_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate);
> > > > +		wait_on_bit(&task->tk_runstate, RPC_TASK_MSG_RECV,
> > > > +				TASK_UNINTERRUPTIBLE);
> > > > +		clear_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate);
> > > > +		spin_lock_bh(&req->rq_xprt->transport_lock);
> > > > +	}
> > > > +}
> > > > +
> > > >  static void xprt_update_rtt(struct rpc_task *task)
> > > >  {
> > > >  	struct rpc_rqst *req = task->tk_rqstp;
> > > > @@ -1301,6 +1342,7 @@ void xprt_release(struct rpc_task *task)
> > > >  		list_del(&req->rq_list);
> > > >  		xprt->last_used = jiffies;
> > > >  		xprt_schedule_autodisconnect(xprt);
> > > > +		xprt_wait_on_pinned_rqst(req);
> > > >  		spin_unlock_bh(&xprt->transport_lock);
> > > 
> > > Is it OK to call wait_on_bit(TASK_UNINTERRUPTIBLE) while holding
> > > a BH spin lock? This could be prone to deadlock.
> > 
> > We drop the lock inside xprt_wait_on_pinned_rqst() if we need to
> > wait.
> > 
> > The reason why we want to hold the lock there is to avoid a use-
> > after-free in xprt_unpin_rqst(). Without the lock, there is a risk
> > that xprt_release() could complete, and the task could get freed
> > before we have completed wake_up_bit().
> 
> Agree with the last bit. I had that challenge with the order of
> memory invalidation (which waits) and xprt_lookup_rqst/complete_rqst
> in rpcrdma_reply_handler.
> 
> However my concern is that there are many other users of the
> transport_lock. Does it become more tricky not to introduce a
> problem by changing any one of the sites that take transport_lock?

What is your concern? The main reason for taking the transport lock
in the current code is twofold:

1) Perform the lookup of the request on the receive list
2) Ensure the request doesn't disappear while we're writing to it.

(A sketch of how the pin/unpin helpers split those two jobs follows
below.)

> Despite the deadlock concern, I don't think it makes sense to call
> wait_on_bit while holding a spin lock simply because spin locks are
> supposed to be for things that are quick, like updating a data
> structure.
> 
> wait_on_bit can take microseconds: prepare_to_wait and finish_wait
> can both take an irqsave spin_lock, for example. bit_wait is the
> action that is done if the bit is not ready, and that calls
> schedule(). On a busy system, that kind of context switch can take
> dozens of microseconds.

Agreed, but we should only very rarely be hitting the wait case. In
all successful RPC calls, the receive will have happened long before
we call xprt_release().
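To make the "twofold" point above concrete, the intended receive-side
pattern would look roughly like the sketch below. The xprtsock.c hunks
aren't quoted in this thread, so the function name and the
copy_from_socket() helper are stand-ins for illustration, not the
actual code, and the unconditional complete assumes the whole reply
arrived in one pass:

static void xs_data_ready_sketch(struct rpc_xprt *xprt, __be32 xid)
{
	struct rpc_rqst *req;
	int copied;

	/* Job (1): the lookup and the pin must be atomic with each
	 * other, so both happen under the transport lock. */
	spin_lock_bh(&xprt->transport_lock);
	req = xprt_lookup_rqst(xprt, xid);
	if (!req) {
		spin_unlock_bh(&xprt->transport_lock);
		return;
	}
	xprt_pin_rqst(req);
	spin_unlock_bh(&xprt->transport_lock);

	/* Job (2) is now handled by the pin rather than the lock:
	 * xprt_release() cannot free the XDR buffers while the
	 * RPC_TASK_MSG_RECV bit is set, so the copy can run unlocked.
	 * copy_from_socket() is a placeholder for the real copy. */
	copied = copy_from_socket(req);

	spin_lock_bh(&xprt->transport_lock);
	xprt_complete_rqst(req->rq_task, copied);
	xprt_unpin_rqst(req);
	spin_unlock_bh(&xprt->transport_lock);
}

The point being that the lock is only held for the short list/state
updates, while the socket copy itself proceeds with only the pin held.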
In fact, the only 2 cases where we can actually hit that wait are
when the RPC call itself is interrupted. Either:

1) The task is synchronous and was interrupted by the user.
2) The task is terminating early due to a soft timeout.

In both those cases, we're in a non-performance path. Furthermore,
they would currently end up spinning against the transport lock.

> IMO it would be a lot nicer if we had an FSM state available that
> could allow an RPC task another sleep during xprt_release.

Why would that be preferable, if the argument holds that this only
happens in these 2 rare error cases?

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx