Re: [PATCH 1/3] SUNRPC: Don't hold the transport lock across socket copy operations

On Mon, 2017-08-14 at 18:25 -0400, Chuck Lever wrote:
> > On Aug 14, 2017, at 4:38 PM, Trond Myklebust <trondmy@primarydata.com> wrote:
> > 
> > On Mon, 2017-08-14 at 16:23 -0400, Chuck Lever wrote:
> > > > On Aug 14, 2017, at 4:07 PM, Trond Myklebust <trondmy@primarydata.com> wrote:
> > > > 
> > > > On Mon, 2017-08-14 at 15:28 -0400, Chuck Lever wrote:
> > > > > > On Aug 14, 2017, at 3:16 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > > > > > 
> > > > > > Instead add a mechanism to ensure that the request doesn't
> > > > > > disappear from underneath us while copying from the socket.
> > > > > > We do this by preventing xprt_release() from freeing the
> > > > > > XDR buffers until the flag RPC_TASK_MSG_RECV has been
> > > > > > cleared from the request.
> > > > > > 
> > > > > > Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
> > > > > > ---
> > > > > > include/linux/sunrpc/sched.h |  2 ++
> > > > > > include/linux/sunrpc/xprt.h  |  2 ++
> > > > > > net/sunrpc/xprt.c            | 42 ++++++++++++++++++++++++++++++++++++++++++
> > > > > > net/sunrpc/xprtsock.c        | 27 ++++++++++++++++++++++-----
> > > > > > 4 files changed, 68 insertions(+), 5 deletions(-)
> > > > > > 
> > > > > > diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
> > > > > > index 15bc1cd6ee5c..e972d9e05426 100644
> > > > > > --- a/include/linux/sunrpc/sched.h
> > > > > > +++ b/include/linux/sunrpc/sched.h
> > > > > > @@ -141,6 +141,8 @@ struct rpc_task_setup {
> > > > > > #define RPC_TASK_ACTIVE		2
> > > > > > #define RPC_TASK_MSG_XMIT	3
> > > > > > #define RPC_TASK_MSG_XMIT_WAIT	4
> > > > > > +#define RPC_TASK_MSG_RECV	5
> > > > > > +#define RPC_TASK_MSG_RECV_WAIT	6
> > > > > > 
> > > > > > #define RPC_IS_RUNNING(t)	test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
> > > > > > #define rpc_set_running(t)	set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
> > > > > > 
> > > > > > diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> > > > > > index eab1c749e192..65b9e0224753 100644
> > > > > > --- a/include/linux/sunrpc/xprt.h
> > > > > > +++ b/include/linux/sunrpc/xprt.h
> > > > > > @@ -372,6 +372,8 @@ void			xprt_write_space(struct rpc_xprt *xprt);
> > > > > > void			xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
> > > > > > struct rpc_rqst *	xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
> > > > > > void			xprt_complete_rqst(struct rpc_task *task, int copied);
> > > > > > +void			xprt_pin_rqst(struct rpc_rqst *req);
> > > > > > +void			xprt_unpin_rqst(struct rpc_rqst *req);
> > > > > > void			xprt_release_rqst_cong(struct rpc_task *task);
> > > > > > void			xprt_disconnect_done(struct rpc_xprt *xprt);
> > > > > > void			xprt_force_disconnect(struct rpc_xprt *xprt);
> > > > > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > > > > > index 788c1b6138c2..62c379865f7c 100644
> > > > > > --- a/net/sunrpc/xprt.c
> > > > > > +++ b/net/sunrpc/xprt.c
> > > > > > @@ -844,6 +844,47 @@ struct rpc_rqst *xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid)
> > > > > > }
> > > > > > EXPORT_SYMBOL_GPL(xprt_lookup_rqst);
> > > > > > 
> > > > > > +/**
> > > > > > + * xprt_pin_rqst - Pin a request on the transport receive list
> > > > > > + * @req: Request to pin
> > > > > > + *
> > > > > > + * Caller must ensure this is atomic with the call to xprt_lookup_rqst()
> > > > > > + * so should be holding the xprt transport lock.
> > > > > > + */
> > > > > > +void xprt_pin_rqst(struct rpc_rqst *req)
> > > > > > +{
> > > > > > +	set_bit(RPC_TASK_MSG_RECV, &req->rq_task->tk_runstate);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * xprt_unpin_rqst - Unpin a request on the transport receive list
> > > > > > + * @req: Request to unpin
> > > > > > + *
> > > > > > + * Caller should be holding the xprt transport lock.
> > > > > > + */
> > > > > > +void xprt_unpin_rqst(struct rpc_rqst *req)
> > > > > > +{
> > > > > > +	struct rpc_task *task = req->rq_task;
> > > > > > +
> > > > > > +	clear_bit(RPC_TASK_MSG_RECV, &task->tk_runstate);
> > > > > > +	if (test_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate))
> > > > > > +		wake_up_bit(&task->tk_runstate, RPC_TASK_MSG_RECV);
> > > > > > +}
> > > > > > +
> > > > > > +static void xprt_wait_on_pinned_rqst(struct rpc_rqst *req)
> > > > > > +__must_hold(&req->rq_xprt->transport_lock)
> > > > > > +{
> > > > > > +	struct rpc_task *task = req->rq_task;
> > > > > > +	if (test_bit(RPC_TASK_MSG_RECV, &task->tk_runstate)) {
> > > > > > +		spin_unlock_bh(&req->rq_xprt->transport_lock);
> > > > > > +		set_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate);
> > > > > > +		wait_on_bit(&task->tk_runstate, RPC_TASK_MSG_RECV,
> > > > > > +				TASK_UNINTERRUPTIBLE);
> > > > > > +		clear_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate);
> > > > > > +		spin_lock_bh(&req->rq_xprt->transport_lock);
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > static void xprt_update_rtt(struct rpc_task *task)
> > > > > > {
> > > > > > 	struct rpc_rqst *req = task->tk_rqstp;
> > > > > > @@ -1301,6 +1342,7 @@ void xprt_release(struct rpc_task *task)
> > > > > > 		list_del(&req->rq_list);
> > > > > > 	xprt->last_used = jiffies;
> > > > > > 	xprt_schedule_autodisconnect(xprt);
> > > > > > +	xprt_wait_on_pinned_rqst(req);
> > > > > > 	spin_unlock_bh(&xprt->transport_lock);
> > > > > 
> > > > > Is it OK to call wait_on_bit(TASK_UNINTERRUPTIBLE) while
> > > > > holding a BH spin lock? This could be prone to deadlock.
> > > > 
> > > > We drop the lock inside xprt_wait_on_pinned_rqst() if we need
> > > > to wait.
> > > > 
> > > > The reason why we want to hold the lock there is to avoid a
> > > > use-after-free in xprt_unpin_rqst(). Without the lock, there is
> > > > a risk that xprt_release() could complete, and the task get
> > > > freed before we have completed wake_up_bit().
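
To make the ordering concrete, the interleaving we're guarding against
looks like this (a hypothetical timeline based on the paragraph above,
not an observed trace):

	xprt_unpin_rqst()                xprt_wait_on_pinned_rqst()
	-----------------                --------------------------
	clear_bit(RPC_TASK_MSG_RECV)
	                                 test_bit(RPC_TASK_MSG_RECV)
	                                 sees the bit clear and returns;
	                                 xprt_release() completes and
	                                 the task is freed
	test_bit(RPC_TASK_MSG_RECV_WAIT,
	         &task->tk_runstate)     <-- reads freed memory

Holding the transport lock across xprt_unpin_rqst() closes that window,
since xprt_release() can't retake the lock and complete until the
wake-up side is done with the task.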
> > > 
> > > Agree with the last bit. I had that challenge with the order of
> > > memory invalidation (which waits) and
> > > xprt_lookup_rqst/complete_rqst
> > > in rpcrdma_reply_handler.
> > > 
> > > However my concern is that there are many other users of the
> > > transport_lock. Does it become more tricky not to introduce a
> > > problem by changing any one of the sites that take
> > > transport_lock?
> > 
> > What is your concern? The main reason for taking the transport
> > lock in the current code is twofold:
> > 
> > 1) Perform the lookup of the request on the receive list
> > 2) Ensure the request doesn't disappear while we're writing to it.
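
For reference, the intended usage pattern on the socket receive side
is then roughly the following (a simplified sketch of the xprtsock.c
side of the patch, with error handling and the details of the copy
elided, not the verbatim hunk):

	/* Find and pin the request corresponding to this reply */
	spin_lock_bh(&xprt->transport_lock);
	req = xprt_lookup_rqst(xprt, transport->tcp_xid);
	if (!req) {
		spin_unlock_bh(&xprt->transport_lock);
		return -1;
	}
	xprt_pin_rqst(req);
	spin_unlock_bh(&xprt->transport_lock);

	/* The bulk copy from the socket into req's XDR buffers happens
	 * here, without the transport lock held; the pin guarantees
	 * xprt_release() won't free those buffers underneath us. */

	spin_lock_bh(&xprt->transport_lock);
	if (!list_empty(&req->rq_list))
		xprt_complete_rqst(req->rq_task, transport->tcp_copied);
	xprt_unpin_rqst(req);
	spin_unlock_bh(&xprt->transport_lock);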
> 
> Yes, I think we see the same problem space.
> 
> I missed, though, that xprt_wait_on_pinned_rqst always releases
> transport_lock before invoking wait_on_bit.
> 
> 
> > > Despite the deadlock concern, I don't think it makes sense to
> > > call wait_on_bit while holding a spin lock simply because spin
> > > locks are supposed to be for things that are quick, like updating
> > > a data structure.
> > > 
> > > wait_on_bit can take microseconds: prepare_to_wait and finish_wait
> > > can both take an irqsave spin_lock, for example. bit_wait is the
> > > action that is done if the bit is not ready, and that calls
> > > schedule(). On a busy system, that kind of context switch can take
> > > dozens of microseconds.
> > 
> > Agreed, but we should only very rarely be hitting the wait case.
> > In all successful RPC calls, the receive will have happened long
> > before we call xprt_release(). In fact, the only 2 cases where this
> > can happen would be when the RPC call itself is interrupted. Either:
> > 
> > 1) The task is synchronous and was interrupted by the user.
> > 2) The task is terminating early due to a soft timeout.
> > 
> > In both those cases, we're in a non-performance path. Furthermore,
> > they would currently end up spinning against the transport lock.
> 
> OK, no argument that there is any performance concern.
> 
> I can hit this case 2 or 3 times out of 5 with generic/029 on
> NFSv4/RDMA. These days, the worst that happens is the client drops
> the RDMA connection, because I've spent a very long time immersed in
> these code paths, trying to make the two cases you list above work
> 100% without deadlock or crash. See commits 04d25b7d5d1b through
> 431af645cf66, and commit 68791649a725.
> 
> It makes me uncertain that waiting for anything in xprt_release is
> safe, even if the transport_lock is not being held. xprtrdma used to
> perform synchronous memory invalidation in ->buf_free. It no longer
> does, for that reason.


Full disclosure: I didn't actually enable this code for RDMA. The
first reason is that I won't pretend to understand the locking in
rpcrdma_reply_handler(); the second is that RDMA doesn't really have
the same problem with bulk copies that the socket-based code does:
direct placement into the reply buffer means the bulk of the code in
rpcrdma_reply_handler() is all about checking the alignment (with,
admittedly, some scary bits in rpcrdma_inline_fixup()).



> > > IMO it would be a lot nicer if we had an FSM state available that
> > > could allow another sleep while an RPC task is in xprt_release.
> > 
> > Why would that be preferable, if the argument holds that this only
> > happens in these 2 rare error cases?
> 
> Because FSM steps are the way this code manages sleeps. That
> makes it easier to understand and less brittle. I don't think
> it would be more overhead than test_bit().

The problem is that it would add new states to a number of functions
that currently call xprt_release(). Right now, that would mean:

call_reserveresult()
rpc_verify_header()
rpc_exit_task()

In addition, we'd need a separate solution for rpc_put_task(), and
probably also for rpc_release_task() (since these are both called from
outside the main state machine loop).

I'm getting a little worried about your reply above: the RPC engine
offers no real-time guarantees, so if the RDMA code is really this
latency-sensitive, then what will happen if we enable pre-emption
and/or we just have a very congested workqueue?
Also, how would a state machine solution help? I'd expect that to make
the latency worse, not better, since the work item would end up getting
requeued. If the worker thread then gets rescheduled as well...

One solution may be to simply say we're never going to use this
mechanism for RDMA. I suspect that if we just replace the use of
xprt->transport_lock for protecting the receive queue with a new lock
that wouldn't need to be bh-safe, then that might suffice to deal with
the RDMA latency issues. I have less confidence that would work for
the others, due to the bulk data copy issue.
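
Roughly, the idea would be something like the following (a hypothetical
sketch only: "recv_lock" is a name I'm making up for illustration, and
all the existing call sites would need auditing):

	/* struct rpc_xprt gains a lock dedicated to the receive list */
	spinlock_t	recv_lock;

	/* Receive path: look up and pin the request without having to
	 * shut out bottom halves */
	spin_lock(&xprt->recv_lock);
	req = xprt_lookup_rqst(xprt, xid);
	if (req)
		xprt_pin_rqst(req);
	spin_unlock(&xprt->recv_lock);
	if (!req)
		return;

	/* ...copy or place the data into req's buffers... */

	spin_lock(&xprt->recv_lock);
	xprt_complete_rqst(req->rq_task, copied);
	xprt_unpin_rqst(req);
	spin_unlock(&xprt->recv_lock);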

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx