> On Aug 14, 2017, at 4:38 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Mon, 2017-08-14 at 16:23 -0400, Chuck Lever wrote:
>>> On Aug 14, 2017, at 4:07 PM, Trond Myklebust <trondmy@primarydata.com> wrote:
>>> 
>>> On Mon, 2017-08-14 at 15:28 -0400, Chuck Lever wrote:
>>>>> On Aug 14, 2017, at 3:16 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>>> 
>>>>> Instead add a mechanism to ensure that the request doesn't disappear
>>>>> from underneath us while copying from the socket. We do this by
>>>>> preventing xprt_release() from freeing the XDR buffers until the
>>>>> flag RPC_TASK_MSG_RECV has been cleared from the request.
>>>>> 
>>>>> Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>>>>> ---
>>>>>  include/linux/sunrpc/sched.h |  2 ++
>>>>>  include/linux/sunrpc/xprt.h  |  2 ++
>>>>>  net/sunrpc/xprt.c            | 42 ++++++++++++++++++++++++++++++++++++++++++
>>>>>  net/sunrpc/xprtsock.c        | 27 ++++++++++++++++++++++-----
>>>>>  4 files changed, 68 insertions(+), 5 deletions(-)
>>>>> 
>>>>> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
>>>>> index 15bc1cd6ee5c..e972d9e05426 100644
>>>>> --- a/include/linux/sunrpc/sched.h
>>>>> +++ b/include/linux/sunrpc/sched.h
>>>>> @@ -141,6 +141,8 @@ struct rpc_task_setup {
>>>>>  #define RPC_TASK_ACTIVE         2
>>>>>  #define RPC_TASK_MSG_XMIT       3
>>>>>  #define RPC_TASK_MSG_XMIT_WAIT  4
>>>>> +#define RPC_TASK_MSG_RECV       5
>>>>> +#define RPC_TASK_MSG_RECV_WAIT  6
>>>>>  
>>>>>  #define RPC_IS_RUNNING(t)       test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
>>>>>  #define rpc_set_running(t)      set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
>>>>> 
>>>>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>>>>> index eab1c749e192..65b9e0224753 100644
>>>>> --- a/include/linux/sunrpc/xprt.h
>>>>> +++ b/include/linux/sunrpc/xprt.h
>>>>> @@ -372,6 +372,8 @@ void			xprt_write_space(struct rpc_xprt *xprt);
>>>>>  void			xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
>>>>>  struct rpc_rqst *	xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
>>>>>  void			xprt_complete_rqst(struct rpc_task *task, int copied);
>>>>> +void			xprt_pin_rqst(struct rpc_rqst *req);
>>>>> +void			xprt_unpin_rqst(struct rpc_rqst *req);
>>>>>  void			xprt_release_rqst_cong(struct rpc_task *task);
>>>>>  void			xprt_disconnect_done(struct rpc_xprt *xprt);
>>>>>  void			xprt_force_disconnect(struct rpc_xprt *xprt);
>>>>> 
>>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>>> index 788c1b6138c2..62c379865f7c 100644
>>>>> --- a/net/sunrpc/xprt.c
>>>>> +++ b/net/sunrpc/xprt.c
>>>>> @@ -844,6 +844,47 @@ struct rpc_rqst *xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid)
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(xprt_lookup_rqst);
>>>>>  
>>>>> +/**
>>>>> + * xprt_pin_rqst - Pin a request on the transport receive list
>>>>> + * @req: Request to pin
>>>>> + *
>>>>> + * Caller must ensure this is atomic with the call to xprt_lookup_rqst()
>>>>> + * so should be holding the xprt transport lock.
>>>>> + */
>>>>> +void xprt_pin_rqst(struct rpc_rqst *req)
>>>>> +{
>>>>> +	set_bit(RPC_TASK_MSG_RECV, &req->rq_task->tk_runstate);
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * xprt_unpin_rqst - Unpin a request on the transport receive list
>>>>> + * @req: Request to pin
>>>>> + *
>>>>> + * Caller should be holding the xprt transport lock.
>>>>> + */
>>>>> +void xprt_unpin_rqst(struct rpc_rqst *req)
>>>>> +{
>>>>> +	struct rpc_task *task = req->rq_task;
>>>>> +
>>>>> +	clear_bit(RPC_TASK_MSG_RECV, &task->tk_runstate);
>>>>> +	if (test_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate))
>>>>> +		wake_up_bit(&task->tk_runstate, RPC_TASK_MSG_RECV);
>>>>> +}
>>>>> +
>>>>> +static void xprt_wait_on_pinned_rqst(struct rpc_rqst *req)
>>>>> +__must_hold(&req->rq_xprt->transport_lock)
>>>>> +{
>>>>> +	struct rpc_task *task = req->rq_task;
>>>>> +
>>>>> +	if (test_bit(RPC_TASK_MSG_RECV, &task->tk_runstate)) {
>>>>> +		spin_unlock_bh(&req->rq_xprt->transport_lock);
>>>>> +		set_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate);
>>>>> +		wait_on_bit(&task->tk_runstate, RPC_TASK_MSG_RECV,
>>>>> +				TASK_UNINTERRUPTIBLE);
>>>>> +		clear_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate);
>>>>> +		spin_lock_bh(&req->rq_xprt->transport_lock);
>>>>> +	}
>>>>> +}
>>>>> +
>>>>>  static void xprt_update_rtt(struct rpc_task *task)
>>>>>  {
>>>>>  	struct rpc_rqst *req = task->tk_rqstp;
>>>>> @@ -1301,6 +1342,7 @@ void xprt_release(struct rpc_task *task)
>>>>>  		list_del(&req->rq_list);
>>>>>  		xprt->last_used = jiffies;
>>>>>  		xprt_schedule_autodisconnect(xprt);
>>>>> +		xprt_wait_on_pinned_rqst(req);
>>>>>  		spin_unlock_bh(&xprt->transport_lock);
>>>> 
>>>> Is it OK to call wait_on_bit(TASK_UNINTERRUPTIBLE) while holding
>>>> a BH spin lock? This could be prone to deadlock.
>>> 
>>> We drop the lock inside xprt_wait_on_pinned_rqst() if we need to
>>> wait.
>>> 
>>> The reason why we want to hold the lock there is to avoid a
>>> use-after-free in xprt_unpin_rqst(). Without the lock, there is a
>>> risk that xprt_release() could complete, and the task get freed
>>> before we have completed wake_up_bit().
>> 
>> Agree with the last bit. I had that challenge with the order of
>> memory invalidation (which waits) and xprt_lookup_rqst/complete_rqst
>> in rpcrdma_reply_handler.
>> 
>> However my concern is that there are many other users of the
>> transport_lock. Does it become more tricky not to introduce a
>> problem by changing any one of the sites that take transport_lock?
> 
> What is your concern? The main reason for taking the transport lock
> in the current code is twofold:
> 
> 1) Perform the lookup of the request on the receive list
> 2) Ensure the request doesn't disappear while we're writing to it.

Yes, I think we see the same problem space. I missed, though, that
xprt_wait_on_pinned_rqst always releases transport_lock before
invoking wait_on_bit.
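To make sure I'm reading it right, here is a sketch of how I
understand a receive path would use the new API. The xprtsock.c hunk
isn't quoted above, so the function and helper names below are my
own, not the patch's:

    /* Illustrative only; assumes <linux/sunrpc/xprt.h>. */
    static int example_copy_into_bufs(struct rpc_rqst *req);  /* hypothetical */

    static void example_read_reply(struct rpc_xprt *xprt, __be32 xid)
    {
            struct rpc_rqst *req;
            int copied;

            /* Look up and pin atomically under the transport lock,
             * so the request cannot be released between the two. */
            spin_lock_bh(&xprt->transport_lock);
            req = xprt_lookup_rqst(xprt, xid);
            if (!req) {
                    spin_unlock_bh(&xprt->transport_lock);
                    return;
            }
            xprt_pin_rqst(req);
            spin_unlock_bh(&xprt->transport_lock);

            /* Safe to fill req->rq_private_buf without the lock:
             * xprt_release() now blocks in xprt_wait_on_pinned_rqst()
             * until we unpin. */
            copied = example_copy_into_bufs(req);

            spin_lock_bh(&xprt->transport_lock);
            if (copied >= 0)
                    xprt_complete_rqst(req->rq_task, copied);
            xprt_unpin_rqst(req);   /* may wake a waiter in xprt_release() */
            spin_unlock_bh(&xprt->transport_lock);
    }

The point being that the copy itself runs without transport_lock
held, while lookup+pin and complete+unpin each stay atomic under it.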
>> Despite the deadlock concern, I don't think it makes sense to call
>> wait_on_bit while holding a spin lock simply because spin locks are
>> supposed to be for things that are quick, like updating a data
>> structure.
>> 
>> wait_on_bit can take microseconds: prepare_to_wait and finish_wait
>> can both take an irqsave spin_lock, for example. bit_wait is the
>> action that is done if the bit is not ready, and that calls
>> schedule(). On a busy system, that kind of context switch can take
>> dozens of microseconds.
> 
> Agreed, but we should only very rarely be hitting the wait case. In
> all successful RPC calls, the receive will have happened long before
> we call xprt_release(). In fact, the only two cases where this can
> happen would be when the RPC call itself is interrupted. Either:
> 
> 1) The task is synchronous and was interrupted by the user.
> 2) The task is terminating early due to a soft timeout.
> 
> In both those cases, we're in a non-performance path. Furthermore,
> they would currently end up spinning against the transport lock.

OK, agreed that there is no performance concern here. I can hit this
case 2 or 3 times out of 5 with generic/029 on NFSv4/RDMA. These days
the worst that happens is the client drops the RDMA connection,
because I've spent a very long time immersed in these code paths
trying to make the two cases you list above work 100% without
deadlock or crash. See commits 04d25b7d5d1b through 431af645cf66, and
commit 68791649a725.

It makes me uncertain that waiting for anything in xprt_release is
safe, even if the transport_lock is not being held. xprtrdma used to
perform synchronous memory invalidation in ->buf_free. It no longer
does, for this reason.

>> IMO it would be a lot nicer if we had an FSM state available that
>> could allow an RPC task another sleep during xprt_release.
> 
> Why would that be preferable, if the argument holds that this only
> happens in these 2 rare error cases?

Because FSM steps are the way this code manages sleeps. That makes it
easier to understand and less brittle. I don't think it would be more
overhead than test_bit().
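Something along these lines is what I'm picturing. Purely
illustrative: rpc_sleep_on() and the xprt->pending wait queue exist
today, but none of these state names do:

    static void call_release_done(struct rpc_task *task);  /* invented state */

    static void call_wait_unpin(struct rpc_task *task)
    {
            struct rpc_rqst *req = task->tk_rqstp;

            if (test_bit(RPC_TASK_MSG_RECV, &task->tk_runstate)) {
                    /* Re-enter this state when woken. The receive
                     * side would then call rpc_wake_up_queued_task()
                     * from xprt_unpin_rqst() instead of
                     * wake_up_bit(). */
                    task->tk_action = call_wait_unpin;
                    rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
                    return;
            }
            task->tk_action = call_release_done;
    }

That keeps the sleep inside the scheduler's own queueing, which is
the machinery the rest of this code already uses.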
--
Chuck Lever