> On Dec 14, 2017, at 7:16 AM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Wed, 2017-12-13 at 20:13 -0500, Chuck Lever wrote:
>>> On Dec 13, 2017, at 6:42 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>> 
>>>> On Dec 13, 2017, at 11:14 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>> 
>>>> Hi Trond-
>>>> 
>>>>> On Dec 3, 2017, at 7:17 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>>> 
>>>>> We must ensure that the call to rpc_sleep_on() in xprt_transmit() cannot
>>>>> race with the call to xprt_complete_rqst().
>>>>> 
>>>>> Reported-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>>> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=317
>>>>> Fixes: ce7c252a8c74 ("SUNRPC: Add a separate spinlock to protect..")
>>>>> Cc: stable@xxxxxxxxxxxxxxx # 4.14+
>>>>> Reviewed-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>>> Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>>>>> ---
>>>>>  net/sunrpc/xprt.c | 26 +++++++++++++++++---------
>>>>>  1 file changed, 17 insertions(+), 9 deletions(-)
>>>>> 
>>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>>> index 333b9d697ae5..5e4278e9ce37 100644
>>>>> --- a/net/sunrpc/xprt.c
>>>>> +++ b/net/sunrpc/xprt.c
>>>>> @@ -1024,6 +1024,7 @@ void xprt_transmit(struct rpc_task *task)
>>>>>          } else if (!req->rq_bytes_sent)
>>>>>                  return;
>>>>>  
>>>>> +        req->rq_connect_cookie = xprt->connect_cookie;
>>>>>          req->rq_xtime = ktime_get();
>>>>>          status = xprt->ops->send_request(task);
>>>>>          trace_xprt_transmit(xprt, req->rq_xid, status);
>>>>> @@ -1047,20 +1048,27 @@ void xprt_transmit(struct rpc_task *task)
>>>>>          xprt->stat.bklog_u += xprt->backlog.qlen;
>>>>>          xprt->stat.sending_u += xprt->sending.qlen;
>>>>>          xprt->stat.pending_u += xprt->pending.qlen;
>>>>> +        spin_unlock_bh(&xprt->transport_lock);
>>>>>  
>>>>> -        /* Don't race with disconnect */
>>>>> -        if (!xprt_connected(xprt))
>>>>> -                task->tk_status = -ENOTCONN;
>>>>> -        else {
>>>>> +        if (rpc_reply_expected(task) && !READ_ONCE(req->rq_reply_bytes_recvd)) {
>>>>>                  /*
>>>>> -                 * Sleep on the pending queue since
>>>>> -                 * we're expecting a reply.
>>>>> +                 * Sleep on the pending queue if we're expecting a reply.
>>>>> +                 * The spinlock ensures atomicity between the test of
>>>>> +                 * req->rq_reply_bytes_recvd, and the call to rpc_sleep_on().
>>>>>                  */
>>>>> -                if (!req->rq_reply_bytes_recvd && rpc_reply_expected(task))
>>>>> +                spin_lock(&xprt->recv_lock);
>>>>> +                if (!req->rq_reply_bytes_recvd) {
>>>>>                          rpc_sleep_on(&xprt->pending, task, xprt_timer);
>>>>> -                req->rq_connect_cookie = xprt->connect_cookie;
>>>>> +                        /*
>>>>> +                         * Send an extra queue wakeup call if the
>>>>> +                         * connection was dropped in case the call to
>>>>> +                         * rpc_sleep_on() raced.
>>>>> +                         */
>>>>> +                        if (!xprt_connected(xprt))
>>>>> +                                xprt_wake_pending_tasks(xprt, -ENOTCONN);
>>>>> +                }
>>>>> +                spin_unlock(&xprt->recv_lock);
>>>>>          }
>>>>> -        spin_unlock_bh(&xprt->transport_lock);
>>>>>  }
>>>>>  
>>>>>  static void xprt_add_backlog(struct rpc_xprt *xprt, struct rpc_task *task)
>>>>> -- 
>>>>> 2.14.3
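
Aside, for readers reconstructing the race from the patch description: it is
the classic lost-wakeup pattern. xprt_transmit() tests
req->rq_reply_bytes_recvd and then sleeps; xprt_complete_rqst() sets that
field and wakes the queue. Unless the test-and-sleep is atomic with respect
to the waker, the wakeup can land between the two steps and be lost. Below is
a minimal userspace sketch of the same locking discipline -- all the names
are invented, and rpc_sleep_on() is not literally a condition variable, but
the rule carries over:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t recv_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pending   = PTHREAD_COND_INITIALIZER;
static bool reply_bytes_recvd;

/* "Transmit" side: test and sleep under the same lock. */
static void wait_for_reply(void)
{
        pthread_mutex_lock(&recv_lock);
        while (!reply_bytes_recvd)
                pthread_cond_wait(&pending, &recv_lock);
        pthread_mutex_unlock(&recv_lock);
}

/* "Receive" side: store and wake under the same lock. */
static void *complete_request(void *unused)
{
        (void)unused;
        pthread_mutex_lock(&recv_lock);
        reply_bytes_recvd = true;
        pthread_cond_signal(&pending);
        pthread_mutex_unlock(&recv_lock);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, complete_request, NULL);
        wait_for_reply();       /* cannot miss the wakeup */
        pthread_join(t, NULL);
        printf("reply received\n");
        return 0;
}

Dropping recv_lock between the test and the wait reintroduces exactly the
window this patch closes.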
>>>> 
>>>> I've run into a problem with this version of the patch (on v4.15-rc3).
>>>> 
>>>> With dbench on NFSv4.0 (on RDMA) I no longer see constant throughput
>>>> with the occasional temporary latency climb followed by a recovery;
>>>> instead, this can happen on occasion:
>>>> 
>>>> releasing clients
>>>>   16       584   322.55 MB/sec  warmup   1 sec  latency 807.121 ms
>>>>   16       584   161.28 MB/sec  warmup   2 sec  latency 1807.175 ms
>>>>   16       584   107.52 MB/sec  warmup   3 sec  latency 2807.198 ms
>>>>   16       584    80.64 MB/sec  warmup   4 sec  latency 3807.217 ms
>>>>   16       584    64.51 MB/sec  warmup   5 sec  latency 4807.235 ms
>>>>   16       584    53.76 MB/sec  warmup   6 sec  latency 5807.251 ms
>>>>   16       584    46.08 MB/sec  warmup   7 sec  latency 6807.269 ms
>>>>   16       584    40.32 MB/sec  warmup   8 sec  latency 7807.285 ms
>>>>   16       584    35.84 MB/sec  warmup   9 sec  latency 8807.303 ms
>>>>   16       584    32.26 MB/sec  warmup  10 sec  latency 9807.318 ms
>>>>   16       584    29.32 MB/sec  warmup  11 sec  latency 10807.334 ms
>>>>   16       584    26.88 MB/sec  warmup  12 sec  latency 11807.355 ms
>>>>   16       584    24.81 MB/sec  warmup  13 sec  latency 12807.372 ms
>>>>   16       584    23.04 MB/sec  warmup  14 sec  latency 13807.387 ms
>>>>   16       584    21.50 MB/sec  warmup  15 sec  latency 14807.406 ms
>>>>   16       584    20.16 MB/sec  warmup  16 sec  latency 15807.423 ms
>>>> 
>>>> No recovery. The latency number keeps climbing, and throughput drops.
>>>> 
>>>> "kernel: nfs: server klimt-ib not responding, still trying" appears
>>>> in the client's /var/log/messages.
>>>> 
>>>> Without this patch applied, the NFSv4.0 behavior is the same as I
>>>> reported with NFSv3: every once in a while, one RPC completion is
>>>> lost, but when the timer fires, the client notices that the reply
>>>> actually did arrive, and the RPC completes normally.
>>>> 
>>>> I'm looking into it.
>>> 
>>> Root cause:
>>> 
>>> The transport is temporarily out of resources and returns -ENOBUFS
>>> from ->send_request.
>>> 
>>> The FSM schedules a delay, then calls call_transmit again, which
>>> invokes xprt_prepare_transmit.
>>> 
>>> On NFSv3 mounts, xprt_prepare_transmit tries to acquire the transport
>>> write lock.
>>> 
>>> On NFSv4 mounts, RPC_TASK_NO_RETRANS_TIMEOUT is set. This causes
>>> xprt_prepare_transmit to see that the transport is connected and that
>>> rq_connect_cookie is the same as xprt->connect_cookie, so it queues
>>> the task on ->pending and returns false.
>>> 
>>> The transport never sends the Call, but still waits for a Reply,
>>> forever. This is not a bug in your patch; it is a pre-existing issue.
>> 
>> Sorry for one more reply...
>> 
>> I think this last statement is incorrect. It is a bug in your patch
>> after all.
>> 
>> I forgot that your patch moves the line that sets
>> 
>>         req->rq_connect_cookie = xprt->connect_cookie;
>> 
>> to before the call to ->send_request. But rq_connect_cookie must be
>> set iff ->send_request is successful. Moving this line seems to fix
>> the issue:
>> 
>> 1027         req->rq_xtime = ktime_get();
>> 1028         status = xprt->ops->send_request(task);
>> 1029         trace_xprt_transmit(xprt, req->rq_xid, status);
>> 1030         if (status != 0) {
>> 1031                 task->tk_status = status;
>> 1032                 return;
>> 1033         }
>> 1034         req->rq_connect_cookie = xprt->connect_cookie;    <<<<
>> 1035         xprt_inject_disconnect(xprt);
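
To make the hang concrete: the test in xprt_prepare_transmit() that parks
the task looks roughly like this in a v4.15-era tree (abridged and quoted
from memory -- consult the real net/sunrpc/xprt.c rather than trusting this
excerpt):

bool xprt_prepare_transmit(struct rpc_task *task)
{
        struct rpc_rqst *req = task->tk_rqstp;
        struct rpc_xprt *xprt = req->rq_xprt;
        bool ret = false;

        spin_lock_bh(&xprt->transport_lock);
        if (!req->rq_bytes_sent) {
                if (req->rq_reply_bytes_recvd) {
                        task->tk_status = req->rq_reply_bytes_recvd;
                        goto out_unlock;
                }
                if ((task->tk_flags & RPC_TASK_NO_RETRANS_TIMEOUT)
                    && xprt_connected(xprt)
                    && req->rq_connect_cookie == xprt->connect_cookie) {
                        /*
                         * A matching cookie means "the Call went out on
                         * this connection; wait for the Reply instead of
                         * retransmitting". If the cookie was set before a
                         * send that then failed, no Call is outstanding
                         * and this sleep never ends.
                         */
                        xprt->ops->set_retrans_timeout(task);
                        rpc_sleep_on(&xprt->pending, task, xprt_timer);
                        goto out_unlock;
                }
        }
        if (!xprt->ops->reserve_xprt(xprt, task)) {
                task->tk_status = -EAGAIN;
                goto out_unlock;
        }
        ret = true;
out_unlock:
        spin_unlock_bh(&xprt->transport_lock);
        return ret;
}

With the cookie already matching and no Call on the wire, the task sleeps
on ->pending with nothing left to wake it.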
> Let's rather apply the attached fix (incremental to the original patch
> and intended to be applied using 'git rebase --autosquash'). The reason
> for wanting to set the connect cookie before we call send_request() is
> to avoid races with disconnects that happen immediately after we've sent
> the data (but before we get round to setting the cookie).
> 
> 8<--------------------------------------------------------
> From 326c12727b179d5c2fd42f374a4c0e2d56ee90c5 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> Date: Thu, 14 Dec 2017 07:05:27 -0500
> Subject: [PATCH] fixup! SUNRPC: Fix a race in the receive code path
> 
> ---
>  net/sunrpc/xprt.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index 5e4278e9ce37..2cf4cf21a878 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -1021,10 +1021,11 @@ void xprt_transmit(struct rpc_task *task)
>                          /* Turn off autodisconnect */
>                          del_singleshot_timer_sync(&xprt->timer);
>                  }
> +                if (!req->rq_bytes_sent)
> +                        req->rq_connect_cookie = xprt->connect_cookie;
>          } else if (!req->rq_bytes_sent)
>                  return;
>  
> -        req->rq_connect_cookie = xprt->connect_cookie;
>          req->rq_xtime = ktime_get();
>          status = xprt->ops->send_request(task);
>          trace_xprt_transmit(xprt, req->rq_xid, status);
> -- 
> 2.14.3

Tested with a client that returns -ENOBUFS from ->send_request with a
little extra frequency. This doesn't appear to address the problem. The
transport deadlocks as before when send_request() fails to send the RPC.

I imagine this issue could also be triggered by write buffer exhaustion
on TCP.
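
For reference, the extra failure frequency came from instrumentation along
these lines. This is a sketch, not the actual hack I ran: the counter, the
modulus, and the xprt_real_send_request() helper are invented for
illustration, and the real change lived in the transport's ->send_request
implementation:

/* Fail roughly one send in a hundred with -ENOBUFS. */
static unsigned int injected_sends;

static int xprt_send_request_injected(struct rpc_task *task)
{
        if (++injected_sends % 100 == 0)
                return -ENOBUFS;
        return xprt_real_send_request(task);
}

--
Chuck Lever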