Re: [PATCH v3] SUNRPC: Fix a race in the receive code path

Chuck Lever <chuck.lever@xxxxxxxxxx> · Wed, 13 Dec 2017 20:13:09 -0500

> On Dec 13, 2017, at 6:42 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> 
>> 
>> On Dec 13, 2017, at 11:14 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>> 
>> Hi Trond-
>> 
>>> On Dec 3, 2017, at 7:17 PM, Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
>>> 
>>> We must ensure that the call to rpc_sleep_on() in xprt_transmit() cannot
>>> race with the call to xprt_complete_rqst().
>>> 
>>> Reported-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=317
>>> Fixes: ce7c252a8c74 ("SUNRPC: Add a separate spinlock to protect..")
>>> Cc: stable@xxxxxxxxxxxxxxx # 4.14+
>>> Reviewed-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>> Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>>> ---
>>> net/sunrpc/xprt.c | 26 +++++++++++++++++---------
>>> 1 file changed, 17 insertions(+), 9 deletions(-)
>>> 
>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>> index 333b9d697ae5..5e4278e9ce37 100644
>>> --- a/net/sunrpc/xprt.c
>>> +++ b/net/sunrpc/xprt.c
>>> @@ -1024,6 +1024,7 @@ void xprt_transmit(struct rpc_task *task)
>>> 	} else if (!req->rq_bytes_sent)
>>> 		return;
>>> 
>>> +	req->rq_connect_cookie = xprt->connect_cookie;
>>> 	req->rq_xtime = ktime_get();
>>> 	status = xprt->ops->send_request(task);
>>> 	trace_xprt_transmit(xprt, req->rq_xid, status);
>>> @@ -1047,20 +1048,27 @@ void xprt_transmit(struct rpc_task *task)
>>> 	xprt->stat.bklog_u += xprt->backlog.qlen;
>>> 	xprt->stat.sending_u += xprt->sending.qlen;
>>> 	xprt->stat.pending_u += xprt->pending.qlen;
>>> +	spin_unlock_bh(&xprt->transport_lock);
>>> 
>>> -	/* Don't race with disconnect */
>>> -	if (!xprt_connected(xprt))
>>> -		task->tk_status = -ENOTCONN;
>>> -	else {
>>> +	if (rpc_reply_expected(task) && !READ_ONCE(req->rq_reply_bytes_recvd)) {
>>> 		/*
>>> -		 * Sleep on the pending queue since
>>> -		 * we're expecting a reply.
>>> +		 * Sleep on the pending queue if we're expecting a reply.
>>> +		 * The spinlock ensures atomicity between the test of
>>> +		 * req->rq_reply_bytes_recvd, and the call to rpc_sleep_on().
>>> 		 */
>>> -		if (!req->rq_reply_bytes_recvd && rpc_reply_expected(task))
>>> +		spin_lock(&xprt->recv_lock);
>>> +		if (!req->rq_reply_bytes_recvd) {
>>> 			rpc_sleep_on(&xprt->pending, task, xprt_timer);
>>> -		req->rq_connect_cookie = xprt->connect_cookie;
>>> +			/*
>>> +			 * Send an extra queue wakeup call if the
>>> +			 * connection was dropped in case the call to
>>> +			 * rpc_sleep_on() raced.
>>> +			 */
>>> +			if (!xprt_connected(xprt))
>>> +				xprt_wake_pending_tasks(xprt, -ENOTCONN);
>>> +		}
>>> +		spin_unlock(&xprt->recv_lock);
>>> 	}
>>> -	spin_unlock_bh(&xprt->transport_lock);
>>> }
>>> 
>>> static void xprt_add_backlog(struct rpc_xprt *xprt, struct rpc_task *task)
>>> -- 
>>> 2.14.3
>> 
>> I've run into a problem with this version of the patch (on v4.15-rc3).
>> 
>> With dbench on NFSv4.0 (on RDMA) I no longer see constant throughput
>> and a temporarily climbing latency followed by a recovery, but this
>> can happen on occasion:
>> 
>> releasing clients
>> 16       584   322.55 MB/sec  warmup   1 sec  latency 807.121 ms
>> 16       584   161.28 MB/sec  warmup   2 sec  latency 1807.175 ms
>> 16       584   107.52 MB/sec  warmup   3 sec  latency 2807.198 ms
>> 16       584    80.64 MB/sec  warmup   4 sec  latency 3807.217 ms
>> 16       584    64.51 MB/sec  warmup   5 sec  latency 4807.235 ms
>> 16       584    53.76 MB/sec  warmup   6 sec  latency 5807.251 ms
>> 16       584    46.08 MB/sec  warmup   7 sec  latency 6807.269 ms
>> 16       584    40.32 MB/sec  warmup   8 sec  latency 7807.285 ms
>> 16       584    35.84 MB/sec  warmup   9 sec  latency 8807.303 ms
>> 16       584    32.26 MB/sec  warmup  10 sec  latency 9807.318 ms
>> 16       584    29.32 MB/sec  warmup  11 sec  latency 10807.334 ms
>> 16       584    26.88 MB/sec  warmup  12 sec  latency 11807.355 ms
>> 16       584    24.81 MB/sec  warmup  13 sec  latency 12807.372 ms
>> 16       584    23.04 MB/sec  warmup  14 sec  latency 13807.387 ms
>> 16       584    21.50 MB/sec  warmup  15 sec  latency 14807.406 ms
>> 16       584    20.16 MB/sec  warmup  16 sec  latency 15807.423 ms
>> 
>> No recovery. The latency number keeps climbing, and throughput
>> drops.
>> 
>> "kernel: nfs: server klimt-ib not responding, still trying"
>> appears in the client's /var/log/messages.
>> 
>> Without this patch applied, the NFSv4.0 behavior is the same as I
>> reported with NFSv3: every once in a while, one RPC completion is
>> lost, but when the timer fires, the client notices the reply
>> actually did arrive and the RPC completes normally.
>> 
>> I'm looking into it.
> 
> Root cause:
> 
> The transport is temporarily out of resources and returns -ENOBUFS
> from ->send_request.
> 
> The FSM schedules a delay then calls call_transmit again, which
> invokes xprt_prepare_transmit.
> 
> On NFSv3 mounts, xprt_prepare_transmit tries to acquire the
> transport write lock.
> 
> On NFSv4 mounts, RPC_TASK_NO_RETRANS_TIMEOUT is set. This causes
> xprt_prepare_transmit to see that the transport is connected and
> rq_connect_cookie is the same as xprt->connect_cookie, so it
> queues the task on ->pending and returns false.
> 
> The transport never sends the Call, but still waits for a Reply,
> forever. This is not a bug in your patch, it is a pre-existing
> issue.

Sorry for one more reply...

I think this last statement is incorrect. It is a bug in your
patch after all.

I forgot that your patch moves the line that sets

    req->rq_connect_cookie = xprt->connect_cookie;

to before the call to ->send_request. But rq_connect_cookie
must be set iff ->send_request is successful. Moving this
line seems to fix the issue:

1027         req->rq_xtime = ktime_get();
1028         status = xprt->ops->send_request(task);
1029         trace_xprt_transmit(xprt, req->rq_xid, status);
1030         if (status != 0) {
1031                 task->tk_status = status;
1032                 return;
1033         }
1034         req->rq_connect_cookie = xprt->connect_cookie; <<<<
1035         xprt_inject_disconnect(xprt);

> There's a simple fix, which is to set
> 
>   rq_connect_cookie = xprt->connect_cookie - 1;
> 
> if RPC-over-RDMA header marshaling fails. This is a one-line fix,
> appropriate for -rc and backport to stable (to at least v4.8).

This fix does exactly the same thing.

> A more robust fix might be to have ->send_request return -EAGAIN
> instead, and let xprtrdma call xprt_write_space when the resource
> exhaustion is relieved. That might be more involved, and thus is
> future-facing.
> 
> I'm investigating both fixes.

--
Chuck Lever

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html