On Wed, 2017-12-13 at 20:13 -0500, Chuck Lever wrote:
> > On Dec 13, 2017, at 6:42 PM, Chuck Lever <chuck.lever@xxxxxxxxxx>
> > wrote:
> > 
> > > On Dec 13, 2017, at 11:14 AM, Chuck Lever <chuck.lever@xxxxxxxxxx>
> > > wrote:
> > > 
> > > Hi Trond-
> > > 
> > > > On Dec 3, 2017, at 7:17 PM, Trond Myklebust
> > > > <trond.myklebust@primarydata.com> wrote:
> > > > 
> > > > We must ensure that the call to rpc_sleep_on() in xprt_transmit()
> > > > cannot race with the call to xprt_complete_rqst().
> > > > 
> > > > Reported-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> > > > Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=317
> > > > Fixes: ce7c252a8c74 ("SUNRPC: Add a separate spinlock to protect..")
> > > > Cc: stable@xxxxxxxxxxxxxxx # 4.14+
> > > > Reviewed-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> > > > Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > ---
> > > >  net/sunrpc/xprt.c | 26 +++++++++++++++++---------
> > > >  1 file changed, 17 insertions(+), 9 deletions(-)
> > > > 
> > > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > > > index 333b9d697ae5..5e4278e9ce37 100644
> > > > --- a/net/sunrpc/xprt.c
> > > > +++ b/net/sunrpc/xprt.c
> > > > @@ -1024,6 +1024,7 @@ void xprt_transmit(struct rpc_task *task)
> > > >  	} else if (!req->rq_bytes_sent)
> > > >  		return;
> > > >  
> > > > +	req->rq_connect_cookie = xprt->connect_cookie;
> > > >  	req->rq_xtime = ktime_get();
> > > >  	status = xprt->ops->send_request(task);
> > > >  	trace_xprt_transmit(xprt, req->rq_xid, status);
> > > > @@ -1047,20 +1048,27 @@ void xprt_transmit(struct rpc_task *task)
> > > >  	xprt->stat.bklog_u += xprt->backlog.qlen;
> > > >  	xprt->stat.sending_u += xprt->sending.qlen;
> > > >  	xprt->stat.pending_u += xprt->pending.qlen;
> > > > +	spin_unlock_bh(&xprt->transport_lock);
> > > >  
> > > > -	/* Don't race with disconnect */
> > > > -	if (!xprt_connected(xprt))
> > > > -		task->tk_status = -ENOTCONN;
> > > > -	else {
> > > > +	if (rpc_reply_expected(task) && !READ_ONCE(req->rq_reply_bytes_recvd)) {
> > > >  		/*
> > > > -		 * Sleep on the pending queue since
> > > > -		 * we're expecting a reply.
> > > > +		 * Sleep on the pending queue if we're expecting a reply.
> > > > +		 * The spinlock ensures atomicity between the test of
> > > > +		 * req->rq_reply_bytes_recvd, and the call to rpc_sleep_on().
> > > >  		 */
> > > > -		if (!req->rq_reply_bytes_recvd && rpc_reply_expected(task))
> > > > +		spin_lock(&xprt->recv_lock);
> > > > +		if (!req->rq_reply_bytes_recvd) {
> > > >  			rpc_sleep_on(&xprt->pending, task, xprt_timer);
> > > > -		req->rq_connect_cookie = xprt->connect_cookie;
> > > > +			/*
> > > > +			 * Send an extra queue wakeup call if the
> > > > +			 * connection was dropped in case the call to
> > > > +			 * rpc_sleep_on() raced.
> > > > +			 */
> > > > +			if (!xprt_connected(xprt))
> > > > +				xprt_wake_pending_tasks(xprt, -ENOTCONN);
> > > > +		}
> > > > +		spin_unlock(&xprt->recv_lock);
> > > >  	}
> > > > -	spin_unlock_bh(&xprt->transport_lock);
> > > >  }
> > > >  
> > > >  static void xprt_add_backlog(struct rpc_xprt *xprt, struct rpc_task *task)
> > > > -- 
> > > > 2.14.3
> > > 
> > > I've run into a problem with this version of the patch (on
> > > v4.15-rc3).
> > > 
> > > With dbench on NFSv4.0 (on RDMA) I no longer see constant
> > > throughput with the occasional temporary latency climb followed
> > > by a recovery. Instead, this can happen:
> > > 
> > > releasing clients
> > >   16   584   322.55 MB/sec  warmup   1 sec  latency 807.121 ms
> > >   16   584   161.28 MB/sec  warmup   2 sec  latency 1807.175 ms
> > >   16   584   107.52 MB/sec  warmup   3 sec  latency 2807.198 ms
> > >   16   584    80.64 MB/sec  warmup   4 sec  latency 3807.217 ms
> > >   16   584    64.51 MB/sec  warmup   5 sec  latency 4807.235 ms
> > >   16   584    53.76 MB/sec  warmup   6 sec  latency 5807.251 ms
> > >   16   584    46.08 MB/sec  warmup   7 sec  latency 6807.269 ms
> > >   16   584    40.32 MB/sec  warmup   8 sec  latency 7807.285 ms
> > >   16   584    35.84 MB/sec  warmup   9 sec  latency 8807.303 ms
> > >   16   584    32.26 MB/sec  warmup  10 sec  latency 9807.318 ms
> > >   16   584    29.32 MB/sec  warmup  11 sec  latency 10807.334 ms
> > >   16   584    26.88 MB/sec  warmup  12 sec  latency 11807.355 ms
> > >   16   584    24.81 MB/sec  warmup  13 sec  latency 12807.372 ms
> > >   16   584    23.04 MB/sec  warmup  14 sec  latency 13807.387 ms
> > >   16   584    21.50 MB/sec  warmup  15 sec  latency 14807.406 ms
> > >   16   584    20.16 MB/sec  warmup  16 sec  latency 15807.423 ms
> > > 
> > > No recovery. The latency number keeps climbing, and throughput
> > > drops.
> > > 
> > > "kernel: nfs: server klimt-ib not responding, still trying"
> > > appears in the client's /var/log/messages.
> > > 
> > > Without this patch applied, the NFSv4.0 behavior is the same as I
> > > reported with NFSv3: every once in a while, one RPC completion is
> > > lost, but when the timer fires, the client notices that the reply
> > > actually did arrive, and the RPC completes normally.
> > > 
> > > I'm looking into it.
> > 
> > Root cause:
> > 
> > The transport is temporarily out of resources and returns -ENOBUFS
> > from ->send_request.
> > 
> > The FSM schedules a delay, then calls call_transmit again, which
> > invokes xprt_prepare_transmit.
> > 
> > On NFSv3 mounts, xprt_prepare_transmit tries to acquire the
> > transport write lock.
> > 
> > On NFSv4 mounts, RPC_TASK_NO_RETRANS_TIMEOUT is set. This causes
> > xprt_prepare_transmit to see that the transport is connected and
> > that rq_connect_cookie is the same as xprt->connect_cookie, so it
> > queues the task on ->pending and returns false.
> > 
> > The transport never sends the Call, but still waits for a Reply,
> > forever. This is not a bug in your patch; it is a pre-existing
> > issue.
> 
> Sorry for one more reply...
> 
> I think this last statement is incorrect. It is a bug in your
> patch after all.
> 
> I forgot that your patch moves the line that sets
> 
>         req->rq_connect_cookie = xprt->connect_cookie;
> 
> to before the call to ->send_request. But rq_connect_cookie must be
> set only if ->send_request is successful. Moving this line seems to
> fix the issue:
> 
> 1027         req->rq_xtime = ktime_get();
> 1028         status = xprt->ops->send_request(task);
> 1029         trace_xprt_transmit(xprt, req->rq_xid, status);
> 1030         if (status != 0) {
> 1031                 task->tk_status = status;
> 1032                 return;
> 1033         }
> 1034         req->rq_connect_cookie = xprt->connect_cookie;    <<<<
> 1035         xprt_inject_disconnect(xprt);

Let's rather apply the attached fix (incremental to the original patch,
and intended to be applied using 'git rebase --autosquash'). The reason
for wanting to set the connect cookie before we call send_request() is
to avoid races with disconnects that happen immediately after we have
sent the data (but before we get round to setting the cookie).
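For reference, the check in xprt_prepare_transmit() that Chuck is
describing looks roughly like this (an abbreviated sketch of the
v4.15-era code, quoted from memory rather than verbatim):

bool xprt_prepare_transmit(struct rpc_task *task)
{
        struct rpc_rqst *req = task->tk_rqstp;
        struct rpc_xprt *xprt = req->rq_xprt;
        bool ret = false;

        spin_lock_bh(&xprt->transport_lock);
        if (!req->rq_bytes_sent) {
                if (req->rq_reply_bytes_recvd) {
                        /* The reply already arrived; nothing to send */
                        task->tk_status = req->rq_reply_bytes_recvd;
                        goto out_unlock;
                }
                if ((task->tk_flags & RPC_TASK_NO_RETRANS_TIMEOUT) &&
                    xprt_connected(xprt) &&
                    req->rq_connect_cookie == xprt->connect_cookie) {
                        /*
                         * The cookie matches the live connection, so
                         * assume the Call is in flight and wait for
                         * the Reply instead of retransmitting.
                         */
                        xprt->ops->set_retrans_timeout(task);
                        rpc_sleep_on(&xprt->pending, task, xprt_timer);
                        goto out_unlock;
                }
        }
        if (!xprt->ops->reserve_xprt(xprt, task)) {
                task->tk_status = -EAGAIN;
                goto out_unlock;
        }
        ret = true;
out_unlock:
        spin_unlock_bh(&xprt->transport_lock);
        return ret;
}

If rq_connect_cookie already matches the live connection before a
send_request() call that then fails (with -ENOBUFS, for instance), the
retransmit attempt takes that middle branch: the task sleeps on
->pending waiting for a Reply to a Call that was never sent, which is
the hang reported above.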
8<--------------------------------------------------------
From 326c12727b179d5c2fd42f374a4c0e2d56ee90c5 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
Date: Thu, 14 Dec 2017 07:05:27 -0500
Subject: [PATCH] fixup! SUNRPC: Fix a race in the receive code path

---
 net/sunrpc/xprt.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 5e4278e9ce37..2cf4cf21a878 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1021,10 +1021,11 @@ void xprt_transmit(struct rpc_task *task)
 			/* Turn off autodisconnect */
 			del_singleshot_timer_sync(&xprt->timer);
 		}
+		if (!req->rq_bytes_sent)
+			req->rq_connect_cookie = xprt->connect_cookie;
 	} else if (!req->rq_bytes_sent)
 		return;
 
-	req->rq_connect_cookie = xprt->connect_cookie;
 	req->rq_xtime = ktime_get();
 	status = xprt->ops->send_request(task);
 	trace_xprt_transmit(xprt, req->rq_xid, status);
-- 
2.14.3

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx