On Sun, 2017-12-03 at 18:33 -0500, Chuck Lever wrote:
> > On Dec 3, 2017, at 3:24 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxm> wrote:
> > 
> > On Sun, 2017-12-03 at 15:19 -0500, Chuck Lever wrote:
> > > > On Dec 3, 2017, at 3:12 PM, Trond Myklebust <trondmy@primarydata.com> wrote:
> > > > 
> > > > On Sun, 2017-12-03 at 13:54 -0500, Chuck Lever wrote:
> > > > > > On Dec 3, 2017, at 1:50 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > > > > > 
> > > > > > We must ensure that the call to rpc_sleep_on() in xprt_transmit() cannot
> > > > > > race with the call to xprt_complete_rqst().
> > > > > 
> > > > > :-( this will kill scalability, we might as well just go back
> > > > > to the old locking scheme.
> > > > 
> > > > It shouldn't make a huge difference, but I agree that we do want to
> > > > get rid of that transport lock.
> > > > 
> > > > How about the following, then?
> > > 
> > > This looks better, I'll give it a try!
> 
> After testing for several hours, I was not able to reproduce the
> lost RPC completion symptom.
> 
> I've been concerned about the performance impact. 8-thread dbench
> throughput regresses 2-3% on my 56Gb/s network. There will probably
> not be any noticeable degradation on a slower fabric.
> 
> Tested-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> 
> > > But touching the recv_lock in the transmit path is what I'd like
> > > to avoid completely, if possible. I'm not clear on why the pending
> > > waitqueue is unprotected. Doesn't it have a lock of its own?
> > 
> > The recv_lock ensures that the check of req->rq_reply_bytes_recvd is
> > atomic with the call to rpc_sleep_on().
> 
> Ah, makes sense. I agree it is an oversight in ce7c252a8c7 not to
> have made this change at the same time.
> 
> > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > Date: Sun, 3 Dec 2017 13:37:27 -0500
> > Subject: [PATCH v2] SUNRPC: Fix a race in the receive code path
> > 
> > We must ensure that the call to rpc_sleep_on() in xprt_transmit() cannot
> > race with the call to xprt_complete_rqst().
> 
> IMHO the patch description could mention rq_reply_bytes_recvd, which
> is the data that is protected by the recv_lock. More bread crumbs
> for our future selves...
> 

I'll put a comment in the code itself. That makes the code easier to review.

> > @@ -1048,19 +1049,22 @@ void xprt_transmit(struct rpc_task *task)
> >  		xprt->stat.sending_u += xprt->sending.qlen;
> >  		xprt->stat.pending_u += xprt->pending.qlen;
> > 
> > -	/* Don't race with disconnect */
> > -	if (!xprt_connected(xprt))
> > -		task->tk_status = -ENOTCONN;
> > -	else {
> > +	spin_unlock_bh(&xprt->transport_lock);
> > +
> > +	if (!READ_ONCE(req->rq_reply_bytes_recvd) && rpc_reply_expected(task)) {
> 
> I was wondering about this optimization. It seems to provide about a
> 1% benefit in dbench throughput results in my setup, though it is
> certainly not a common case that the reply has arrived at this point.

It's not too likely an occurrence for most setups, but it costs little
to do that check, and it does mean that we can optimise away the case
of RPC calls that expect no reply.

> > +		spin_lock(&xprt->recv_lock);
> >  		/*
> >  		 * Sleep on the pending queue since
> >  		 * we're expecting a reply.
> >  		 */
> > -		if (!req->rq_reply_bytes_recvd && rpc_reply_expected(task))
> > +		if (!req->rq_reply_bytes_recvd) {
> >  			rpc_sleep_on(&xprt->pending, task, xprt_timer);
> > -		req->rq_connect_cookie = xprt->connect_cookie;
> > +			/* Deal with disconnect races */
> > +			if (!xprt_connected(xprt))
> > +				xprt_wake_pending_tasks(xprt, -ENOTCONN);
> 
> Here, the !xprt_connected test is no longer done unconditionally,
> but only when the task has to be put to sleep. Is that a 100% safe
> change?

Yes.
We only care about the race case for this particular RPC call. Any
other RPCs that are already waiting on xprt->pending will be woken by
the transport layer itself.

> I'm also wondering if this check can be omitted. Won't disconnect
> wait until xprt_release_xprt ?

If there are no other RPC calls pending to drive a reconnect, then we
can end up waiting for this RPC call to time out if we race here.

> Otherwise if this check is still needed, the new comment could be
> a little more explanatory :-)
> 
> > +		}
> > +		spin_unlock(&xprt->recv_lock);
> >  	}
> > -	spin_unlock_bh(&xprt->transport_lock);
> >  }
> 
> Reviewed-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> 
> And thanks for your quick response!
> 
> -- 
> Chuck Lever

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx