> On Dec 14, 2017, at 2:03 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Thu, 2017-12-14 at 10:49 -0500, Chuck Lever wrote:
>>> On Dec 14, 2017, at 7:16 AM, Trond Myklebust <trondmy@primarydata.com> wrote:
>>> 
>>> On Wed, 2017-12-13 at 20:13 -0500, Chuck Lever wrote:
>>>>> On Dec 13, 2017, at 6:42 PM, Chuck Lever <chuck.lever@xxxxxxxxxm> wrote:
>>>>> 
>>>>>> 
>>>>>> On Dec 13, 2017, at 11:14 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>> 
>>>>>> Hi Trond-
>>>>>> 
>>>>>>> On Dec 3, 2017, at 7:17 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>>>>> 
>>>>>>> We must ensure that the call to rpc_sleep_on() in xprt_transmit() cannot
>>>>>>> race with the call to xprt_complete_rqst().
>>>>>>> 
>>>>>>> Reported-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>>>>> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=317
>>>>>>> Fixes: ce7c252a8c74 ("SUNRPC: Add a separate spinlock to protect..")
>>>>>>> Cc: stable@xxxxxxxxxxxxxxx # 4.14+
>>>>>>> Reviewed-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>>>>> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
>>>>>>> 
>>>>>>> ---
>>>>>>>  net/sunrpc/xprt.c | 26 +++++++++++++++++---------
>>>>>>>  1 file changed, 17 insertions(+), 9 deletions(-)
>>>>>>> 
>>>>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>>>>> index 333b9d697ae5..5e4278e9ce37 100644
>>>>>>> --- a/net/sunrpc/xprt.c
>>>>>>> +++ b/net/sunrpc/xprt.c
>>>>>>> @@ -1024,6 +1024,7 @@ void xprt_transmit(struct rpc_task *task)
>>>>>>>          } else if (!req->rq_bytes_sent)
>>>>>>>                  return;
>>>>>>> 
>>>>>>> +        req->rq_connect_cookie = xprt->connect_cookie;
>>>>>>>          req->rq_xtime = ktime_get();
>>>>>>>          status = xprt->ops->send_request(task);
>>>>>>>          trace_xprt_transmit(xprt, req->rq_xid, status);
>>>>>>> @@ -1047,20 +1048,27 @@ void xprt_transmit(struct rpc_task *task)
>>>>>>>                  xprt->stat.bklog_u += xprt->backlog.qlen;
>>>>>>>                  xprt->stat.sending_u += xprt->sending.qlen;
>>>>>>>                  xprt->stat.pending_u += xprt->pending.qlen;
>>>>>>> +        spin_unlock_bh(&xprt->transport_lock);
>>>>>>> 
>>>>>>> -        /* Don't race with disconnect */
>>>>>>> -        if (!xprt_connected(xprt))
>>>>>>> -                task->tk_status = -ENOTCONN;
>>>>>>> -        else {
>>>>>>> +        if (rpc_reply_expected(task) && !READ_ONCE(req->rq_reply_bytes_recvd)) {
>>>>>>>                  /*
>>>>>>> -                 * Sleep on the pending queue since
>>>>>>> -                 * we're expecting a reply.
>>>>>>> +                 * Sleep on the pending queue if we're expecting a reply.
>>>>>>> +                 * The spinlock ensures atomicity between the test of
>>>>>>> +                 * req->rq_reply_bytes_recvd, and the call to rpc_sleep_on().
>>>>>>>                   */
>>>>>>> -                if (!req->rq_reply_bytes_recvd && rpc_reply_expected(task))
>>>>>>> +                spin_lock(&xprt->recv_lock);
>>>>>>> +                if (!req->rq_reply_bytes_recvd) {
>>>>>>>                          rpc_sleep_on(&xprt->pending, task, xprt_timer);
>>>>>>> -                req->rq_connect_cookie = xprt->connect_cookie;
>>>>>>> +                        /*
>>>>>>> +                         * Send an extra queue wakeup call if the
>>>>>>> +                         * connection was dropped in case the call to
>>>>>>> +                         * rpc_sleep_on() raced.
>>>>>>> +                         */
>>>>>>> +                        if (!xprt_connected(xprt))
>>>>>>> +                                xprt_wake_pending_tasks(xprt, -ENOTCONN);
>>>>>>> +                }
>>>>>>> +                spin_unlock(&xprt->recv_lock);
>>>>>>>          }
>>>>>>> -        spin_unlock_bh(&xprt->transport_lock);
>>>>>>>  }
>>>>>>> 
>>>>>>>  static void xprt_add_backlog(struct rpc_xprt *xprt, struct rpc_task *task)
>>>>>>> --
>>>>>>> 2.14.3
>>>>>> 
>>>>>> I've run into a problem with this version of the patch (on v4.15-rc3).
>>>>>> 
>>>>>> With dbench on NFSv4.0 (on RDMA) I no longer see constant throughput
>>>>>> and a temporarily climbing latency followed by a recovery, but this
>>>>>> can happen on occasion:
>>>>>> 
>>>>>> releasing clients
>>>>>>   16       584   322.55 MB/sec  warmup   1 sec  latency 807.121 ms
>>>>>>   16       584   161.28 MB/sec  warmup   2 sec  latency 1807.175 ms
>>>>>>   16       584   107.52 MB/sec  warmup   3 sec  latency 2807.198 ms
>>>>>>   16       584    80.64 MB/sec  warmup   4 sec  latency 3807.217 ms
>>>>>>   16       584    64.51 MB/sec  warmup   5 sec  latency 4807.235 ms
>>>>>>   16       584    53.76 MB/sec  warmup   6 sec  latency 5807.251 ms
>>>>>>   16       584    46.08 MB/sec  warmup   7 sec  latency 6807.269 ms
>>>>>>   16       584    40.32 MB/sec  warmup   8 sec  latency 7807.285 ms
>>>>>>   16       584    35.84 MB/sec  warmup   9 sec  latency 8807.303 ms
>>>>>>   16       584    32.26 MB/sec  warmup  10 sec  latency 9807.318 ms
>>>>>>   16       584    29.32 MB/sec  warmup  11 sec  latency 10807.334 ms
>>>>>>   16       584    26.88 MB/sec  warmup  12 sec  latency 11807.355 ms
>>>>>>   16       584    24.81 MB/sec  warmup  13 sec  latency 12807.372 ms
>>>>>>   16       584    23.04 MB/sec  warmup  14 sec  latency 13807.387 ms
>>>>>>   16       584    21.50 MB/sec  warmup  15 sec  latency 14807.406 ms
>>>>>>   16       584    20.16 MB/sec  warmup  16 sec  latency 15807.423 ms
>>>>>> 
>>>>>> No recovery. The latency number keeps climbing, and throughput drops.
>>>>>> 
>>>>>> "kernel: nfs: server klimt-ib not responding, still trying"
>>>>>> appears in the client's /var/log/messages.
>>>>>> 
>>>>>> Without this patch applied, the NFSv4.0 behavior is the same as I
>>>>>> reported with NFSv3: every once in a while, one RPC completion is
>>>>>> lost, but when the timer fires, the client notices the reply
>>>>>> actually did arrive and the RPC completes normally.
>>>>>> 
>>>>>> I'm looking into it.
>>>>> 
>>>>> Root cause:
>>>>> 
>>>>> The transport is temporarily out of resources and returns -ENOBUFS
>>>>> from ->send_request.
>>>>> 
>>>>> The FSM schedules a delay then calls call_transmit again, which
>>>>> invokes xprt_prepare_transmit.
>>>>> 
>>>>> On NFSv3 mounts, xprt_prepare_transmit tries to acquire the
>>>>> transport write lock.
>>>>> 
>>>>> On NFSv4 mounts, RPC_TASK_NO_RETRANS_TIMEOUT is set. This causes
>>>>> xprt_prepare_transmit to see that the transport is connected and
>>>>> rq_connect_cookie is the same as xprt->connect_cookie, so it
>>>>> queues the task on ->pending and returns false.
>>>>> 
>>>>> The transport never sends the Call, but still waits for a Reply,
>>>>> forever. This is not a bug in your patch, it is a pre-existing
>>>>> issue.
>>>> 
>>>> Sorry for one more reply...
>>>> 
>>>> I think this last statement is incorrect. It is a bug in your
>>>> patch after all.
>>>> 
>>>> I forgot that your patch moves the line that sets
>>>> 
>>>>    req->rq_connect_cookie = xprt->connect_cookie;
>>>> 
>>>> to before the call to ->send_request. But rq_connect_cookie
>>>> must be set iff ->send_request is successful.
>>>> Moving this line seems to fix the issue:
>>>> 
>>>> 1027         req->rq_xtime = ktime_get();
>>>> 1028         status = xprt->ops->send_request(task);
>>>> 1029         trace_xprt_transmit(xprt, req->rq_xid, status);
>>>> 1030         if (status != 0) {
>>>> 1031                 task->tk_status = status;
>>>> 1032                 return;
>>>> 1033         }
>>>> 1034         req->rq_connect_cookie = xprt->connect_cookie;    <<<<
>>>> 1035         xprt_inject_disconnect(xprt);
>>>> 
>>> 
>>> Let's rather apply the attached fix (incremental to the original patch
>>> and intended to be applied using 'git rebase --autosquash'). The reason
>>> for wanting to set the connect cookie before we call send_request() is
>>> to avoid races with disconnects that happen immediately after we sent
>>> the data (but before we get round to setting the cookie).
>>> 
>>> 8<--------------------------------------------------------
>>> From 326c12727b179d5c2fd42f374a4c0e2d56ee90c5 Mon Sep 17 00:00:00 2001
>>> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>>> Date: Thu, 14 Dec 2017 07:05:27 -0500
>>> Subject: [PATCH] fixup! SUNRPC: Fix a race in the receive code path
>>> 
>>> ---
>>>  net/sunrpc/xprt.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>> index 5e4278e9ce37..2cf4cf21a878 100644
>>> --- a/net/sunrpc/xprt.c
>>> +++ b/net/sunrpc/xprt.c
>>> @@ -1021,10 +1021,11 @@ void xprt_transmit(struct rpc_task *task)
>>>                          /* Turn off autodisconnect */
>>>                          del_singleshot_timer_sync(&xprt->timer);
>>>                  }
>>> +                if (!req->rq_bytes_sent)
>>> +                        req->rq_connect_cookie = xprt->connect_cookie;
>>>          } else if (!req->rq_bytes_sent)
>>>                  return;
>>> 
>>> -        req->rq_connect_cookie = xprt->connect_cookie;
>>>          req->rq_xtime = ktime_get();
>>>          status = xprt->ops->send_request(task);
>>>          trace_xprt_transmit(xprt, req->rq_xid, status);
>>> --
>>> 2.14.3
>> 
>> Tested with a client that returns -ENOBUFS from ->send_request
>> with a little extra frequency.
>> 
>> This doesn't appear to address the problem. The transport
>> deadlocks as before when send_request() fails to send the RPC.
> 
> Does the RDMA code update the connect cookie when the connection
> breaks? It looks to me as if it only does that when the connection is
> re-established. We really want both.
> 
>> I imagine this issue could also impact write buffer exhaustion
>> on TCP.
> 
> See net/sunrpc/xprtsock.c:xs_tcp_state_change()

xprtrdma manipulates the connect_cookie in its connect worker; see
rpcrdma_connect_worker. This was added by:

commit 575448bd36208f99fe0dd554a43518d798966740
Author:     Tom Talpey <talpey@xxxxxxxxxx>
AuthorDate: Thu Oct 9 15:00:40 2008 -0400
Commit:     Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
CommitDate: Fri Oct 10 15:10:36 2008 -0400

    RPC/RDMA: suppress retransmit on RPC/RDMA clients.

Would it be more correct to bump the cookie in rpcrdma_conn_upcall,
which is the equivalent of xs_tcp_state_change? (And if so, why, so
that I can compose a reasonable patch description.)

It could be bumped in the RDMA_CM_EVENT_ESTABLISHED and the
RDMA_CM_EVENT_DISCONNECTED cases, for example. I'm not sure RDMA
provides a distinction between "server disconnected" and "client
disconnected", although that probably does not matter for this
purpose.

But why would the additional cookie update help? The transport
is not disconnecting before the deadlock.

--
Chuck Lever
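
For reference, a minimal sketch of the change being discussed above:
bumping the connect cookie from the RDMA connection manager event
handler on both connection establishment and teardown, roughly as
xs_tcp_state_change() does for TCP sockets. The helper name below is
invented for illustration, the call sites elide the existing
rpcrdma_conn_upcall() handling, and the reserved-zero check simply
mirrors what rpcrdma_connect_worker() already does under the
transport_lock. This is a sketch under those assumptions, not a
proposed or in-tree patch.

/* Hypothetical helper, not in-tree code. */
static void
rpcrdma_bump_connect_cookie(struct rpc_xprt *xprt)
{
        spin_lock_bh(&xprt->transport_lock);
        /* Skip the reserved "never connected" cookie value, as
         * rpcrdma_connect_worker() does today.
         */
        if (++xprt->connect_cookie == 0)
                ++xprt->connect_cookie;
        spin_unlock_bh(&xprt->transport_lock);
}

/* In rpcrdma_conn_upcall(), where r_xprt stands for the rpcrdma_xprt
 * associated with the rdma_cm_id, the bump could then cover both
 * directions of the connection state change:
 */
        case RDMA_CM_EVENT_ESTABLISHED:
                rpcrdma_bump_connect_cookie(&r_xprt->rx_xprt);
                /* ... existing connection-established handling ... */
                break;
        case RDMA_CM_EVENT_DISCONNECTED:
                rpcrdma_bump_connect_cookie(&r_xprt->rx_xprt);
                /* ... existing disconnect handling ... */
                break;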