> On Dec 13, 2017, at 11:14 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> 
> Hi Trond-
> 
>> On Dec 3, 2017, at 7:17 PM, Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> wrote:
>> 
>> We must ensure that the call to rpc_sleep_on() in xprt_transmit() cannot
>> race with the call to xprt_complete_rqst().
>> 
>> Reported-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=317
>> Fixes: ce7c252a8c74 ("SUNRPC: Add a separate spinlock to protect..")
>> Cc: stable@xxxxxxxxxxxxxxx # 4.14+
>> Reviewed-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>> Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>> ---
>>  net/sunrpc/xprt.c | 26 +++++++++++++++++---------
>>  1 file changed, 17 insertions(+), 9 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>> index 333b9d697ae5..5e4278e9ce37 100644
>> --- a/net/sunrpc/xprt.c
>> +++ b/net/sunrpc/xprt.c
>> @@ -1024,6 +1024,7 @@ void xprt_transmit(struct rpc_task *task)
>>  	} else if (!req->rq_bytes_sent)
>>  		return;
>> 
>> +	req->rq_connect_cookie = xprt->connect_cookie;
>>  	req->rq_xtime = ktime_get();
>>  	status = xprt->ops->send_request(task);
>>  	trace_xprt_transmit(xprt, req->rq_xid, status);
>> @@ -1047,20 +1048,27 @@ void xprt_transmit(struct rpc_task *task)
>>  	xprt->stat.bklog_u += xprt->backlog.qlen;
>>  	xprt->stat.sending_u += xprt->sending.qlen;
>>  	xprt->stat.pending_u += xprt->pending.qlen;
>> +	spin_unlock_bh(&xprt->transport_lock);
>> 
>> -	/* Don't race with disconnect */
>> -	if (!xprt_connected(xprt))
>> -		task->tk_status = -ENOTCONN;
>> -	else {
>> +	if (rpc_reply_expected(task) && !READ_ONCE(req->rq_reply_bytes_recvd)) {
>>  		/*
>> -		 * Sleep on the pending queue since
>> -		 * we're expecting a reply.
>> +		 * Sleep on the pending queue if we're expecting a reply.
>> +		 * The spinlock ensures atomicity between the test of
>> +		 * req->rq_reply_bytes_recvd, and the call to rpc_sleep_on().
>>  		 */
>> -		if (!req->rq_reply_bytes_recvd && rpc_reply_expected(task))
>> +		spin_lock(&xprt->recv_lock);
>> +		if (!req->rq_reply_bytes_recvd) {
>>  			rpc_sleep_on(&xprt->pending, task, xprt_timer);
>> -		req->rq_connect_cookie = xprt->connect_cookie;
>> +			/*
>> +			 * Send an extra queue wakeup call if the
>> +			 * connection was dropped in case the call to
>> +			 * rpc_sleep_on() raced.
>> +			 */
>> +			if (!xprt_connected(xprt))
>> +				xprt_wake_pending_tasks(xprt, -ENOTCONN);
>> +		}
>> +		spin_unlock(&xprt->recv_lock);
>>  	}
>> -	spin_unlock_bh(&xprt->transport_lock);
>>  }
>> 
>>  static void xprt_add_backlog(struct rpc_xprt *xprt, struct rpc_task *task)
>> -- 
>> 2.14.3
> 
> I've run into a problem with this version of the patch (on v4.15-rc3).
> 
> With dbench on NFSv4.0 (on RDMA) I no longer see constant throughput
> and a temporarily climbing latency followed by a recovery, but this
> can happen on occasion:
> 
> releasing clients
>   16       584   322.55 MB/sec  warmup   1 sec  latency 807.121 ms
>   16       584   161.28 MB/sec  warmup   2 sec  latency 1807.175 ms
>   16       584   107.52 MB/sec  warmup   3 sec  latency 2807.198 ms
>   16       584    80.64 MB/sec  warmup   4 sec  latency 3807.217 ms
>   16       584    64.51 MB/sec  warmup   5 sec  latency 4807.235 ms
>   16       584    53.76 MB/sec  warmup   6 sec  latency 5807.251 ms
>   16       584    46.08 MB/sec  warmup   7 sec  latency 6807.269 ms
>   16       584    40.32 MB/sec  warmup   8 sec  latency 7807.285 ms
>   16       584    35.84 MB/sec  warmup   9 sec  latency 8807.303 ms
>   16       584    32.26 MB/sec  warmup  10 sec  latency 9807.318 ms
>   16       584    29.32 MB/sec  warmup  11 sec  latency 10807.334 ms
>   16       584    26.88 MB/sec  warmup  12 sec  latency 11807.355 ms
>   16       584    24.81 MB/sec  warmup  13 sec  latency 12807.372 ms
>   16       584    23.04 MB/sec  warmup  14 sec  latency 13807.387 ms
>   16       584    21.50 MB/sec  warmup  15 sec  latency 14807.406 ms
>   16       584    20.16 MB/sec  warmup  16 sec  latency 15807.423 ms
> 
> No recovery. The latency number keeps climbing, and throughput drops.
> 
> "kernel: nfs: server klimt-ib not responding, still trying"
> appears in the client's /var/log/messages.
> 
> Without this patch applied, the NFSv4.0 behavior is the same as I
> reported with NFSv3: every once in a while, one RPC completion is
> lost, but when the timer fires, the client notices the reply
> actually did arrive and the RPC completes normally.
> 
> I'm looking into it.

Root cause: The transport is temporarily out of resources and returns
-ENOBUFS from ->send_request. The RPC FSM schedules a delay and then
calls call_transmit again, which invokes xprt_prepare_transmit.

On NFSv3 mounts, xprt_prepare_transmit tries to acquire the transport
write lock. On NFSv4 mounts, RPC_TASK_NO_RETRANS_TIMEOUT is set. This
causes xprt_prepare_transmit to see that the transport is connected
and that rq_connect_cookie is the same as xprt->connect_cookie, so it
queues the task on ->pending and returns false. The transport never
sends the Call, but still waits for a Reply, forever.

This is not a bug in your patch; it is a pre-existing issue.

There's a simple fix, which is to set

	rq_connect_cookie = xprt->connect_cookie - 1;

when RPC-over-RDMA header marshaling fails. This is a one-line fix,
appropriate for -rc and for backport to stable (to at least v4.8).

A more robust fix might be to have ->send_request return -EAGAIN
instead, and let xprtrdma call xprt_write_space when the resource
exhaustion is relieved. That would be more involved, and is thus
better left as future work.

I'm investigating both fixes.

--
Chuck Lever
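
P.S. To make the one-liner concrete, here is a rough sketch of where
it could go in xprt_rdma_send_request(). Untested: the hunk context,
the failed_marshal label, and the rqst/xprt variable names are from
memory of the v4.15-era net/sunrpc/xprtrdma/transport.c, so treat the
exact placement as approximate.

--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ ... @@ xprt_rdma_send_request(struct rpc_task *task)
 failed_marshal:
+	/* Header marshaling failed (e.g. -ENOBUFS), so nothing was
+	 * put on the wire for this request. Invalidate the request's
+	 * connect cookie so the retried call_transmit does not match
+	 * xprt->connect_cookie in xprt_prepare_transmit() and park on
+	 * ->pending; instead it re-acquires the write lock and
+	 * actually transmits.
+	 */
+	rqst->rq_connect_cookie = xprt->connect_cookie - 1;
 	if (rc != -ENOTCONN)
 		return rc;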