If I repeatedly run the cthon04 lock test on NFSv3, at some point the
Linux NFS server reports "lockd: server not responding".

The NFS server is sending a GRANTED_MSG request via TCP, but the NFS
client's lockd has restarted and changed ports. The correct recovery is
for the NFS server to rebind and reconnect to the new client port, but
the server never rebinds, so the request times out and fails.

The underlying problem is that the RPC client on the NFS server
attempts to reconnect in a loop, and does not return control to the NLM
layer until the request times out. The NLM layer never gets a chance to
force a rebind until the request has already failed.

To address this, set the RPC_TASK_SOFTCONN flag when sending async NLM
requests. With this flag, the request fails immediately if it cannot
connect, and the NLM code can force a rebind and then retry the
request.

Sidebar: Using SOFTCONN could be reasonable for the NFS client side as
well.

Sidebar: Is h_nextrebind appropriate for NLM on TCP?

Sidebar: If XPRT_BOUND is cleared while the RPC client is connecting,
the connect can fail in xs_tcp_finish_connecting:

	if (!xprt_bound(xprt))
		goto out;

There's no recovery in xs_tcp_setup_socket for this case, so an error
message is logged and connection setup fails. The error is noise. Does
the RPC client try to connect again in this case?
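For readers who want to see the recovery sequence in isolation, here is
a rough user-space sketch of the idea described above: a send that
fails fast when the connect fails, followed by a forced rebind and a
retry. It is not kernel code. Every name in it (sim_host,
sim_send_softconn, sim_rebind, sim_grant, SIM_REBIND_DELAY, the port
numbers) is made up for illustration, and "ticks" stand in for jiffies.
The non-TCP host exists only to exercise the h_nextrebind-style
throttle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; none of these are kernel APIs. */
#define SIM_REBIND_DELAY 60     /* ticks between throttled rebinds (made up) */
#define SIM_ECONNREFUSED (-111) /* illustrative "connect failed" status */

struct sim_host {
	int  bound_port;  /* port we believe the peer lockd uses; 0 = unbound */
	bool is_tcp;      /* models host->h_proto == IPPROTO_TCP */
	long next_rebind; /* models host->h_nextrebind, in ticks */
};

/* Port the simulated peer lockd is really listening on. */
static int sim_client_port = 4045;

/* Models a soft-connect send: report failure at once, do not loop. */
static int sim_send_softconn(struct sim_host *h)
{
	if (h->bound_port == 0)
		h->bound_port = sim_client_port; /* fresh "portmap" lookup */
	if (h->bound_port != sim_client_port)
		return SIM_ECONNREFUSED;
	return 0;
}

/* Models the patched rebind test: TCP bypasses the time throttle. */
static bool sim_rebind(struct sim_host *h, long now)
{
	if (now >= h->next_rebind || h->is_tcp) {
		h->bound_port = 0; /* force a new lookup on the next send */
		h->next_rebind = now + SIM_REBIND_DELAY;
		return true;
	}
	return false;
}

/* Returns the attempt number that succeeded, or -1 if all attempts failed. */
static int sim_grant(struct sim_host *h, long now, int max_tries)
{
	assert(max_tries > 0);
	for (int attempt = 1; attempt <= max_tries; attempt++) {
		if (sim_send_softconn(h) == 0)
			return attempt;
		sim_rebind(h, now); /* callback-side recovery before the retry */
	}
	return -1;
}
```

In this model, if sim_client_port changes after a TCP host was bound to
the old port, sim_grant() fails once, rebinds, and succeeds on the
second attempt. A non-TCP host whose next_rebind is still in the future
keeps failing until the throttle expires, which is the behavior the
h_nextrebind sidebar is asking about.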

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=311
Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
---
 fs/lockd/clntproc.c |    2 +-
 fs/lockd/host.c     |    7 ++++++-
 fs/lockd/svclock.c  |    2 +-
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/lockd/clntproc.c b/fs/lockd/clntproc.c
index 066ac31..5806e1a 100644
--- a/fs/lockd/clntproc.c
+++ b/fs/lockd/clntproc.c
@@ -342,7 +342,7 @@ static struct rpc_task *__nlm_async_call(struct nlm_rqst *req, u32 proc, struct
 		.rpc_message = msg,
 		.callback_ops = tk_ops,
 		.callback_data = req,
-		.flags = RPC_TASK_ASYNC,
+		.flags = RPC_TASK_ASYNC | RPC_TASK_SOFTCONN,
 	};
 
 	dprintk("lockd: call procedure %d on %s (async)\n",
diff --git a/fs/lockd/host.c b/fs/lockd/host.c
index d716c99..be0f847 100644
--- a/fs/lockd/host.c
+++ b/fs/lockd/host.c
@@ -490,7 +490,12 @@ struct rpc_clnt *
 nlm_rebind_host(struct nlm_host *host)
 {
 	dprintk("lockd: rebind host %s\n", host->h_name);
-	if (host->h_rpcclnt && time_after_eq(jiffies, host->h_nextrebind)) {
+
+	if (!host->h_rpcclnt)
+		return;
+
+	if (time_after_eq(jiffies, host->h_nextrebind) ||
+	    host->h_proto == IPPROTO_TCP) {
 		rpc_force_rebind(host->h_rpcclnt);
 		host->h_nextrebind = jiffies + NLM_HOST_REBIND;
 	}
diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index 3507c80..2f64a6b 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -830,7 +830,7 @@ static void nlmsvc_grant_callback(struct rpc_task *task, void *data)
 	 * can be done, though. */
 	if (task->tk_status < 0) {
 		/* RPC error: Re-insert for retransmission */
-		timeout = 10 * HZ;
+		timeout = 5 * HZ;
 	} else {
 		/* Call was successful, now wait for client callback */
 		timeout = 60 * HZ;