Re: [PATCH RFC] NLM: GRANTED_MSG fails after client remount

Chuck Lever <chuck.lever@xxxxxxxxxx> · Wed, 9 Aug 2017 18:05:57 -0400

Try to get Manjunath's e-mail address right.

> On Aug 9, 2017, at 6:04 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> 
> If I repeatedly run the cthon04 lock test on NFSv3, at some point
> the Linux NFS server reports "lockd: server not responding". The NFS
> server is sending a GRANTED_MSG request via TCP, but the NFS client
> lockd has restarted and changed ports. The correct recovery is for
> the NFS server to rebind and reconnect to the new client port, but
> the server never rebinds, and the request times out and fails.
> 
> The underlying problem is that the RPC client on the NFS server is
> attempting to reconnect in a loop, and does not return control to
> the NLM layer until the request times out. There is never a chance
> for the NLM layer to force a rebind until the request has failed.
> 
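> To address this, set the RPC_TASK_SOFTCONN flag when sending async
> NLM requests. The request then fails immediately if it cannot
> connect, and the NLM layer can force a rebind and retry the request.
> For TCP, nlm_rebind_host() now forces the rebind regardless of
> h_nextrebind, and the retransmit delay after a failed grant callback
> drops from 10 to 5 seconds.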
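
For anyone following along, here is a rough userspace model of the fail-fast, rebind, retry sequence described above. It is only a sketch: model_peer, model_send_softconn, model_force_rebind and model_grant_with_retry are made-up names for illustration, not kernel or SUNRPC interfaces, and "rebind" is reduced to copying a port number.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_peer {
	int bound_port;		/* port the sender currently believes the peer uses */
	int actual_port;	/* port the peer really listens on after its restart */
};

/* Stand-in for a SOFTCONN-style send: one connect attempt, no internal retry loop. */
static int model_send_softconn(const struct model_peer *p)
{
	return p->bound_port == p->actual_port ? 0 : -1;
}

/* Stand-in for a forced rebind: learn the peer's current port again. */
static void model_force_rebind(struct model_peer *p)
{
	p->bound_port = p->actual_port;
}

/*
 * Because each send fails fast, control comes back to the caller, which
 * can rebind before the next attempt.
 * Returns the number of attempts used, or -1 if max_tries is exhausted.
 */
static int model_grant_with_retry(struct model_peer *p, int max_tries)
{
	for (int attempt = 1; attempt <= max_tries; attempt++) {
		if (model_send_softconn(p) == 0)
			return attempt;
		model_force_rebind(p);
	}
	return -1;
}
```

With a stale port the first attempt fails, the rebind corrects the port, and the second attempt succeeds. Without the fail-fast step the caller in this model would never reach model_force_rebind().
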
> 
> Sidebar: Using SOFTCONN could be reasonable for the NFS client side
> as well.
> 
> Sidebar: Is h_nextrebind appropriate for NLM on TCP?
> 
> Sidebar: If XPRT_BOUND is cleared while the RPC client is
> connecting, the connect can fail in xs_tcp_finish_connecting:
> 
> 	if (!xprt_bound(xprt))
> 		goto out;
> 
> There's no recovery in xs_tcp_setup_socket for this case, so an
> error message is logged, and connection setup fails. The error
> message is noise. Does the RPC client try to connect again in this
> case?
> 
> BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=311
> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> ---
> fs/lockd/clntproc.c |    2 +-
> fs/lockd/host.c     |    7 ++++++-
> fs/lockd/svclock.c  |    2 +-
> 3 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/lockd/clntproc.c b/fs/lockd/clntproc.c
> index 066ac31..5806e1a 100644
> --- a/fs/lockd/clntproc.c
> +++ b/fs/lockd/clntproc.c
> @@ -342,7 +342,7 @@ static struct rpc_task *__nlm_async_call(struct nlm_rqst *req, u32 proc, struct
> 		.rpc_message = msg,
> 		.callback_ops = tk_ops,
> 		.callback_data = req,
> -		.flags = RPC_TASK_ASYNC,
> +		.flags = RPC_TASK_ASYNC | RPC_TASK_SOFTCONN,
> 	};
> 
> 	dprintk("lockd: call procedure %d on %s (async)\n",
> diff --git a/fs/lockd/host.c b/fs/lockd/host.c
> index d716c99..be0f847 100644
> --- a/fs/lockd/host.c
> +++ b/fs/lockd/host.c
> @@ -490,7 +490,12 @@ struct rpc_clnt *
> nlm_rebind_host(struct nlm_host *host)
> {
> 	dprintk("lockd: rebind host %s\n", host->h_name);
> -	if (host->h_rpcclnt && time_after_eq(jiffies, host->h_nextrebind)) {
> +
> +	if (!host->h_rpcclnt)
> +		return;
> +
> +	if (time_after_eq(jiffies, host->h_nextrebind) ||
> +	    host->h_proto == IPPROTO_TCP) {
> 		rpc_force_rebind(host->h_rpcclnt);
> 		host->h_nextrebind = jiffies + NLM_HOST_REBIND;
> 	}
> diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> index 3507c80..2f64a6b 100644
> --- a/fs/lockd/svclock.c
> +++ b/fs/lockd/svclock.c
> @@ -830,7 +830,7 @@ static void nlmsvc_grant_callback(struct rpc_task *task, void *data)
> 	 * can be done, though. */
> 	if (task->tk_status < 0) {
> 		/* RPC error: Re-insert for retransmission */
> -		timeout = 10 * HZ;
> +		timeout = 5 * HZ;
> 	} else {
> 		/* Call was successful, now wait for client callback */
> 		timeout = 60 * HZ;
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
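
To make the new rebind condition in the host.c hunk easier to reason about, here is a small standalone model of it. This is a sketch under assumptions: model_host, model_time_after_eq and model_rebind_host are hypothetical names, the jiffies value is passed in as a plain argument, and the actual rpc_force_rebind() call is reduced to a boolean return value.

```c
#include <stdbool.h>

#define MODEL_PROTO_TCP	6	/* same numeric value as IPPROTO_TCP */
#define MODEL_PROTO_UDP	17	/* same numeric value as IPPROTO_UDP */

/* Wrap-safe "a is at or after b", in the spirit of the kernel's time_after_eq(). */
static bool model_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Hypothetical reduction of the nlm_host fields the patched function reads. */
struct model_host {
	bool has_rpcclnt;		/* models host->h_rpcclnt != NULL */
	int proto;			/* models host->h_proto */
	unsigned long nextrebind;	/* models host->h_nextrebind */
};

/*
 * Returns true if a rebind would be forced. When it is, the rate-limit
 * deadline is pushed forward by 'interval' ticks.
 */
static bool model_rebind_host(struct model_host *h, unsigned long now,
			      unsigned long interval)
{
	if (!h->has_rpcclnt)
		return false;

	if (model_time_after_eq(now, h->nextrebind) ||
	    h->proto == MODEL_PROTO_TCP) {
		h->nextrebind = now + interval;
		return true;
	}
	return false;
}
```

In this model a UDP host is still rate-limited by nextrebind, while a TCP host rebinds on every call, so the deadline no longer has any effect for TCP.
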

--
Chuck Lever
