On Fri, 2016-12-16 at 11:48 -0500, Chuck Lever wrote:
> Current NFS clients rely on connection loss to determine when to
> retransmit. In particular, for protocols like NFSv4, clients no
> longer rely on RPC timeouts to drive retransmission: NFSv4 servers
> are required to terminate a connection when they need a client to
> retransmit pending RPCs.
>
> When a server is no longer reachable, either because it has crashed
> or because the network path has broken, the server cannot actively
> terminate a connection. Thus NFS clients depend on transport-level
> keepalive to determine when a connection must be replaced and
> pending RPCs retransmitted.
>
> However, RDMA RC connections do not have a native keepalive
> mechanism. If an NFS/RDMA server crashes after a client has sent
> RPCs successfully (an RC ACK has been received for all OTW RDMA
> requests), there is no way for the client to know the connection is
> moribund.
>
> In addition, new RDMA requests are subject to the RPC-over-RDMA
> credit limit. If the client has consumed all granted credits with
> NFS traffic, it is not allowed to send another RDMA request until
> the server replies. Thus it has no way to send a true keepalive when
> the workload has already consumed all credits with pending RPCs.
>
> To address this, we reserve one RPC-over-RDMA credit that may be
> used only for an NFS NULL. A periodic RPC ping is done on transports
> whenever there are outstanding RPCs.
>
> The purpose of this ping is to drive traffic regularly on each
> connection to force the transport layer to disconnect it if it is no
> longer viable. Some RDMA operations are fully offloaded to the HCA,
> and can succeed even if the remote host has crashed. Thus an
> operation that requires a responsive server is used for the ping.
>
> This implementation re-uses existing generic RPC infrastructure to
> form each NULL Call. An rpc_clnt context must be available to start
> an RPC. Thus a generic keepalive mechanism is introduced so that
> both an rpc_clnt and an rpc_xprt are available to perform the ping.
>
> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> ---
>
> Before sending this for internal testing, I'd like to hear comments
> on this approach. It's a little more churn than I had hoped for.
>
>
>  fs/nfs/nfs4client.c             |    1 
>  include/linux/sunrpc/clnt.h     |    2 +
>  include/linux/sunrpc/sched.h    |    3 +
>  include/linux/sunrpc/xprt.h     |    1 
>  net/sunrpc/clnt.c               |  101 +++++++++++++++++++++++++++++++++++++++
>  net/sunrpc/sched.c              |   19 +++++++
>  net/sunrpc/xprt.c               |    5 ++
>  net/sunrpc/xprtrdma/rpc_rdma.c  |    4 +-
>  net/sunrpc/xprtrdma/transport.c |   13 +++++
>  9 files changed, 148 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
> index 074ac71..c5f5ce8 100644
> --- a/fs/nfs/nfs4client.c
> +++ b/fs/nfs/nfs4client.c
> @@ -378,6 +378,7 @@ struct nfs_client *nfs4_init_client(struct nfs_client *clp,
>  		error = nfs_create_rpc_client(clp, cl_init, RPC_AUTH_UNIX);
>  		if (error < 0)
>  			goto error;
> +		rpc_schedule_keepalive(clp->cl_rpcclient);

Why do we want to enable this for non-RDMA transports? Shouldn't this
functionality be hidden in the RDMA client code, in the same way that
the TCP keepalive is hidden in the socket code?

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx
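
[Editor's note: a minimal sketch of the generic keepalive the patch
description outlines, for readers who want to see the shape of it.
This is not the patch body: the cl_keepalive member, the 15-second
interval, and the body of rpc_schedule_keepalive() are illustrative
assumptions drawn from the cover text. Only rpc_call_null(),
rpc_put_task(), cl_tasks, and the workqueue API are real kernel
interfaces of that era.]

	#include <linux/err.h>
	#include <linux/jiffies.h>
	#include <linux/workqueue.h>
	#include <linux/sunrpc/clnt.h>
	#include <linux/sunrpc/sched.h>

	#define RPC_KEEPALIVE_INTERVAL	(15 * HZ)	/* assumed ping period */

	static void rpc_keepalive_worker(struct work_struct *work)
	{
		/* cl_keepalive is a hypothetical delayed_work added to rpc_clnt */
		struct rpc_clnt *clnt = container_of(work, struct rpc_clnt,
						     cl_keepalive.work);
		struct rpc_task *task;

		/* Ping only while RPCs are pending; an idle transport
		 * needs no probe (racy check, fine for a periodic ping). */
		if (!list_empty(&clnt->cl_tasks)) {
			/* Async NULL call; RPC_TASK_SOFT lets it time out
			 * rather than retry forever against a dead server. */
			task = rpc_call_null(clnt, NULL,
					     RPC_TASK_ASYNC | RPC_TASK_SOFT);
			if (!IS_ERR(task))
				rpc_put_task(task);
		}
		schedule_delayed_work(&clnt->cl_keepalive, RPC_KEEPALIVE_INTERVAL);
	}

	void rpc_schedule_keepalive(struct rpc_clnt *clnt)
	{
		INIT_DELAYED_WORK(&clnt->cl_keepalive, rpc_keepalive_worker);
		schedule_delayed_work(&clnt->cl_keepalive, RPC_KEEPALIVE_INTERVAL);
	}

The async NULL is the "operation that requires a responsive server":
unlike offloaded RDMA Reads/Writes, it needs the server's RPC layer to
reply, so a crashed server shows up as a timeout and a disconnect.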
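[Editor's note: for contrast, the socket-level keepalive Trond points
to lives entirely below the generic RPC layer. Roughly, using the
era-appropriate kernel_setsockopt() (later kernels replaced it with
sock_set_keepalive() and friends); the function name and tunable
values here are illustrative, while xprtsock derives the real values
from the transport's timeout parameters:]

	#include <linux/net.h>
	#include <net/sock.h>
	#include <net/tcp.h>

	static void xs_enable_tcp_keepalive(struct socket *sock)
	{
		int opt_on = 1;
		int idle = 60, intvl = 60, cnt = 5;	/* assumed tunables */

		kernel_setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE,
				  (char *)&opt_on, sizeof(opt_on));
		kernel_setsockopt(sock, SOL_TCP, TCP_KEEPIDLE,
				  (char *)&idle, sizeof(idle));
		kernel_setsockopt(sock, SOL_TCP, TCP_KEEPINTVL,
				  (char *)&intvl, sizeof(intvl));
		kernel_setsockopt(sock, SOL_TCP, TCP_KEEPCNT,
				  (char *)&cnt, sizeof(cnt));
	}

Because TCP probes are generated by the network stack itself, no
rpc_clnt involvement is needed there, which is the asymmetry behind
the question of whether the RDMA ping belongs in xprtrdma rather than
in the generic client.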