On Fri, 2016-12-16 at 11:48 -0500, Chuck Lever wrote:
> Current NFS clients rely on connection loss to determine when to
> retransmit. In particular, for protocols like NFSv4, clients no
> longer rely on RPC timeouts to drive retransmission: NFSv4 servers
> are required to terminate a connection when they need a client to
> retransmit pending RPCs.
>
> When a server is no longer reachable, either because it has crashed
> or because the network path has broken, the server cannot actively
> terminate a connection. Thus NFS clients depend on transport-level
> keepalive to determine when a connection must be replaced and
> pending RPCs retransmitted.
>
> However, RDMA RC connections do not have a native keepalive
> mechanism. If an NFS/RDMA server crashes after a client has sent
> RPCs successfully (an RC ACK has been received for all OTW RDMA
> requests), there is no way for the client to know the connection is
> moribund.
>
> In addition, new RDMA requests are subject to the RPC-over-RDMA
> credit limit. If the client has consumed all granted credits with
> NFS traffic, it is not allowed to send another RDMA request until
> the server replies. Thus it has no way to send a true keepalive when
> the workload has already consumed all credits with pending RPCs.
>
> To address this, we reserve one RPC-over-RDMA credit that may be
> used only for an NFS NULL. A periodic RPC ping is done on transports
> whenever there are outstanding RPCs.
>
> The purpose of this ping is to drive traffic regularly on each
> connection to force the transport layer to disconnect it if it is no
> longer viable. Some RDMA operations are fully offloaded to the HCA,
> and can succeed even if the remote host has crashed. Thus an
> operation that requires a responsive server is used for the ping.
>
> This implementation re-uses existing generic RPC infrastructure to
> form each NULL Call. An rpc_clnt context must be available to start
> an RPC. Thus a generic keepalive mechanism is introduced so that
> both an rpc_clnt and an rpc_xprt are available to perform the ping.
>
> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> ---
>
> Before sending this for internal testing, I'd like to hear comments
> on this approach. It's a little more churn than I had hoped for.
>
>
>  fs/nfs/nfs4client.c             |    1 
>  include/linux/sunrpc/clnt.h     |    2 +
>  include/linux/sunrpc/sched.h    |    3 +
>  include/linux/sunrpc/xprt.h     |    1 
>  net/sunrpc/clnt.c               |  101 +++++++++++++++++++++++++++++++++++++++
>  net/sunrpc/sched.c              |   19 +++++++
>  net/sunrpc/xprt.c               |    5 ++
>  net/sunrpc/xprtrdma/rpc_rdma.c  |    4 +-
>  net/sunrpc/xprtrdma/transport.c |   13 +++++
>  9 files changed, 148 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
> index 074ac71..c5f5ce8 100644
> --- a/fs/nfs/nfs4client.c
> +++ b/fs/nfs/nfs4client.c
> @@ -378,6 +378,7 @@ struct nfs_client *nfs4_init_client(struct nfs_client *clp,
>  		error = nfs_create_rpc_client(clp, cl_init, RPC_AUTH_UNIX);
>  		if (error < 0)
>  			goto error;
> +		rpc_schedule_keepalive(clp->cl_rpcclient);

Why do we want to enable this for non-RDMA transports? Shouldn't this
functionality be hidden in the RDMA client code, in the same way that
the TCP keepalive is hidden in the socket code?

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx
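
[Editor's note: a minimal sketch of the generic keepalive the patch
description outlines, for readers who want to see the shape of it.
This is not the patch body: the cl_keepalive member, the 15-second
interval, and the body of rpc_schedule_keepalive() are illustrative
assumptions drawn from the cover text. Only rpc_call_null(),
rpc_put_task(), cl_tasks, and the workqueue API are real kernel
interfaces of that era.]

	#include <linux/err.h>
	#include <linux/jiffies.h>
	#include <linux/workqueue.h>
	#include <linux/sunrpc/clnt.h>
	#include <linux/sunrpc/sched.h>

	#define RPC_KEEPALIVE_INTERVAL	(15 * HZ)	/* assumed ping period */

	static void rpc_keepalive_worker(struct work_struct *work)
	{
		/* cl_keepalive is a hypothetical delayed_work added to rpc_clnt */
		struct rpc_clnt *clnt = container_of(work, struct rpc_clnt,
						     cl_keepalive.work);
		struct rpc_task *task;

		/* Ping only while RPCs are pending; an idle transport
		 * needs no probe (racy check, fine for a periodic ping). */
		if (!list_empty(&clnt->cl_tasks)) {
			/* Async NULL call; RPC_TASK_SOFT lets it time out
			 * rather than retry forever against a dead server. */
			task = rpc_call_null(clnt, NULL,
					     RPC_TASK_ASYNC | RPC_TASK_SOFT);
			if (!IS_ERR(task))
				rpc_put_task(task);
		}
		schedule_delayed_work(&clnt->cl_keepalive, RPC_KEEPALIVE_INTERVAL);
	}

	void rpc_schedule_keepalive(struct rpc_clnt *clnt)
	{
		INIT_DELAYED_WORK(&clnt->cl_keepalive, rpc_keepalive_worker);
		schedule_delayed_work(&clnt->cl_keepalive, RPC_KEEPALIVE_INTERVAL);
	}

The async NULL is the "operation that requires a responsive server":
unlike offloaded RDMA Reads/Writes, it needs the server's RPC layer to
reply, so a crashed server shows up as a timeout and a disconnect.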
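[Editor's note: for contrast, the socket-level keepalive Trond points
to lives entirely below the generic RPC layer. Roughly, using the
era-appropriate kernel_setsockopt() (later kernels replaced it with
sock_set_keepalive() and friends); the function name and tunable
values here are illustrative, while xprtsock derives the real values
from the transport's timeout parameters:]

	#include <linux/net.h>
	#include <net/sock.h>
	#include <net/tcp.h>

	static void xs_enable_tcp_keepalive(struct socket *sock)
	{
		int opt_on = 1;
		int idle = 60, intvl = 60, cnt = 5;	/* assumed tunables */

		kernel_setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE,
				  (char *)&opt_on, sizeof(opt_on));
		kernel_setsockopt(sock, SOL_TCP, TCP_KEEPIDLE,
				  (char *)&idle, sizeof(idle));
		kernel_setsockopt(sock, SOL_TCP, TCP_KEEPINTVL,
				  (char *)&intvl, sizeof(intvl));
		kernel_setsockopt(sock, SOL_TCP, TCP_KEEPCNT,
				  (char *)&cnt, sizeof(cnt));
	}

Because TCP probes are generated by the network stack itself, no
rpc_clnt involvement is needed there, which is the asymmetry behind
the question of whether the RDMA ping belongs in xprtrdma rather than
in the generic client.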