On Fri, 2017-01-13 at 10:13 -0500, Chuck Lever wrote:
> > On Jan 12, 2017, at 5:15 PM, Trond Myklebust <trondmy@primarydata.com> wrote:
> > 
> > > On Jan 12, 2017, at 12:42, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> > > 
> > > > On Jan 12, 2017, at 12:38 PM, Trond Myklebust <trondmy@primarydata.com> wrote:
> > > > 
> > > > On Fri, 2016-12-16 at 11:48 -0500, Chuck Lever wrote:
> > > > > Current NFS clients rely on connection loss to determine when to
> > > > > retransmit. In particular, for protocols like NFSv4, clients no
> > > > > longer rely on RPC timeouts to drive retransmission: NFSv4 servers
> > > > > are required to terminate a connection when they need a client to
> > > > > retransmit pending RPCs.
> > > > > 
> > > > > When a server is no longer reachable, either because it has crashed
> > > > > or because the network path has broken, the server cannot actively
> > > > > terminate a connection. Thus NFS clients depend on transport-level
> > > > > keepalive to determine when a connection must be replaced and
> > > > > pending RPCs retransmitted.
> > > > > 
> > > > > However, RDMA RC connections do not have a native keepalive
> > > > > mechanism. If an NFS/RDMA server crashes after a client has sent
> > > > > RPCs successfully (an RC ACK has been received for all OTW RDMA
> > > > > requests), there is no way for the client to know the connection is
> > > > > moribund.
> > > > > 
> > > > > In addition, new RDMA requests are subject to the RPC-over-RDMA
> > > > > credit limit. If the client has consumed all granted credits with
> > > > > NFS traffic, it is not allowed to send another RDMA request until
> > > > > the server replies. Thus it has no way to send a true keepalive
> > > > > when the workload has already consumed all credits with pending
> > > > > RPCs.
> > > > > 
> > > > > To address this, we reserve one RPC-over-RDMA credit that may be
> > > > > used only for an NFS NULL. A periodic RPC ping is done on
> > > > > transports whenever there are outstanding RPCs.
> > > > > 
> > > > > The purpose of this ping is to drive traffic regularly on each
> > > > > connection to force the transport layer to disconnect it if it is
> > > > > no longer viable. Some RDMA operations are fully offloaded to the
> > > > > HCA, and can be successful even if the remote host has crashed.
> > > > > Thus an operation that requires that the server is responsive is
> > > > > used for the ping.
> > > > > 
> > > > > This implementation re-uses existing generic RPC infrastructure to
> > > > > form each NULL Call. An rpc_clnt context must be available to
> > > > > start an RPC. Thus a generic keepalive mechanism is introduced so
> > > > > that both an rpc_clnt and an rpc_xprt are available to perform the
> > > > > ping.
> > > > > 
> > > > > Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> > > > > ---
> > > > > 
> > > > > Before sending this for internal testing, I'd like to hear comments
> > > > > on this approach. It's a little more churn than I had hoped for.
> > > > > 
> > > > >  fs/nfs/nfs4client.c             |    1 
> > > > >  include/linux/sunrpc/clnt.h     |    2 +
> > > > >  include/linux/sunrpc/sched.h    |    3 +
> > > > >  include/linux/sunrpc/xprt.h     |    1 
> > > > >  net/sunrpc/clnt.c               |  101 +++++++++++++++++++++++++++++++++++++++
> > > > >  net/sunrpc/sched.c              |   19 +++++++
> > > > >  net/sunrpc/xprt.c               |    5 ++
> > > > >  net/sunrpc/xprtrdma/rpc_rdma.c  |    4 +-
> > > > >  net/sunrpc/xprtrdma/transport.c |   13 +++++
> > > > >  9 files changed, 148 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
> > > > > index 074ac71..c5f5ce8 100644
> > > > > --- a/fs/nfs/nfs4client.c
> > > > > +++ b/fs/nfs/nfs4client.c
> > > > > @@ -378,6 +378,7 @@ struct nfs_client *nfs4_init_client(struct nfs_client *clp,
> > > > >  	error = nfs_create_rpc_client(clp, cl_init, RPC_AUTH_UNIX);
> > > > >  	if (error < 0)
> > > > >  		goto error;
> > > > > +	rpc_schedule_keepalive(clp->cl_rpcclient);
> > > > 
> > > > Why do we want to enable this for non-RDMA transports? Shouldn't this
> > > > functionality be hidden in the RDMA client code, in the same way that
> > > > the TCP keepalive is hidden in the socket code?
> > > 
> > > Sending a NULL request by re-using the normal RPC infrastructure
> > > requires a struct rpc_clnt. Thus it has to be driven by an upper
> > > layer context.
> > > 
> > > I'm open to suggestions.
> > 
> > Ideally we just want this to operate when there are outstanding RPC
> > calls waiting for a reply, am I correct?
> > 
> > If so, perhaps we might have it triggered by a timer that is armed in
> > xprt->ops->send_request() and disarmed in xprt->ops->release_xprt()?
> > It might then configure itself by looking in the xprt->recv list to
> > find a hanging rpc_task and steal its rpc_client info.
> 
> Perhaps, but I was hoping to find a solution that did not add more
> overhead (arming and disarming another timer) to the send_request
> path.
> 
> __mod_timer can do an irqsave spinlock in some cases, for example.
> This impacts all I/O on all transports to handle a case that will
> be very rare.
> 
> We could mitigate the timer flapping by arming when xprt_transmit
> finds the recv list empty before adding, and disarming when
> xprt_lookup_rqst empties the recv list.

Alternatively, how about just putting the trigger in xprt_timer (i.e. in
the xprt->ops->timer() callback)? That requires no new timers, and it
solves the problem of which rpc_clnt to use.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx
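
For readers following along, below is a minimal sketch of the kind of
generic keepalive the cover letter describes: a delayed work item owned
by the rpc_clnt that periodically fires an asynchronous NULL call, so
the transport layer has to prove the peer is still responsive and can
disconnect a dead connection. Only the rpc_schedule_keepalive() name
appears in the posted diff; the cl_keepalive_work field, the 60-second
interval, and the body of the functions are illustrative assumptions,
not the actual implementation.

	/*
	 * Illustrative sketch only -- not the posted patch. Assumes a
	 * cl_keepalive_work delayed_work member has been added to
	 * struct rpc_clnt, and picks an arbitrary 60 second interval.
	 */
	#include <linux/workqueue.h>
	#include <linux/sunrpc/clnt.h>
	#include <linux/sunrpc/sched.h>

	#define RPC_KEEPALIVE_INTERVAL	(60 * HZ)	/* assumed interval */

	static void rpc_keepalive_worker(struct work_struct *work)
	{
		struct delayed_work *dwork = to_delayed_work(work);
		struct rpc_clnt *clnt = container_of(dwork, struct rpc_clnt,
						     cl_keepalive_work);
		struct rpc_task *task;

		/*
		 * Fire an async NULL ping. SOFT/SOFTCONN keep the ping
		 * itself from retrying forever; the point is only to force
		 * traffic so the transport notices a dead peer. A real
		 * implementation would also skip the ping when there are
		 * no outstanding RPCs on the transport.
		 */
		task = rpc_call_null(clnt, NULL,
				     RPC_TASK_ASYNC | RPC_TASK_SOFT |
				     RPC_TASK_SOFTCONN);
		if (!IS_ERR(task))
			rpc_put_task(task);

		schedule_delayed_work(&clnt->cl_keepalive_work,
				      RPC_KEEPALIVE_INTERVAL);
	}

	void rpc_schedule_keepalive(struct rpc_clnt *clnt)
	{
		INIT_DELAYED_WORK(&clnt->cl_keepalive_work, rpc_keepalive_worker);
		schedule_delayed_work(&clnt->cl_keepalive_work,
				      RPC_KEEPALIVE_INTERVAL);
	}

The alternative discussed at the end of the thread would drop the extra
work item entirely and piggy-back on the existing xprt->ops->timer()
callback, which already fires for requests that have been waiting too
long on the xprt->recv list, so no new timer needs to be armed in the
send path.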