> On Jan 12, 2017, at 5:15 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
>>
>> On Jan 12, 2017, at 12:42, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>
>>
>>> On Jan 12, 2017, at 12:38 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>
>>> On Fri, 2016-12-16 at 11:48 -0500, Chuck Lever wrote:
>>>> Current NFS clients rely on connection loss to determine when to
>>>> retransmit. In particular, for protocols like NFSv4, clients no
>>>> longer rely on RPC timeouts to drive retransmission: NFSv4 servers
>>>> are required to terminate a connection when they need a client to
>>>> retransmit pending RPCs.
>>>>
>>>> When a server is no longer reachable, either because it has crashed
>>>> or because the network path has broken, the server cannot actively
>>>> terminate a connection. Thus NFS clients depend on transport-level
>>>> keepalive to determine when a connection must be replaced and
>>>> pending RPCs retransmitted.
>>>>
>>>> However, RDMA RC connections do not have a native keepalive
>>>> mechanism. If an NFS/RDMA server crashes after a client has sent
>>>> RPCs successfully (an RC ACK has been received for all OTW RDMA
>>>> requests), there is no way for the client to know the connection is
>>>> moribund.
>>>>
>>>> In addition, new RDMA requests are subject to the RPC-over-RDMA
>>>> credit limit. If the client has consumed all granted credits with
>>>> NFS traffic, it is not allowed to send another RDMA request until
>>>> the server replies. Thus it has no way to send a true keepalive when
>>>> the workload has already consumed all credits with pending RPCs.
>>>>
>>>> To address this, we reserve one RPC-over-RDMA credit that may be
>>>> used only for an NFS NULL. A periodic RPC ping is done on transports
>>>> whenever there are outstanding RPCs.
>>>>
>>>> The purpose of this ping is to drive traffic regularly on each
>>>> connection to force the transport layer to disconnect it if it is no
>>>> longer viable.
>>>> Some RDMA operations are fully offloaded to the HCA,
>>>> and can be successful even if the remote host has crashed. Thus an
>>>> operation that requires that the server is responsive is used for
>>>> the ping.
>>>>
>>>> This implementation re-uses existing generic RPC infrastructure to
>>>> form each NULL Call. An rpc_clnt context must be available to start
>>>> an RPC. Thus a generic keepalive mechanism is introduced so that
>>>> both an rpc_clnt and an rpc_xprt are available to perform the ping.
>>>>
>>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>> ---
>>>>
>>>> Before sending this for internal testing, I'd like to hear comments
>>>> on this approach. It's a little more churn than I had hoped for.
>>>>
>>>>
>>>>  fs/nfs/nfs4client.c             |    1
>>>>  include/linux/sunrpc/clnt.h     |    2 +
>>>>  include/linux/sunrpc/sched.h    |    3 +
>>>>  include/linux/sunrpc/xprt.h     |    1
>>>>  net/sunrpc/clnt.c               |  101 +++++++++++++++++++++++++++++++++++++++
>>>>  net/sunrpc/sched.c              |   19 +++++++
>>>>  net/sunrpc/xprt.c               |    5 ++
>>>>  net/sunrpc/xprtrdma/rpc_rdma.c  |    4 +-
>>>>  net/sunrpc/xprtrdma/transport.c |   13 +++++
>>>>  9 files changed, 148 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
>>>> index 074ac71..c5f5ce8 100644
>>>> --- a/fs/nfs/nfs4client.c
>>>> +++ b/fs/nfs/nfs4client.c
>>>> @@ -378,6 +378,7 @@ struct nfs_client *nfs4_init_client(struct nfs_client *clp,
>>>>  		error = nfs_create_rpc_client(clp, cl_init, RPC_AUTH_UNIX);
>>>>  		if (error < 0)
>>>>  			goto error;
>>>> +		rpc_schedule_keepalive(clp->cl_rpcclient);
>>>
>>> Why do we want to enable this for non-RDMA transports? Shouldn't this
>>> functionality be hidden in the RDMA client code, in the same way that
>>> the TCP keepalive is hidden in the socket code?
>>
>> Sending a NULL request by re-using the normal RPC infrastructure
>> requires a struct rpc_clnt. Thus it has to be driven by an upper
>> layer context.
>>
>> I'm open to suggestions.
>>
>
> Ideally we just want this to operate when there are outstanding RPC calls waiting for a reply, am I correct?
>
> If so, perhaps we might have it triggered by a timer that is armed in xprt->ops->send_request() and disarmed in xprt->ops->release_xprt()? It might then configure itself by looking in the xprt->recv list to find a hanging rpc_task and steal its rpc_client info.

Perhaps, but I was hoping to find a solution that did not add
more overhead (arming and disarming another timer) to the
send_request path. __mod_timer can do an irqsave spinlock in
some cases, for example. This impacts all I/O on all transports
to handle a case that will be very rare.

We could mitigate the timer flapping by arming when xprt_transmit
finds the recv list empty before adding, and when xprt_lookup_rqst
empties the recv list.

> Cheers
>  Trond
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Chuck Lever
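For illustration, the mitigation suggested in the reply above (touch the timer only when the recv list changes between empty and non-empty) might look roughly like the user-space sketch below. It is not kernel code and not part of the posted patch. struct sketch_xprt and every sketch_* function are hypothetical names made up for this example, the recv list is modeled as a plain counter, and the second condition (the recv list becoming empty) is assumed to be where the timer gets disarmed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a transport; NOT the kernel's struct rpc_xprt. */
struct sketch_xprt {
	unsigned int pending;    /* models the length of the recv list */
	bool keepalive_armed;    /* models whether the keepalive timer is pending */
	unsigned int timer_ops;  /* counts arm/disarm calls (the overhead at issue) */
};

/* Models arming the timer; a real transport would call a timer API here. */
static void sketch_arm(struct sketch_xprt *x)
{
	x->keepalive_armed = true;
	x->timer_ops++;
}

/* Models disarming the timer. */
static void sketch_disarm(struct sketch_xprt *x)
{
	x->keepalive_armed = false;
	x->timer_ops++;
}

/* Transmit path: arm only if the list was empty before this request is added. */
static void sketch_transmit(struct sketch_xprt *x)
{
	if (x->pending == 0)
		sketch_arm(x);
	x->pending++;
}

/* Reply path: disarm only when the last pending request is removed. */
static void sketch_reply(struct sketch_xprt *x)
{
	assert(x->pending > 0);
	if (--x->pending == 0)
		sketch_disarm(x);
}

/* Sends n requests back to back, completes them all, returns timer calls made. */
static unsigned int sketch_demo(unsigned int n)
{
	struct sketch_xprt x = { 0, false, 0 };
	unsigned int i;

	for (i = 0; i < n; i++)
		sketch_transmit(&x);
	for (i = 0; i < n; i++)
		sketch_reply(&x);
	return x.timer_ops;
}
```

With this model, sketch_demo(8) performs two timer operations (one arm, one disarm), where arming and disarming around every request would perform sixteen. A burst of n >= 1 overlapping requests always costs two timer operations here, so the per-request send path stays free of timer work while RPCs remain outstanding.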