On Thu, 2021-08-19 at 13:16 +0000, Chuck Lever III wrote: > > > > On Aug 19, 2021, at 9:01 AM, Trond Myklebust > > <trondmy@xxxxxxxxxxxxxxx> wrote: > > > > On Thu, 2021-08-19 at 08:29 -0400, Chuck Lever wrote: > > > With NFSv4.1+ on RDMA, backchannel recovery appears not to work. > > > > > > xprt_setup_xxx_bc() is invoked by the client's first > > > CREATE_SESSION > > > operation, and it always marks the rpc_clnt's transport as > > > connected. > > > > > > On a subsequent CREATE_SESSION, if rpc_create() is called and > > > xpt_bc_xprt is populated, it might not be connected (for > > > instance, > > > if a backchannel fault has occurred). Ensure that code path > > > returns > > > a connected xprt also. > > > > > > Reported-by: Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> > > > Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx> > > > --- > > > net/sunrpc/clnt.c | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c > > > index 8b4de70e8ead..570480a649a3 100644 > > > --- a/net/sunrpc/clnt.c > > > +++ b/net/sunrpc/clnt.c > > > @@ -535,6 +535,7 @@ struct rpc_clnt *rpc_create(struct > > > rpc_create_args *args) > > > xprt = args->bc_xprt->xpt_bc_xprt; > > > if (xprt) { > > > xprt_get(xprt); > > > + xprt_set_connected(xprt); > > > return rpc_create_xprt(args, xprt); > > > } > > > } > > > > > > > > > > No. This is wrong. If the connection got disconnected, then the > > client > > needs to reconnect and build a new connection altogether. We can't > > just > > make pretend that the old connection still exists. > > The patch description is not clear: the client has not disconnected. > The forward channel is functioning properly, and the server has set > SEQ4_STATUS_BACKCHANNEL_FAULT. > > To recover, the client sends a DESTROY_SESSION / CREATE_SESSION pair > on the existing connection. On the server, setup_callback_client() > invokes rpc_create() again -- it's this step that is failing during > the second CREATE_SESSION on a connection because the old xprt > is returned but it's still marked disconnected. > How is that happening? As far as I can tell, the only thing that can cause the backchannel to be marked as closed is a call to svc_delete_xprt(), which explicitly calls xprt->xpt_bc_xprt->ops- >close(xprt->xpt_bc_xprt). The server shouldn't be reusing anything from that svc_xprt after that. > An alternative would be to ensure that setup_callback_client() > always puts xpt_bc_xprt before it invokes rpc_create(). But it > looked to me like rpc_create() already has a bunch of logic to > deal with an existing xpt_bc_xprt. > > The connection status is managed by the transport layer, not the RPC client layer. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx