> On Aug 19, 2021, at 10:14 AM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote: > > On Thu, 2021-08-19 at 13:16 +0000, Chuck Lever III wrote: >> >> >>> On Aug 19, 2021, at 9:01 AM, Trond Myklebust >>> <trondmy@xxxxxxxxxxxxxxx> wrote: >>> >>> On Thu, 2021-08-19 at 08:29 -0400, Chuck Lever wrote: >>>> With NFSv4.1+ on RDMA, backchannel recovery appears not to work. >>>> >>>> xprt_setup_xxx_bc() is invoked by the client's first >>>> CREATE_SESSION >>>> operation, and it always marks the rpc_clnt's transport as >>>> connected. >>>> >>>> On a subsequent CREATE_SESSION, if rpc_create() is called and >>>> xpt_bc_xprt is populated, it might not be connected (for >>>> instance, >>>> if a backchannel fault has occurred). Ensure that code path >>>> returns >>>> a connected xprt also. >>>> >>>> Reported-by: Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> >>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx> >>>> --- >>>> net/sunrpc/clnt.c | 1 + >>>> 1 file changed, 1 insertion(+) >>>> >>>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c >>>> index 8b4de70e8ead..570480a649a3 100644 >>>> --- a/net/sunrpc/clnt.c >>>> +++ b/net/sunrpc/clnt.c >>>> @@ -535,6 +535,7 @@ struct rpc_clnt *rpc_create(struct >>>> rpc_create_args *args) >>>> xprt = args->bc_xprt->xpt_bc_xprt; >>>> if (xprt) { >>>> xprt_get(xprt); >>>> + xprt_set_connected(xprt); >>>> return rpc_create_xprt(args, xprt); >>>> } >>>> } >>>> >>>> >>> >>> No. This is wrong. If the connection got disconnected, then the >>> client >>> needs to reconnect and build a new connection altogether. We can't >>> just >>> make pretend that the old connection still exists. >> >> The patch description is not clear: the client has not disconnected. >> The forward channel is functioning properly, and the server has set >> SEQ4_STATUS_BACKCHANNEL_FAULT. >> >> To recover, the client sends a DESTROY_SESSION / CREATE_SESSION pair >> on the existing connection. On the server, setup_callback_client() >> invokes rpc_create() again -- it's this step that is failing during >> the second CREATE_SESSION on a connection because the old xprt >> is returned but it's still marked disconnected. > > How is that happening? The RPC client's autodisconnect is firing. This was fixed in 6820bf77864d ("svcrdma: disable timeouts on rdma backchannel"). However we're still seeing cases where backchannel recovery is not working. See: https://lore.kernel.org/linux-nfs/c3e31e57-15fa-8f70-bbb8-af5452c1f5fc@xxxxxxxxxxxxxxxx/T/#t As an experiment, we've reverted 6820bf77864d to try to provoke the bug. > As far as I can tell, the only thing that can > cause the backchannel to be marked as closed is a call to > svc_delete_xprt(), which explicitly calls xprt->xpt_bc_xprt->ops->close(xprt->xpt_bc_xprt). > > The server shouldn't be reusing anything from that svc_xprt after that. >> An alternative would be to ensure that setup_callback_client() >> always puts xpt_bc_xprt before it invokes rpc_create(). But it >> looked to me like rpc_create() already has a bunch of logic to >> deal with an existing xpt_bc_xprt. > > The connection status is managed by the transport layer, not the RPC > client layer. If it's a bug to mark a backchannel transport disconnected, then asserts should be added to capture the problem. What is the purpose of this code in rpc_create() ? 533 if (args->bc_xprt) { 534 WARN_ON_ONCE(!(args->protocol & XPRT_TRANSPORT_BC)); 535 xprt = args->bc_xprt->xpt_bc_xprt; 536 if (xprt) { 537 xprt_get(xprt); 538 return rpc_create_xprt(args, xprt); 539 } 540 } -- Chuck Lever