Re: does nfsd reset the callback client too hastily?

"NeilBrown" <neilb@xxxxxxx> · Thu, 19 Dec 2024 09:10:44 +1100

On Thu, 19 Dec 2024, Chuck Lever wrote:
> On 12/17/24 9:57 PM, NeilBrown wrote:
> > 
> > Hi,
> >   I've been pondering the messages
> > 
> >   receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt XXXXXXX xid XXXXXX
> > 
> > that turn up occasionally.  Google reports a variety of hits and I've
> > seen them in logs from a customer, though I don't think they were
> > directly related to the customer's problem.
> 
> That message isn't actionable by administrators, and risks filling the
> server's system journal with noise. I suggest that it be removed or
> turned into a trace point.

As Olga notes it has already been removed.  But the point was not about
the message but the behaviour that leads to it.  Does it really make
sense to reset the client when there is no evidence of failure?
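
A rough user-space sketch of that question, not a patch and not kernel
code.  The structs and both function names are made up for illustration;
the only thing taken from the discussion is the idea of checking whether
the transport currently used for callbacks is still one of the session's
backchannel-capable connections before tearing anything down.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the kernel's structures. */
struct bc_conn {
	int  xprt_id;        /* transport this connection runs over */
	bool usable_for_cb;  /* connection is bound for backchannel use */
};

struct cb_state {
	int  cur_xprt_id;             /* transport the callback client uses, -1 if none */
	const struct bc_conn *conns;  /* connections bound to the session */
	size_t nconns;
};

/*
 * True if the transport currently used for callbacks is still among
 * the session's backchannel-capable connections.
 */
bool cb_channel_still_valid(const struct cb_state *s)
{
	if (s->cur_xprt_id < 0)
		return false;
	for (size_t i = 0; i < s->nconns; i++) {
		if (s->conns[i].usable_for_cb &&
		    s->conns[i].xprt_id == s->cur_xprt_id)
			return true;
	}
	return false;
}

/* A probe would only reset the callback client when this returns true. */
bool cb_update_needs_reset(const struct cb_state *s)
{
	return !cb_channel_still_valid(s);
}
```

With something like this, binding an additional connection would leave a
working channel alone, and only a missing or no-longer-bound transport
would trigger the shutdown.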

Maybe I'll send a patch.

> 
> 
> > These messages suggest a callback reply from the client which the server
> > was not expecting.  I think the most likely cause is that the server called
> >    rpc_shutdown_client(clp->cl_cb_client);
> > while there were outstanding callbacks.
> > This causes rpc_killall_tasks() to be called so that the tasks stop
> > waiting for a reply and are discarded.
> > 
> > The rpc_shutdown_client() call can come from nfsd4_process_cb_update()
> > which gets run whenever nfsd4_probe_callback() is called.  This happens
> > in quite a few places including when a new connection is bound to a
> > session.
> > 
> > So if a new connection is bound, the current callback channel is aborted
> > even though it is working perfectly well.  That is particularly
> > problematic as callback requests are not currently retransmitted.
> > 
> > So I'm wondering if nfsd4_process_cb_update() should only shut down the
> > current cb client if there is evidence that it isn't working.
> > 
> > I'm not certain how best to do that.  One option might be to do a search
> > similar to that in __nfsd4_find_backchannel() and see if the current
> > session and xprt are still valid.  There might be a better way.
> > 
> > Thoughts?
> 
> Operating from memory, so this might be crazy talk:
> 
> The fundamental problem is lack of ability to retransmit a callback
> after a reconnect. The rpc_shutdown_client() call tosses all pending RPC
> tasks, making it impossible to retransmit them.
> 
> I'd rather see the rpc_clnt be owned by the session instead of the
> nfs_client. Then the rpc_clnt could be destroyed only when the session
> is actually destroyed, at which point we know it is sensible and safe
> to discard pending callback operations.
> 
> But the callback code is designed to handle both NFSv4.0 and NFSv4.1
> callbacks, even though these are somewhat different beasts.
> 
> NFSv4.0 operates:
> - on a real transport that can reestablish a connection on demand
> - without a session
> 
> NFSv4.1 operates:
> - on a virtual transport, and has to wait for the client to reestablish
>    a connection
> - within a session context that is supposed to survive multiple
>    transport instances
> 
> Some reorganization is needed to successfully re-anchor the rpc_clnt.

That's helpful - thanks.
I wonder if the xprtmultipath infrastructure could be used to attach all
the bound xprts to the rpc_clnt.
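
As a toy model of the combined idea (client owned by the session, every
bound xprt attached to it), something like the following.  This is a
user-space sketch only: struct sess_cb_client and the sess_*() helpers
are invented names, and nothing here reflects the real xprtmultipath
interface.  The point is just that losing a transport leaves pending
callbacks queued, and only destroying the session discards them.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_XPRTS   4
#define MAX_PENDING 8

/* Hypothetical session-owned callback client; not a kernel structure. */
struct sess_cb_client {
	int    xprts[MAX_XPRTS];      /* ids of transports bound to the session */
	size_t nxprts;
	int    pending[MAX_PENDING];  /* xids of callbacks awaiting a reply */
	size_t npending;
};

/* Attach a newly bound transport; existing state is left untouched. */
bool sess_add_xprt(struct sess_cb_client *c, int xprt_id)
{
	if (c->nxprts == MAX_XPRTS)
		return false;
	c->xprts[c->nxprts++] = xprt_id;
	return true;
}

/* Drop a transport; pending callbacks stay queued. */
void sess_remove_xprt(struct sess_cb_client *c, int xprt_id)
{
	for (size_t i = 0; i < c->nxprts; i++) {
		if (c->xprts[i] == xprt_id) {
			c->xprts[i] = c->xprts[--c->nxprts];
			return;
		}
	}
}

/* Record a callback that has been sent and is awaiting its reply. */
bool sess_queue_cb(struct sess_cb_client *c, int xid)
{
	if (c->npending == MAX_PENDING)
		return false;
	c->pending[c->npending++] = xid;
	return true;
}

/* Pending callbacks can be (re)sent once at least one transport is bound. */
bool sess_can_transmit(const struct sess_cb_client *c)
{
	return c->nxprts > 0 && c->npending > 0;
}

/* Only session destruction discards pending callbacks; returns the count dropped. */
size_t sess_destroy(struct sess_cb_client *c)
{
	size_t dropped = c->npending;

	c->npending = 0;
	c->nxprts = 0;
	return dropped;
}
```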

I'll see what I can come up with.

NeilBrown

> 
> -- 
> Chuck Lever
>