does nfsd reset the callback client too hastily?

"NeilBrown" <neilb@xxxxxxx> · Wed, 18 Dec 2024 13:57:55 +1100

Hi,
 I've been pondering the messages

 receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt XXXXXXX xid XXXXXX

that turn up occasionally.  Google reports a variety of hits and I've
seen them in logs from a customer, though I don't think they were
directly related to the customer's problem.

These messages suggest a callback reply from the client which the server
was not expecting.  I think the most likely cause is that the server called
  rpc_shutdown_client(clp->cl_cb_client);
while there were outstanding callbacks.
This causes rpc_killall_tasks() to be called so that the tasks stop
waiting for a reply and are discarded.
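
To make the suspected sequence concrete, here is a toy user-space model
(not kernel code).  The struct and the three helpers are hypothetical
names invented for illustration; they only mimic the idea that a reply is
matched against callbacks still waiting, and that discarding the waiters
leaves a late reply with nothing to match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 8

/* Hypothetical stand-in for the set of callbacks awaiting a reply. */
struct pending_table {
	uint32_t xid[MAX_PENDING];
	bool     in_use[MAX_PENDING];
};

/* Record that a callback with this XID has been sent. */
static bool track_callback(struct pending_table *t, uint32_t xid)
{
	assert(t != NULL);
	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (!t->in_use[i]) {
			t->xid[i] = xid;
			t->in_use[i] = true;
			return true;
		}
	}
	return false;	/* table full */
}

/* Models the effect of killing all tasks: nothing waits for a reply. */
static void discard_all(struct pending_table *t)
{
	assert(t != NULL);
	for (size_t i = 0; i < MAX_PENDING; i++)
		t->in_use[i] = false;
}

/* True if the reply matches a waiting callback, false if "unrecognized". */
static bool match_reply(struct pending_table *t, uint32_t xid)
{
	assert(t != NULL);
	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (t->in_use[i] && t->xid[i] == xid) {
			t->in_use[i] = false;
			return true;
		}
	}
	return false;
}
```

In this model, a reply that arrives after discard_all() fails to match
even though the client answered correctly, which is the situation the log
message would be reporting.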

The rpc_shutdown_client() call can come from nfsd4_process_cb_update()
which gets run whenever nfsd4_probe_callback() is called.  This happens
in quite a few places including when a new connection is bound to a
session.

So if a new connection is bound, the current callback channel is aborted
even though it is working perfectly well.  That is particularly
problematic as callback requests are not currently retransmitted.

So I'm wondering if nfsd4_process_cb_update() should only shut down the
current cb client if there is evidence that it isn't working.

I'm not certain how best to do that.  One option might be to do a search
similar to that in __nfsd4_find_backchannel() and see if the current
session and xprt are still valid.  There might be a better way.
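
A rough sketch of that idea, as a self-contained user-space model rather
than a patch.  All of the toy_* types, fields and functions are
hypothetical and do not correspond to the real nfsd structures; the flag
assignment merely stands in for rpc_shutdown_client().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one connection bound to a session. */
struct toy_conn {
	int  xprt_id;
	bool backchannel;	/* connection may carry callbacks */
	bool dead;		/* transport has gone away */
};

/* Hypothetical model of a session and its connections. */
struct toy_session {
	int id;
	const struct toy_conn *conns;
	size_t nconns;
};

/* Hypothetical model of a client and its current callback channel. */
struct toy_client {
	const struct toy_session *sessions;
	size_t nsessions;
	int  cb_session_id;	/* session the callback client uses */
	int  cb_xprt_id;	/* transport the callback client uses */
	bool cb_client_up;
};

/* Is the current callback session/xprt pair still present and usable? */
static bool toy_cb_channel_valid(const struct toy_client *clp)
{
	assert(clp != NULL);
	if (!clp->cb_client_up)
		return false;
	for (size_t i = 0; i < clp->nsessions; i++) {
		const struct toy_session *ses = &clp->sessions[i];

		if (ses->id != clp->cb_session_id)
			continue;
		for (size_t j = 0; j < ses->nconns; j++) {
			const struct toy_conn *c = &ses->conns[j];

			if (c->xprt_id == clp->cb_xprt_id &&
			    c->backchannel && !c->dead)
				return true;
		}
	}
	return false;
}

/* Tear down the callback client only when the channel is not valid.
 * Returns true if a teardown happened. */
static bool toy_process_cb_update(struct toy_client *clp)
{
	if (toy_cb_channel_valid(clp))
		return false;	/* keep the working channel */
	clp->cb_client_up = false;	/* stands in for rpc_shutdown_client() */
	return true;
}
```

With this shape, binding an extra connection would leave a working
channel alone, and only a missing session or dead xprt would trigger the
shutdown.  Whether a check like this is enough "evidence" is exactly the
open question.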

Thoughts?

Thanks,
NeilBrown