Re: does nfsd reset the callback client too hastily?

Chuck Lever <chuck.lever@xxxxxxxxxx> · Wed, 18 Dec 2024 08:51:00 -0500

On 12/17/24 9:57 PM, NeilBrown wrote:

Hi,
  I've been pondering the messages

  receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt XXXXXXX xid XXXXXX

that turn up occasionally.  Google reports a variety of hits and I've
seen them in logs from a customer, though I don't think they were
directly related to the customer's problem.

That message isn't actionable by administrators, and risks filling the
server's system journal with noise. I suggest that it be removed or
turned into a trace point.

These messages suggest a callback reply from the client which the server
was not expecting.  I think the most likely cause is that the server called
   rpc_shutdown_client(clp->cl_cb_client);
while there were outstanding callbacks.
This causes rpc_killall_tasks() to be called so that the tasks stop
waiting for a reply and are discarded.
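
To make the suspected sequence concrete, here is a minimal user-space
sketch. It is not kernel code: struct model_bc_xprt and the model_*()
helpers are hypothetical stand-ins for the backchannel transport's table
of waiting requests, and the real lookup works differently in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_SLOTS 4

/* Hypothetical model: xids of callbacks still waiting for a reply. */
struct model_bc_xprt {
	unsigned int xids[MODEL_MAX_SLOTS];
	size_t nr;
};

/* Record a callback as sent; returns false if the table is full. */
bool model_send_cb(struct model_bc_xprt *x, unsigned int xid)
{
	assert(x != NULL);
	if (x->nr == MODEL_MAX_SLOTS)
		return false;
	x->xids[x->nr++] = xid;
	return true;
}

/* Models the effect of killing all tasks: nothing waits any more. */
void model_kill_all(struct model_bc_xprt *x)
{
	assert(x != NULL);
	x->nr = 0;
}

/*
 * Returns true if the reply matches a waiting request and consumes it,
 * false if the xid is unrecognized (the case that gets logged).
 */
bool model_receive_reply(struct model_bc_xprt *x, unsigned int xid)
{
	assert(x != NULL);
	for (size_t i = 0; i < x->nr; i++) {
		if (x->xids[i] == xid) {
			/* Swap the last entry into the freed slot. */
			x->xids[i] = x->xids[--x->nr];
			return true;
		}
	}
	return false;
}
```

In this model, a reply that arrives after model_kill_all() finds no
waiting entry for its xid, which would correspond to the "unrecognized
reply" message above.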

The rpc_shutdown_client() call can come from nfsd4_process_cb_update()
which gets run whenever nfsd4_probe_callback() is called.  This happens
in quite a few places including when a new connection is bound to a
session.

So if a new connection is bound, the current callback channel is aborted
even though it is working perfectly well.  That is particularly
problematic as callback requests are not currently retransmitted.

So I'm wondering if nfsd4_process_cb_update() should only shut down the
current cb client if there is evidence that it isn't working.

I'm not certain how best to do that.  One option might be to do a search
similar to that in __nfsd4_find_backchannel() and see if the current
session and xprt are still valid.  There might be a better way.
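
A rough user-space sketch of that check, under heavy assumptions: the
model_* structures and functions below are hypothetical simplifications,
not the nfsd data structures, and locking is ignored entirely. The idea
is only to show "tear down the cb client when the session/xprt pair it
uses is no longer usable, otherwise leave it alone".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel objects. */
struct model_conn {
	int xprt_id;       /* identifies the underlying transport */
	bool backchannel;  /* connection is bound for the back channel */
	bool closed;       /* transport has gone away */
};

struct model_session {
	struct model_conn *conns;
	size_t nr_conns;
};

struct model_client {
	struct model_session *sessions;
	size_t nr_sessions;
	const struct model_session *cb_session; /* session the cb client uses */
	int cb_xprt_id;                         /* transport the cb client uses */
	bool cb_client_active;
	unsigned int shutdowns;                 /* counts teardowns, for illustration */
};

/* True if the current cb session and xprt are still bound and usable. */
bool model_cb_channel_valid(const struct model_client *clp)
{
	assert(clp != NULL);
	if (!clp->cb_client_active || !clp->cb_session)
		return false;
	for (size_t i = 0; i < clp->nr_sessions; i++) {
		const struct model_session *ses = &clp->sessions[i];

		if (ses != clp->cb_session)
			continue;
		for (size_t j = 0; j < ses->nr_conns; j++) {
			const struct model_conn *c = &ses->conns[j];

			if (c->xprt_id == clp->cb_xprt_id &&
			    c->backchannel && !c->closed)
				return true;
		}
	}
	return false;
}

/*
 * Returns true if the callback channel needs to be set up again.
 * A working channel is left untouched, so its pending callbacks survive.
 */
bool model_process_cb_update(struct model_client *clp)
{
	if (model_cb_channel_valid(clp))
		return false;
	if (clp->cb_client_active) {
		clp->cb_client_active = false;
		clp->shutdowns++;
	}
	return true;
}
```

With this shape, binding an extra connection to the session would not
disturb a channel that is still valid; only a closed or unbound
transport would trigger the shutdown.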

Thoughts?

Operating from memory, so this might be crazy talk:

The fundamental problem is lack of ability to retransmit a callback
after a reconnect. The rpc_shutdown_client() call tosses all pending RPC
tasks, making it impossible to retransmit them.

I'd rather see the rpc_clnt be owned by the session instead of the
nfs_client. Then the rpc_clnt could be destroyed only when the session
is actually destroyed, at which point we know it is sensible and safe
to discard pending callback operations.
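
As a sketch of the ownership change (user-space model only; every
sketch_* name is hypothetical and none of this is existing nfsd code):
the session owns the callback client, a lost connection leaves pending
callbacks queued, a newly bound connection retransmits them, and only
session destruction discards them.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_PENDING 8

/* Hypothetical callback client: xids awaiting a reply. */
struct sketch_cb_clnt {
	unsigned int pending[SKETCH_MAX_PENDING];
	size_t nr_pending;
	unsigned int sent;      /* total transmissions, including retransmits */
};

/* The session, not the client, owns the callback client. */
struct sketch_session {
	struct sketch_cb_clnt cb;
	int xprt_id;            /* -1 when no connection is bound */
};

/* Queue a callback; transmit immediately only if a connection exists. */
int sketch_cb_queue(struct sketch_session *ses, unsigned int xid)
{
	assert(ses != NULL);
	if (ses->cb.nr_pending == SKETCH_MAX_PENDING)
		return -1;
	ses->cb.pending[ses->cb.nr_pending++] = xid;
	if (ses->xprt_id >= 0)
		ses->cb.sent++;
	return 0;
}

/* Connection loss keeps the pending callbacks queued. */
void sketch_conn_lost(struct sketch_session *ses)
{
	assert(ses != NULL);
	ses->xprt_id = -1;
}

/* A new connection retransmits everything pending; returns the count. */
size_t sketch_conn_bound(struct sketch_session *ses, int xprt_id)
{
	assert(ses != NULL);
	ses->xprt_id = xprt_id;
	ses->cb.sent += (unsigned int)ses->cb.nr_pending;
	return ses->cb.nr_pending;
}

/* Only session destruction discards callbacks; returns how many. */
size_t sketch_session_destroy(struct sketch_session *ses)
{
	size_t discarded;

	assert(ses != NULL);
	discarded = ses->cb.nr_pending;
	ses->cb.nr_pending = 0;
	ses->xprt_id = -1;
	return discarded;
}
```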

But the callback code is designed to handle both NFSv4.0 and NFSv4.1
callbacks, even though these are somewhat different beasts.

NFSv4.0 operates:
- on a real transport that can reestablish a connection on demand
- without a session

NFSv4.1 operates:
- on a virtual transport, and has to wait for the client to reestablish
  a connection
- within a session context that is supposed to survive multiple
  transport instances
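
The difference in recovery behavior could be summarized roughly like
this. It is a sketch only; the enum and function names are made up for
illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the two callback flavors. */
enum model_minor { MODEL_V40, MODEL_V41 };

enum model_cb_action {
	MODEL_CB_SEND_NOW,          /* transport is up, just send */
	MODEL_CB_RECONNECT_AND_SEND,/* server can reconnect on demand */
	MODEL_CB_WAIT_FOR_CLIENT    /* only the client can reconnect */
};

/* What the server can do with a pending callback in each case. */
enum model_cb_action model_cb_next_action(enum model_minor minor,
					  bool connected)
{
	assert(minor == MODEL_V40 || minor == MODEL_V41);
	if (connected)
		return MODEL_CB_SEND_NOW;
	if (minor == MODEL_V40)
		return MODEL_CB_RECONNECT_AND_SEND;
	/* v4.1: the backchannel rides on a connection the client creates. */
	return MODEL_CB_WAIT_FOR_CLIENT;
}
```

In the NFSv4.1 case the pending callback has to stay queued somewhere
while the server waits, which is why it matters who owns the rpc_clnt.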

Some reorganization is needed to successfully re-anchor the rpc_clnt.

--
Chuck Lever