On 12/17/24 9:57 PM, NeilBrown wrote:
> Hi,
>  I've been pondering the messages
>
>     receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt XXXXXXX xid XXXXXX
>
> that turn up occasionally. Google reports a variety of hits, and I've seen them in logs from a customer, though I don't think they were directly related to the customer's problem.
That message isn't actionable by administrators, and risks filling the server's system journal with noise. I suggest that it be removed or turned into a trace point.
> These messages suggest a callback reply from the client which the server was not expecting. I think the most likely cause is that the server called
>
>     rpc_shutdown_client(clp->cl_cb_client);
>
> while there were outstanding callbacks. This causes rpc_killall_tasks() to be called so that the tasks stop waiting for a reply and are discarded.
>
> The rpc_shutdown_client() call can come from nfsd4_process_cb_update(), which gets run whenever nfsd4_probe_callback() is called. This happens in quite a few places, including when a new connection is bound to a session. So if a new connection is bound, the current callback channel is aborted even though it is working perfectly well. That is particularly problematic as callback requests are not currently retransmitted.
>
> So I'm wondering if nfsd4_process_cb_update() should only shut down the current cb client if there is evidence that it isn't working. I'm not certain how best to do that. One option might be to do a search similar to that in __nfsd4_find_backchannel() and see if the current session and xprt are still valid. There might be a better way.
>
> Thoughts?
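
The check proposed above could be sketched roughly as follows. This is a user-space model only, not kernel code: the model_* structures, their fields, and both helper functions are hypothetical stand-ins invented for illustration, and the real search in __nfsd4_find_backchannel() may differ in detail. The idea shown is simply "tear down the callback client only when the transport it is using can no longer be found as a live backchannel connection on its session".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one connection bound to a session. */
struct model_conn {
	int xprt_id;		/* identifies the transport */
	bool backchannel;	/* connection may carry callbacks */
	bool alive;		/* transport is still connected */
};

/* Hypothetical stand-in for a session and its bound connections. */
struct model_session {
	struct model_conn *conns;
	size_t nconns;
};

/* Hypothetical record of what the current callback client is using. */
struct model_cb_state {
	const struct model_session *session;
	int xprt_id;
	bool in_use;		/* a callback client currently exists */
};

/* True if the transport in use is still a live backchannel connection. */
static bool cb_channel_still_valid(const struct model_cb_state *cb)
{
	if (!cb->in_use || cb->session == NULL)
		return false;

	for (size_t i = 0; i < cb->session->nconns; i++) {
		const struct model_conn *c = &cb->session->conns[i];

		if (c->xprt_id == cb->xprt_id)
			return c->backchannel && c->alive;
	}
	return false;	/* transport is no longer bound to the session */
}

/* Shut down only when a client exists and its channel looks broken. */
static bool should_shutdown_cb_client(const struct model_cb_state *cb)
{
	return cb->in_use && !cb_channel_still_valid(cb);
}
```

With a check of this shape, binding an additional connection to the session would leave a working callback channel alone, and outstanding callbacks would not be killed.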
Operating from memory, so this might be crazy talk:

The fundamental problem is lack of ability to retransmit a callback after a reconnect. The rpc_shutdown_client() call tosses all pending RPC tasks, making it impossible to retransmit them.

I'd rather see the rpc_clnt be owned by the session instead of the nfs_client. Then the rpc_clnt could be destroyed only when the session is actually destroyed, at which point we know it is sensible and safe to discard pending callback operations.

But the callback code is designed to handle both NFSv4.0 and NFSv4.1 callbacks, even though these are somewhat different beasts.

NFSv4.0 operates:
 - on a real transport that can reestablish a connection on demand
 - without a session

NFSv4.1 operates:
 - on a virtual transport, and has to wait for the client to reestablish a connection
 - within a session context that is supposed to survive multiple transport instances

Some reorganization is needed to successfully re-anchor the rpc_clnt.

--
Chuck Lever
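
To illustrate the session-owned arrangement described above, here is a small user-space model, not kernel code. Every name in it (the model_* types and functions, the fixed-size queue) is hypothetical and chosen only for the sketch. It assumes that pending callbacks live with the session, so losing or replacing the transport requeues them for retransmission instead of discarding them.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_CB	8
#define MODEL_NO_XPRT	(-1)

enum model_cb_status { MODEL_CB_QUEUED, MODEL_CB_SENT, MODEL_CB_DONE };

struct model_cb {
	unsigned int xid;
	enum model_cb_status status;
};

/* Hypothetical session that owns its callbacks across transports. */
struct model_session {
	int xprt_id;	/* MODEL_NO_XPRT while waiting for the client */
	struct model_cb cbs[MODEL_MAX_CB];
	size_t ncbs;
};

static void model_session_init(struct model_session *s)
{
	s->xprt_id = MODEL_NO_XPRT;
	s->ncbs = 0;
}

/* Send everything that is queued, if a transport is available. */
static void model_transmit(struct model_session *s)
{
	if (s->xprt_id == MODEL_NO_XPRT)
		return;
	for (size_t i = 0; i < s->ncbs; i++)
		if (s->cbs[i].status == MODEL_CB_QUEUED)
			s->cbs[i].status = MODEL_CB_SENT;
}

static bool model_queue_cb(struct model_session *s, unsigned int xid)
{
	if (s->ncbs == MODEL_MAX_CB)
		return false;
	s->cbs[s->ncbs].xid = xid;
	s->cbs[s->ncbs].status = MODEL_CB_QUEUED;
	s->ncbs++;
	model_transmit(s);
	return true;
}

/* Transport went away: keep unanswered callbacks for retransmission. */
static void model_conn_lost(struct model_session *s)
{
	s->xprt_id = MODEL_NO_XPRT;
	for (size_t i = 0; i < s->ncbs; i++)
		if (s->cbs[i].status == MODEL_CB_SENT)
			s->cbs[i].status = MODEL_CB_QUEUED;
}

/* Client bound a new connection: switch to it and retransmit. */
static void model_bind_conn(struct model_session *s, int xprt_id)
{
	model_conn_lost(s);
	s->xprt_id = xprt_id;
	model_transmit(s);
}

/* Returns false for a reply that matches no outstanding callback. */
static bool model_reply(struct model_session *s, unsigned int xid)
{
	for (size_t i = 0; i < s->ncbs; i++) {
		if (s->cbs[i].xid == xid && s->cbs[i].status == MODEL_CB_SENT) {
			s->cbs[i].status = MODEL_CB_DONE;
			return true;
		}
	}
	return false;
}

static size_t model_count(const struct model_session *s,
			  enum model_cb_status status)
{
	size_t n = 0;

	for (size_t i = 0; i < s->ncbs; i++)
		if (s->cbs[i].status == status)
			n++;
	return n;
}
```

In this model the callbacks are dropped only when the session structure itself goes away, which mirrors the point at which discarding pending callback operations is known to be safe. The NFSv4.0 case, which has no session, is deliberately left out of the sketch.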