On 1/21/25 2:38 PM, Tom Talpey wrote:
On 1/21/2025 12:35 PM, Jeff Layton via Bugspray Bot wrote:
Jeff Layton writes via Kernel.org Bugzilla:
(In reply to Chuck Lever from comment #7)
The trace captures I've reviewed suggest that a callback session is in
use, so I would say the NFS minor version is 1 or higher. Perhaps it's
not the RPC_SIGNALLED test above that is the problem, but the one later
in nfsd4_cb_sequence_done().
Ok, good. Knowing that it's not v4.0 allows us to rule out some
codepaths.
There are a couple of other cases where we goto need_restart: the
NFS4ERR_BADSESSION case does this, and so does the case where no reply
arrives at all (case 1).
Note that one thread in Benoît's recent logs is stuck in
nfsd4_bind_conn_to_session(), and three threads also in
nfsd4_destroy_session(), so there is certainly some
session/connection dance going on. Adding an invalid replay
cache entry to that could easily make things worse.
Yes, the client returns RETRY_UNCACHED_REP for some of the backchannel
operations. NFSD never asserts cachethis in CB_SEQUENCE. I'm trying to
understand why NFSD would skip incrementing its slot sequence number.
There's also one thread in nfsd4_destroy_clientid(), which
seems important, but odd. And finally, the laundromat is
running. No shortage of races!
The hangs are all related here: they are waiting for flush_workqueue()
on the callback workqueue. In v6.1, there is only one callback_wq and
its max_active is 1. If the current work item hangs, then that work
queue stalls.
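
To make the stall concrete, here is a rough user-space model of an
ordered queue. It is only a sketch, not the kernel workqueue API: the
struct and the model_* helpers are hypothetical names made up for
illustration, and it assumes a hung handler never gives back its
execution slot.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; not the kernel workqueue API. */
struct work_item {
	bool hangs;	/* true if the handler never returns */
	bool done;	/* set once the handler has completed */
};

/*
 * Run the items in queue order with at most max_active handlers in
 * flight. A hung handler occupies its execution slot forever.
 * Returns the number of items that completed.
 */
static size_t model_run_queue(struct work_item *items, size_t n,
			      int max_active)
{
	int free_slots = max_active;
	size_t completed = 0;

	for (size_t i = 0; i < n && free_slots > 0; i++) {
		if (items[i].hangs) {
			free_slots--;	/* slot is never returned */
			continue;
		}
		items[i].done = true;
		completed++;
	}
	return completed;
}

/* A flush can only return once every queued item has completed. */
static bool model_flush_would_return(const struct work_item *items,
				     size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!items[i].done)
			return false;
	return true;
}
```

With max_active set to 1, a hang in the second of four items leaves the
last two unrun and the flush never returns. Raising max_active in this
model lets later items run, but the flush still waits on the hung item.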
Tom.
There is also this that looks a little sketchy:
------------8<-------------------
	trace_nfsd_cb_free_slot(task, cb);
	nfsd41_cb_release_slot(cb);
	if (RPC_SIGNALLED(task))
		goto need_restart;
out:
	return ret;
retry_nowait:
	if (rpc_restart_call_prepare(task))
		ret = false;
	goto out;
need_restart:
	if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
		trace_nfsd_cb_restart(clp, cb);
		task->tk_status = 0;
		cb->cb_need_restart = true;
	}
	return false;
------------8<-------------------
Probably not the same bug, but it looks like if RPC_SIGNALLED returns
true, then it'll restart the RPC after releasing the slot. It seems
like that could break the reply cache handling, as the restarted call
could be on a different slot. I'll look at patching that, at least,
though I'm not sure it's related to the hang.
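
The concern above about the restarted call landing on a different slot
can be sketched with a toy model. This is not the nfsd code: the
model_* names, the two-slot table, and the lowest-free-slot allocation
policy are all assumptions made for illustration.

```c
#include <stdbool.h>

#define MODEL_NSLOTS 2	/* hypothetical slot count for this sketch */

/* Hypothetical stand-in for the backchannel slot table. */
struct model_session {
	bool busy[MODEL_NSLOTS];
};

/* Assumption: hand out the lowest-numbered free slot, or -1 if none. */
static int model_get_slot(struct model_session *s)
{
	for (int i = 0; i < MODEL_NSLOTS; i++) {
		if (!s->busy[i]) {
			s->busy[i] = true;
			return i;
		}
	}
	return -1;
}

static void model_release_slot(struct model_session *s, int slot)
{
	s->busy[slot] = false;
}

/*
 * Release first, restart second. If another callback takes the freed
 * slot in between, the restarted call ends up on a different slot.
 * Returns the slot used by the restarted call.
 */
static int model_restart_after_release(struct model_session *s, int slot,
				       bool other_cb_races)
{
	model_release_slot(s, slot);
	if (other_cb_races)
		(void)model_get_slot(s);	/* competing callback wins the slot */
	return model_get_slot(s);
}
```

In this model, a call that started on slot 0 is restarted on slot 1
whenever a competing callback grabs slot 0 in the window between the
release and the restart. Without the race it gets slot 0 back.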
More notes. The only way RPC_TASK_SIGNALLED gets set is:
nfsd4_process_cb_update()
    rpc_shutdown_client()
        rpc_killall_tasks()
That gets called if:
	if (clp->cl_flags & NFSD4_CLIENT_CB_FLAG_MASK)
		nfsd4_process_cb_update(cb);
Which means that NFSD4_CLIENT_CB_UPDATE was probably set?
NFSD4_CLIENT_CB_KILL seems less likely since that would nerf the
cb_need_restart handling.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c10
--
Chuck Lever