Re: NFSD threads hang when destroying a session or client ID

Jeff Layton via Bugspray Bot <bugbot@xxxxxxxxxx> · Tue, 21 Jan 2025 17:35:14 +0000

Jeff Layton writes via Kernel.org Bugzilla:

(In reply to Chuck Lever from comment #7)
> The trace captures I've reviewed suggest that a callback session is in use,
> so I would say the NFS minor version is 1 or higher. Perhaps it's not the
> RPC_SIGNALLED test above that is the problem, but the one later in
> nfsd4_cb_sequence_done().

Ok, good. Knowing that it's not v4.0 allows us to rule out some codepaths.
There are a couple of other cases where we goto need_restart:

The NFS4ERR_BADSESSION case does this, as does the case where no reply arrives at all (case 1).
There is also this, which looks a little sketchy:

------------8<-------------------
        trace_nfsd_cb_free_slot(task, cb);
        nfsd41_cb_release_slot(cb);

        if (RPC_SIGNALLED(task))
                goto need_restart;
out:
        return ret;
retry_nowait:
        if (rpc_restart_call_prepare(task))
                ret = false;
        goto out;
need_restart:
        if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
                trace_nfsd_cb_restart(clp, cb);
                task->tk_status = 0;
                cb->cb_need_restart = true;
        }
        return false;
------------8<-------------------

Probably not the same bug, but it looks like if RPC_SIGNALLED returns true, then it'll restart the RPC after releasing the slot. That could break the reply cache handling, since the restarted call could land on a different slot. I'll look at patching that, at least, though I'm not sure it's related to the hang.
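
To illustrate the slot concern, here is a small user-space model, not kernel code. Everything in it (struct model_session, model_grab_slot(), model_release_slot(), model_restart_keeps_slot(), the table size and the lowest-free-slot policy) is made up for the sketch. It only assumes that a released slot can be claimed by some other callback before the restarted call asks for one:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NSLOTS 4	/* hypothetical slot table size */

/* Hypothetical stand-in for a callback session's slot table. */
struct model_session {
	bool busy[MODEL_NSLOTS];	/* true while a callback owns the slot */
};

/* Claim the lowest free slot, or return -1 if every slot is busy. */
static int model_grab_slot(struct model_session *s)
{
	for (int i = 0; i < MODEL_NSLOTS; i++) {
		if (!s->busy[i]) {
			s->busy[i] = true;
			return i;
		}
	}
	return -1;
}

static void model_release_slot(struct model_session *s, int slot)
{
	assert(slot >= 0 && slot < MODEL_NSLOTS);
	s->busy[slot] = false;
}

/*
 * Callback A takes a slot, a competing callback B shows up, and then A is
 * restarted. Returns true if A's second attempt uses the slot it started on.
 */
static bool model_restart_keeps_slot(bool release_before_restart)
{
	struct model_session s = { 0 };
	int first = model_grab_slot(&s);	/* A gets slot 0 */
	int second;

	if (release_before_restart) {
		model_release_slot(&s, first);	/* A gives slot 0 back */
		(void)model_grab_slot(&s);	/* B takes slot 0 */
		second = model_grab_slot(&s);	/* restarted A gets slot 1 */
	} else {
		(void)model_grab_slot(&s);	/* B has to take slot 1 */
		second = first;			/* A retries on the slot it holds */
	}
	return second == first;
}
```

In this model, model_restart_keeps_slot(false) is true and model_restart_keeps_slot(true) is false. If the reply cache is keyed per slot, the second case means the restarted call would not be matched against the original one.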

More notes. The only way RPC_TASK_SIGNALLED gets set is:

   nfsd4_process_cb_update()
      rpc_shutdown_client()
          rpc_killall_tasks()

That gets called if:

        if (clp->cl_flags & NFSD4_CLIENT_CB_FLAG_MASK)
                nfsd4_process_cb_update(cb);

Which means that NFSD4_CLIENT_CB_UPDATE was probably set? NFSD4_CLIENT_CB_KILL seems less likely since that would nerf the cb_need_restart handling.
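
The reasoning about the two flags can be written down as a tiny user-space sketch. This is not the kernel code: the MODEL_* names and bit numbers are invented here, and it assumes the flag mask covers exactly the CB_UPDATE and CB_KILL bits. It combines the two conditions quoted above: the mask test that leads to nfsd4_process_cb_update(), and the CB_KILL test under need_restart.

```c
#include <stdbool.h>

/* Illustrative bit numbers only; not copied from the kernel headers. */
enum { MODEL_CB_UPDATE = 0, MODEL_CB_KILL = 1 };

/* Assumption: the mask is just these two bits. */
#define MODEL_CB_FLAG_MASK \
	((1UL << MODEL_CB_UPDATE) | (1UL << MODEL_CB_KILL))

/* Mirrors the "if (clp->cl_flags & ..._CB_FLAG_MASK)" check. */
static bool model_runs_cb_update(unsigned long cl_flags)
{
	return (cl_flags & MODEL_CB_FLAG_MASK) != 0;
}

/* Mirrors the "!test_bit(..._CB_KILL, ...)" check under need_restart. */
static bool model_marks_need_restart(unsigned long cl_flags)
{
	return !(cl_flags & (1UL << MODEL_CB_KILL));
}

/*
 * True when the tasks could have been signalled (the update path ran)
 * and the need_restart path would still set cb_need_restart.
 */
static bool model_signalled_and_restarted(unsigned long cl_flags)
{
	return model_runs_cb_update(cl_flags) &&
	       model_marks_need_restart(cl_flags);
}
```

Under those assumptions, only a flag word with CB_UPDATE set and CB_KILL clear gives both a signalled task and a restart, which is why CB_UPDATE looks like the more likely candidate.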

View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c10