Jeff Layton writes via Kernel.org Bugzilla:

(In reply to Chuck Lever from comment #7)
> The trace captures I've reviewed suggest that a callback session is in use,
> so I would say the NFS minor version is 1 or higher. Perhaps it's not the
> RPC_SIGNALLED test above that is the problem, but the one later in
> nfsd4_cb_sequence_done().

Ok, good. Knowing that it's not v4.0 allows us to rule out some codepaths.

There are a couple of other cases where we goto need_restart: the NFS4ERR_BADSESSION case does this, and so does the case where we don't get a reply at all (case 1).

There is also this, which looks a little sketchy:

------------8<-------------------
	trace_nfsd_cb_free_slot(task, cb);
	nfsd41_cb_release_slot(cb);
	if (RPC_SIGNALLED(task))
		goto need_restart;
out:
	return ret;
retry_nowait:
	if (rpc_restart_call_prepare(task))
		ret = false;
	goto out;
need_restart:
	if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
		trace_nfsd_cb_restart(clp, cb);
		task->tk_status = 0;
		cb->cb_need_restart = true;
	}
	return false;
------------8<-------------------

Probably not the same bug, but it looks like if RPC_SIGNALLED returns true, then it'll restart the RPC after releasing the slot. That could break the reply cache handling, since the restarted call could end up on a different slot. I'll look at patching that, at least, though I'm not sure it's related to the hang.

More notes. The only way RPC_TASK_SIGNALLED gets set is:

nfsd4_process_cb_update()
  rpc_shutdown_client()
    rpc_killall_tasks()

nfsd4_process_cb_update() gets called if:

	if (clp->cl_flags & NFSD4_CLIENT_CB_FLAG_MASK)
		nfsd4_process_cb_update(cb);

Which means that NFSD4_CLIENT_CB_UPDATE was probably set? NFSD4_CLIENT_CB_KILL seems less likely, since need_restart skips setting cb_need_restart when that flag is set, which would nerf the restart handling.

View: https://bugzilla.kernel.org/show_bug.cgi?id=219710#c10

-- 
Kernel.org Bugzilla (bugspray 0.1-dev)
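The concern about restarting the RPC after nfsd41_cb_release_slot() can be illustrated with a small userspace model. This is a hypothetical sketch, not nfsd code: `struct slot_table`, `acquire_slot()`, `release_slot()` and `resend_slot()` are invented names, and the real backchannel slot handling (per-slot sequence IDs, waitqueues, the session reply cache) is far more involved. The model assumes a two-slot table and one competing callback that grabs a slot between the release and the restart. The "hold the slot across the restart" variant is only an assumed alternative ordering for contrast, not the actual patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a backchannel slot table (not nfsd code). */
#define NSLOTS 2

struct slot_table {
	bool busy[NSLOTS];
};

struct cb_call {
	int slot; /* -1 when no slot is held */
};

/* Take the lowest-numbered free slot; returns false if all are busy. */
static bool acquire_slot(struct slot_table *t, struct cb_call *cb)
{
	for (int i = 0; i < NSLOTS; i++) {
		if (!t->busy[i]) {
			t->busy[i] = true;
			cb->slot = i;
			return true;
		}
	}
	return false;
}

/* Give the slot back to the table. */
static void release_slot(struct slot_table *t, struct cb_call *cb)
{
	t->busy[cb->slot] = false;
	cb->slot = -1;
}

/* The model never runs out of slots; fail loudly if it somehow does. */
static void must_acquire(struct slot_table *t, struct cb_call *cb)
{
	bool ok = acquire_slot(t, cb);
	assert(ok && "model ran out of slots");
	(void)ok;
}

/*
 * Call A is sent, its RPC is signalled, and before A is restarted a
 * second callback B takes a free slot. Returns the slot A is resent on.
 */
static int resend_slot(bool release_before_restart)
{
	struct slot_table t = {0};
	struct cb_call a = { .slot = -1 };
	struct cb_call b = { .slot = -1 };

	must_acquire(&t, &a);          /* A is first sent on slot 0 */

	if (release_before_restart)
		release_slot(&t, &a);  /* models: release slot, then goto need_restart */

	must_acquire(&t, &b);          /* B races in before A is restarted */

	if (a.slot < 0)
		must_acquire(&t, &a);  /* restarted A has to pick up a slot again */

	return a.slot;
}
```

With this model, `resend_slot(true)` returns 1 while `resend_slot(false)` returns 0: once the slot is released before the restart, B can take slot 0 and the restarted A lands on slot 1, so the replier would see it as a new request on another slot rather than a retransmission it could answer from its per-slot reply cache.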