On 1/23/25 8:50 AM, Jeff Layton via Bugspray Bot wrote:
Jeff Layton writes via Kernel.org Bugzilla:
There is another scenario that could explain a hang here. From nfsd4_cb_sequence_done():
------------------8<---------------------
	case -NFS4ERR_BADSLOT:
		goto retry_nowait;
	case -NFS4ERR_SEQ_MISORDERED:
		if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) {
			session->se_cb_seq_nr[cb->cb_held_slot] = 1;
			goto retry_nowait;
		}
		break;
	default:
		nfsd4_mark_cb_fault(cb->cb_clp);
	}
	trace_nfsd_cb_free_slot(task, cb);
	nfsd41_cb_release_slot(cb);
	if (RPC_SIGNALLED(task))
		goto need_restart;
out:
	return ret;
retry_nowait:
	if (rpc_restart_call_prepare(task))
		ret = false;
	goto out;
------------------8<---------------------
Since the function doesn't check RPC_SIGNALLED in the v4.1+ case until very late, a BADSLOT or SEQ_MISORDERED error can cause the callback client to resubmit the rpc_task to the RPC engine immediately, without resubmitting it to the callback workqueue.
I think we should assume that when RPC_SIGNALLED returns true the result is suspect, and that we should halt further processing of the CB_SEQUENCE response and restart the callback.
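To make the ordering concern concrete, here is a toy user-space model of the control flow quoted above. It is only a sketch and not kernel code: toy_task, seq_status, outcome, and both functions are hypothetical stand-ins invented for illustration. The signalled flag models RPC_SIGNALLED(task), and the restarts counter models a direct call to rpc_restart_call_prepare().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel types or functions. */
enum seq_status { SEQ_OK, SEQ_BADSLOT };
enum outcome { DONE, RETRY_NOWAIT, NEED_RESTART };

struct toy_task {
	bool signalled;	/* models RPC_SIGNALLED(task) */
	int restarts;	/* counts direct resubmissions to the RPC engine */
};

/* Models the quoted ordering: the signalled check sits after the switch. */
static enum outcome seq_done_late_check(struct toy_task *t, enum seq_status st)
{
	assert(t != NULL);

	switch (st) {
	case SEQ_BADSLOT:
		/* Models "goto retry_nowait": the signalled check is skipped. */
		t->restarts++;
		return RETRY_NOWAIT;
	default:
		break;
	}
	if (t->signalled)
		return NEED_RESTART;
	return DONE;
}

/* Models the proposal: test the signalled flag before looking at the status. */
static enum outcome seq_done_early_check(struct toy_task *t, enum seq_status st)
{
	assert(t != NULL);

	if (t->signalled)
		return NEED_RESTART;
	return seq_done_late_check(t, st);
}
```

In this model, a signalled task that receives SEQ_BADSLOT is resubmitted directly by the late-check variant, while the early-check variant returns NEED_RESTART without touching the restart counter. Whether the real code path can actually be reached this way is a separate question that the model does not answer.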
When cb->cb_seq_status is set to any value other than 1, that means the
client replied successfully. RPC_SIGNALLED has nothing to do with
whether a reply is suspect; it means only that the rpc_clnt has been
asked to terminate.
The potential loop you noticed is concerning, but I haven't seen
evidence in the "echo t > sysrq-trigger" output that there is a running
RPC such as you described here.
--
Chuck Lever