NFSD callback operations block everything when clients are unresponsive

Bugspray Bot <bugbot@xxxxxxxxxx> · Fri, 13 Sep 2024 20:05:07 +0000

cel writes via Kernel.org Bugzilla:

Several reporters note that after commit c1ccfcf1a9bf ("NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down"), NFSD's callback work queue is blocked when one of the clients is unresponsive.

We know that NFSD's callback_wq is single-threaded (ordered), and that there is only one WQ for all of the NFS server's clients.

What blocks callback operations is the retry loop in nfsd4_run_cb_work(). It was added to ensure that CB_OFFLOAD operations are delivered reliably, but it causes head-of-queue blocking whenever an NFS client becomes unresponsive while a callback operation is pending.

We've partially addressed this by giving each lease its own callback_wq.
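
To make the blocking concrete, here is a small user-space model (not kernel code) of a worker that retries in place. It compares one ordered queue shared by all clients against one ordered queue per client. Everything in it is an assumption made for illustration: the names cb_item, cb_cost(), finish_times_shared() and finish_times_per_client(), the tick costs, and the bounded retry budget (the real loop may behave differently).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CLIENTS 8

/* Hypothetical model of one queued callback; not a kernel structure. */
struct cb_item {
    int client;      /* index of the client this callback targets */
    int responsive;  /* 1 if the client answers, 0 if it does not */
};

/* Arbitrary illustrative costs, in abstract "ticks". */
enum { SEND_COST = 1, RETRY_BUDGET = 5 };

/* Ticks the worker spends on one item when it retries in place. */
static int cb_cost(const struct cb_item *it)
{
    return it->responsive ? SEND_COST : SEND_COST * RETRY_BUDGET;
}

/*
 * One ordered queue shared by all clients: items complete strictly in
 * order, so done[i] includes the cost of every item queued before i.
 */
void finish_times_shared(const struct cb_item *q, size_t n, int *done)
{
    int now = 0;

    for (size_t i = 0; i < n; i++) {
        now += cb_cost(&q[i]);
        done[i] = now;
    }
}

/*
 * One ordered queue per client: an item waits only behind earlier
 * items that target the same client.
 */
void finish_times_per_client(const struct cb_item *q, size_t n, int *done)
{
    int now[MAX_CLIENTS] = {0};

    for (size_t i = 0; i < n; i++) {
        assert(q[i].client >= 0 && q[i].client < MAX_CLIENTS);
        now[q[i].client] += cb_cost(&q[i]);
        done[i] = now[q[i].client];
    }
}
```

With an unresponsive client 0 queued ahead of responsive clients 1 and 2, the shared queue delays both healthy clients by the full retry budget, while the per-client queues let them finish after a single send. The worker serving client 0 is still tied up for the whole retry budget in either case, which is why the per-client queue is only a partial fix.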

However, it's clear that retrying callback operations from within the callback WQ is going to be problematic to some extent. The solution is to hoist the responsibility for retrying higher up, into the individual implementations of the callback operations (CB_RECALL, CB_NOTIFY_LOCK, CB_OFFLOAD, and so on), since each of these operations has its own needs in terms of recourse when a callback operation cannot be sent.
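
One possible shape for that hoisting, sketched in user-space C rather than as a patch: the worker makes a single send attempt and then asks the operation what to do on failure, so it never loops. The hook table, the verdict enum, run_cb_once(), and in particular the per-operation policies shown (three attempts for CB_OFFLOAD, no retry for CB_NOTIFY_LOCK) are hypothetical and chosen only to show the structure; they are not NFSD's actual interfaces or behavior.

```c
#include <assert.h>

/* Hypothetical result of handling one send attempt. */
enum cb_verdict { CB_DONE, CB_REQUEUE_LATER, CB_GIVE_UP };

struct cb;

/* Hypothetical per-operation hook table; not the kernel's ops struct. */
struct cb_ops {
    const char *name;
    enum cb_verdict (*on_send_failure)(struct cb *cb);
};

struct cb {
    const struct cb_ops *ops;
    int attempts;   /* send attempts made so far */
};

/* Illustrative policy only: allow up to three attempts in total. */
static enum cb_verdict offload_failed(struct cb *cb)
{
    return cb->attempts < 3 ? CB_REQUEUE_LATER : CB_GIVE_UP;
}

/* Illustrative policy only: never retry. */
static enum cb_verdict notify_lock_failed(struct cb *cb)
{
    (void)cb;
    return CB_GIVE_UP;
}

const struct cb_ops offload_ops = { "CB_OFFLOAD", offload_failed };
const struct cb_ops notify_lock_ops = { "CB_NOTIFY_LOCK", notify_lock_failed };

/*
 * Worker body: exactly one attempt, no loop. send_ok stands in for the
 * result of the transport; on failure the operation decides the recourse
 * and the caller acts on the verdict (for example by requeueing with a
 * delay) instead of blocking the queue.
 */
enum cb_verdict run_cb_once(struct cb *cb, int send_ok)
{
    assert(cb && cb->ops && cb->ops->on_send_failure);
    cb->attempts++;
    if (send_ok)
        return CB_DONE;
    return cb->ops->on_send_failure(cb);
}
```

The point of the split is that the queue worker returns promptly in every case, and the decision about whether and when to try again lives with the code that knows what the operation needs.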

View: https://bugzilla.kernel.org/show_bug.cgi?id=218735#c0
You can reply to this message to join the discussion.
-- 
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)