Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck

dai.ngo@xxxxxxxxxx · Fri, 15 Dec 2023 19:18:29 -0800

On 12/15/23 5:21 PM, Chuck Lever wrote:
On Fri, Dec 15, 2023 at 01:55:20PM -0800, dai.ngo@xxxxxxxxxx wrote:
Sorry Chuck, I didn't see this before sending v2.

On 12/15/23 1:41 PM, Chuck Lever wrote:
On Fri, Dec 15, 2023 at 12:40:07PM -0800, dai.ngo@xxxxxxxxxx wrote:
On 12/15/23 11:54 AM, Chuck Lever wrote:
On Fri, Dec 15, 2023 at 11:15:03AM -0800, Dai Ngo wrote:
If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will
also stuck waiting for the callback request to be executed. This causes
the client to hang waiting for the reply of the GETATTR and also causes
the reboot of the NFS server to hang due to the pending NFS request.

Fix by replacing wait_on_bit with wait_on_bit_timeout with 20 seconds
time out.

Reported-by: Wolfgang Walter <linux-nfs@xxxxxxx>
Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: Dai Ngo <dai.ngo@xxxxxxxxxx>
---
    fs/nfsd/nfs4state.c | 6 +++++-
    fs/nfsd/state.h     | 2 ++
    2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 175f3e9f5822..0cc7d4953807 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf)
    	if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags))
    		return;
+	/* set to proper status when nfsd4_cb_getattr_done runs */
+	ncf->ncf_cb_status = NFS4ERR_IO;
+
    	refcount_inc(&dp->dl_stid.sc_count);
    	if (!nfsd4_run_cb(&ncf->ncf_getattr)) {
    		refcount_dec(&dp->dl_stid.sc_count);
@@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode,
    			nfs4_cb_getattr(&dp->dl_cb_fattr);
    			spin_unlock(&ctx->flc_lock);
-			wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE);
+			wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY,
+				TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT);
I'm still thinking the timeout here should be the same (or slightly
longer than) the RPC retransmit timeout, rather than adding a new
NFSD_CB_GETATTR_TIMEOUT macro.
The NFSD_CB_GETATTR_TIMEOUT is used only when we can not submit a
work item to the workqueue so RPC is not involved here.
In the "RPC was sent successfully" case, there is an implicit
assumption here that wait_on_bit_timeout() won't time out before the
actual RPC CB_GETATTR timeout.

You've chosen timeout values that happen to work, but there's
nothing in this patch that ties the two timeout values together or
in any other way documents this implicit assumption.
The timeout value was chosen to be greater then RPC callback receive
timeout. I can add this to the commit message.
nfsd needs to handle this case properly. A commit message will not
be sufficient.

The rpc_timeout setting that is used for callbacks is not always
9 seconds. It is adjusted based on the NFSv4 lease expiry up to a
maximum of 360 seconds, if I'm reading the code correctly (see
max_cb_time).

It would be simple enough for a server admin to set a long lease
expiry while individual CB_GETATTRs are still terminating with
EIO after just 20 seconds because of this new timeout.

To handle case where server admin sets longer lease, we can set
callback timeout to be (nfsd4_lease)/10 + 5) and add a comment
in the code to indicate the dependency to max_cb_time.


Actually, a bit wait in an nfsd thread should be a no-no. Even a
wait of tens of milliseconds is bad. Enough nfsd threads go into a
wait like this and that's a denial-of-service. That's why NFSv4
callbacks are handled on a work queue and not in an nfsd thread.

That sounds reasonable. However I see in the current code there
are multiple places the nfsd thread sleeps/waits for certain events:
nfsd_file_do_acquire, nfsd41_cb_get_slot, nfsd4_cld_tracking_init,
wait_for_concurrent_writes, etc.


Is there some way the operation that triggered the CB_GETATTR can be
deferred properly and picked up by another nfsd thread when the
CB_GETATTR completes?

We can send the CB_GETATTR as an async RPC and return NFS4ERR_DELAY
to the conflict client. When the reply of the CB_GETATTR arrives,
nfsd4_cb_getattr_done can mark the delegation to indicate the
corresponding file's attributes were updated so when the conflict
client retries the GETATTR we can return the updated attributes.

We still have to have a way to detect that the client never, or
take too long, to reply to the CB_GETATTR so that we can break
the lease.

Also, the intention of the wait_on_bit is to avoid sending the
conflict client the NFS4ERR_DELAY if everything works properly
which is the normal case.

So I think this can be done but it would make the code a bit
complicate and we loose the optimization of avoiding the
NFS4ERR_DELAY.

-Dai



We need to
time out here to prevent the client (that causes the conflict) to
hang waiting for the reply of the GETATTR and to prevent the server
reboot to hang due to a pending NFS request.
Perhaps a better approach would be to not rely on a timeout, but
instead have nfs4_cb_getattr() wake up the bit wait before
returning, when it can't queue the work. That way, wait_on_bit()
will return immediately in that case.
We can detect the condition where the work item can't be queue.
But I think we still need to use wait_on_bit_timeout since there
is no guarantee that the work will be executed even if it was
queued.
This is a basic guarantee provided by the RPC layer. Can you
enumerate what other ways this path will fail without waking the bit
wait? Are those issues going to impact other callback operations?


    			if (ncf->ncf_cb_status) {
    				status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ));
    				if (status != nfserr_jukebox ||
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index f96eaa8e9413..94563a6813a6 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -135,6 +135,8 @@ struct nfs4_cb_fattr {
    /* bits for ncf_cb_flags */
    #define	CB_GETATTR_BUSY		0
+#define	NFSD_CB_GETATTR_TIMEOUT	msecs_to_jiffies(20000) /* 20 secs */
+
    /*
     * Represents a delegation stateid. The nfs4_client holds references to these
     * and they are put when it is being destroyed or when the delegation is
--
2.39.3