Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck

dai.ngo@xxxxxxxxxx · Sat, 16 Dec 2023 14:44:59 -0800




On 12/15/23 7:57 PM, Chuck Lever wrote:
On Fri, Dec 15, 2023 at 07:18:29PM -0800, dai.ngo@xxxxxxxxxx wrote:
On 12/15/23 5:21 PM, Chuck Lever wrote:
On Fri, Dec 15, 2023 at 01:55:20PM -0800, dai.ngo@xxxxxxxxxx wrote:
Sorry Chuck, I didn't see this before sending v2.

On 12/15/23 1:41 PM, Chuck Lever wrote:
On Fri, Dec 15, 2023 at 12:40:07PM -0800, dai.ngo@xxxxxxxxxx wrote:
On 12/15/23 11:54 AM, Chuck Lever wrote:
On Fri, Dec 15, 2023 at 11:15:03AM -0800, Dai Ngo wrote:
@@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode,
     			nfs4_cb_getattr(&dp->dl_cb_fattr);
     			spin_unlock(&ctx->flc_lock);
-			wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE);
+			wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY,
+				TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT);
I'm still thinking the timeout here should be the same (or slightly
longer than) the RPC retransmit timeout, rather than adding a new
NFSD_CB_GETATTR_TIMEOUT macro.
The NFSD_CB_GETATTR_TIMEOUT is used only when we can not submit a
work item to the workqueue so RPC is not involved here.
In the "RPC was sent successfully" case, there is an implicit
assumption here that wait_on_bit_timeout() won't time out before the
actual RPC CB_GETATTR timeout.

You've chosen timeout values that happen to work, but there's
nothing in this patch that ties the two timeout values together or
in any other way documents this implicit assumption.
The timeout value was chosen to be greater then RPC callback receive
timeout. I can add this to the commit message.
nfsd needs to handle this case properly. A commit message will not
be sufficient.

The rpc_timeout setting that is used for callbacks is not always
9 seconds. It is adjusted based on the NFSv4 lease expiry up to a
maximum of 360 seconds, if I'm reading the code correctly (see
max_cb_time).

It would be simple enough for a server admin to set a long lease
expiry while individual CB_GETATTRs are still terminating with
EIO after just 20 seconds because of this new timeout.

With courteous server, what is the benefit of allowing the admin to
extend the lease? I think this option should be removed, it's just
an administrative overhead and can cause more confusion.

To handle case where server admin sets longer lease, we can set
callback timeout to be (nfsd4_lease)/10 + 5) and add a comment
in the code to indicate the dependency to max_cb_time.
Which was my initial suggestion, but now I think the only proper fix
is to replace the wait_on_bit() entirely.


Actually, a bit wait in an nfsd thread should be a no-no. Even a
wait of tens of milliseconds is bad. Enough nfsd threads go into a
wait like this and that's a denial-of-service. That's why NFSv4
callbacks are handled on a work queue and not in an nfsd thread.
That sounds reasonable. However I see in the current code there
are multiple places the nfsd thread sleeps/waits for certain events:
nfsd_file_do_acquire, nfsd41_cb_get_slot, nfsd4_cld_tracking_init,
wait_for_concurrent_writes, etc.
Yep, each of those needs to be replaced if there is a danger of the
wait becoming indefinite. We found another one recently with the
pNFS block layout type waiting for a lease breaker. So an nfsd
thread does wait on occasion, but it's almost never the right thing
to do.

Let's not add another one, especially one that can be manipulated by
(bad) client behavior.

I'm not clear how the wait for CB_GETATTR can be manipulated by a bad
client, can you elaborate?



Is there some way the operation that triggered the CB_GETATTR can be
deferred properly and picked up by another nfsd thread when the
CB_GETATTR completes?
We can send the CB_GETATTR as an async RPC and return NFS4ERR_DELAY
to the conflict client. When the reply of the CB_GETATTR arrives,
nfsd4_cb_getattr_done can mark the delegation to indicate the
corresponding file's attributes were updated so when the conflict
client retries the GETATTR we can return the updated attributes.

We still have to have a way to detect that the client never, or
take too long, to reply to the CB_GETATTR so that we can break
the lease.
As long as the RPC is SOFT, the RPC client is supposed to guarantee
that the upper layer gets a completion of some kind. If it's HARD
then it should fully interruptible by a signal or shutdown.

This is only true if the workqueue worker runs and executes the work
item successfully. If the work item is stuck at the workqueue then RPC
is not involved. NFSD must handle the case where the work item is
never executed for any reasons.


Either way, if NFSD can manage to reliably detect when the CB RPC
has not even been scheduled, then that gives us a fully dependable
guarantee.

Once the async CB RPC was passed to the RPC layer via rpc_run_task,
I can't see any sure way to know when the RPC task will be picked up
and run. Until the RPC task is run, the RPC time out mechanism is not
in play. To handle this condition, I think a timeout mechanism is
needed at the NFSD layer, any other options?



Also, the intention of the wait_on_bit is to avoid sending the
conflict client the NFS4ERR_DELAY if everything works properly
which is the normal case.

So I think this can be done but it would make the code a bit
complicate and we loose the optimization of avoiding the
NFS4ERR_DELAY.
I'm not excited about handling this via DELAY either. There's a
good chance there is another way to deal with this sanely.

I'm inclined to revert CB_GETATTR support until it is demonstrated
to be working reliably. The current mechanism has already shown to
be prone to a hard hang that blocks server shutdown, and it's not
even in the wild yet.

If I understand the problem correctly, this hard hang issue is due to
the work item being stuck at the workqueue which the current code does
not take into account.


I can add patches to nfsd-fixes to revert CB_GETATTR and let that
sit for a few days while we decide how to move forward.

The simplest solution for this particular problem is to use wait with
timeout. But if that solution does not meet your expectation then yes
reverting the CB_GETATTR, or remove the handling of GETATTR conflict
with write delegation, will definitely resolve this problem.

-Dai