Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck

On 12/18/23 11:10 AM, Chuck Lever wrote:
On Mon, Dec 18, 2023 at 10:17:49AM -0800, dai.ngo@xxxxxxxxxx wrote:
On 12/18/23 8:02 AM, Chuck Lever wrote:
On Sat, Dec 16, 2023 at 02:44:59PM -0800, dai.ngo@xxxxxxxxxx wrote:
On 12/15/23 7:57 PM, Chuck Lever wrote:
What we don't know is why the callback was lost.

- It could be that queue_work() returned false because of a bug.
    Note that there is a WARN_ON_ONCE() that fires in this case: if
    it fired several days before the hang, then we won't see any
    log messages for more recent misqueued work items.
The WARN_ON_ONCE came from nfsd_break_one_deleg, which is a delegation
recall, and not from nfs4_cb_getattr. I suspect this is because of a
possible bug in __break_lease, as in my question for Jeff above.
OK, so there's no indication at all if nfsd4_run_cb() fails when
NFSD queues CB_GETATTR? No wonder it's a silent failure.

This patch adds a WARN_ON_ONCE just in case, but I don't think this
condition will ever happen, since we already do a test_and_set_bit on the
CB_GETATTR_BUSY bit, so the same CB_GETATTR will not be submitted to the
workqueue more than once.
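
To illustrate the guard described above, here is a minimal user-space
sketch in C11, not the NFSD code itself. It models test_and_set_bit()
with atomic_fetch_or(). The struct, the helper names, and the bit number
are all hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CB_GETATTR_BUSY_BIT 0   /* hypothetical bit number for this sketch */

struct cb_state {
	atomic_ulong flags;     /* bit 0 models the CB_GETATTR_BUSY flag */
	int queued;             /* how many times the work item was queued */
};

/* User-space model of test_and_set_bit(): set the bit, return its old value. */
static bool model_test_and_set_bit(int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or(addr, mask) & mask) != 0;
}

/* User-space model of clear_bit(). */
static void model_clear_bit(int nr, atomic_ulong *addr)
{
	atomic_fetch_and(addr, ~(1UL << nr));
}

/* Returns true if this call queued the callback, false if one is in flight. */
static bool submit_cb_getattr(struct cb_state *cb)
{
	if (model_test_and_set_bit(CB_GETATTR_BUSY_BIT, &cb->flags))
		return false;   /* already busy: do not queue a second time */
	cb->queued++;           /* stands in for queueing the work item */
	return true;
}

/* Called when the callback completes, so the next one can be queued. */
static void complete_cb_getattr(struct cb_state *cb)
{
	model_clear_bit(CB_GETATTR_BUSY_BIT, &cb->flags);
}
```

Because the busy bit is set atomically before queueing, a second
submission while the first is in flight is rejected up front, which is
why a duplicate-queue warning is not expected to fire.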



- It could be that nfsd4_run_cb_work() marked the backchannel down
    but somehow did not wake up any in-flight callback requests.

Let's get more details about what's going on.


I can add patches to nfsd-fixes to revert CB_GETATTR and let that
sit for a few days while we decide how to move forward.
The simplest solution for this particular problem is to use a wait with
a timeout.
The hard hang was due to an uninterruptible wait, which has now been
reverted.

Going forward, if there's no wait, there can be no timeout. The
only approach is to handle errors properly when dispatching a
callback.
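
As a rough illustration of handling errors at dispatch time, here is a
small user-space sketch in C. It is not the NFSD implementation:
struct cb_request, dispatch_callback(), and the two queue stand-ins are
hypothetical names, and the error codes are chosen only for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical request; 'queue' stands in for handing work to a workqueue. */
struct cb_request {
	bool busy;                              /* a callback is in flight */
	bool (*queue)(struct cb_request *req);  /* returns false on failure */
};

/*
 * Dispatch a callback without waiting for it. On a queueing failure the
 * busy state is undone and an error is returned, so no caller is left
 * expecting a reply that will never arrive.
 */
static int dispatch_callback(struct cb_request *req)
{
	if (req->busy)
		return -EBUSY;          /* one is already in flight */
	req->busy = true;
	if (!req->queue(req)) {
		req->busy = false;      /* roll back on failure */
		return -EIO;            /* report instead of failing silently */
	}
	return 0;
}

/* Stand-ins for a queueing step that succeeds or fails. */
static bool queue_ok(struct cb_request *req)
{
	(void)req;
	return true;
}

static bool queue_fail(struct cb_request *req)
{
	(void)req;
	return false;
}
```

The point of the sketch is only that the failure is visible to the
caller and leaves no stale in-flight state behind.
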
Not even a 30 ms wait for a well-behaved client, the same as
nfsd_wait_for_delegreturn?
30 milliseconds is acceptable. It's very brief and can never result
in a shutdown hang. I just don't want a long timeout.
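
To show why a short bounded wait cannot hang, here is a minimal
user-space sketch in C. It polls a flag against a monotonic-clock
deadline, which is only a stand-in for whatever wait primitive the
kernel patch ends up using. wait_for_reply(), elapsed_ms(), and
CB_WAIT_TIMEOUT_MS are hypothetical names for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

#define CB_WAIT_TIMEOUT_MS 30   /* the 30 ms bound discussed in the thread */

/* Milliseconds elapsed since 'start' on the monotonic clock. */
static long long elapsed_ms(const struct timespec *start)
{
	struct timespec now;
	long long ns;

	clock_gettime(CLOCK_MONOTONIC, &now);
	ns = (long long)(now.tv_sec - start->tv_sec) * 1000000000LL +
	     (now.tv_nsec - start->tv_nsec);
	return ns / 1000000LL;
}

/*
 * Wait until *done becomes true or timeout_ms elapses.
 * Returns true if the reply arrived, false on timeout.
 */
static bool wait_for_reply(atomic_bool *done, long long timeout_ms)
{
	const struct timespec tick = { .tv_sec = 0, .tv_nsec = 1000000L };
	struct timespec start;

	clock_gettime(CLOCK_MONOTONIC, &start);
	while (!atomic_load(done)) {
		if (elapsed_ms(&start) >= timeout_ms)
			return false;   /* give up; the caller proceeds */
		nanosleep(&tick, NULL); /* sleep about 1 ms between checks */
	}
	return true;
}
```

The loop always exits once the deadline passes, so a lost reply costs at
most roughly the timeout rather than blocking shutdown.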

Thanks! I will submit a v3 patch with a timeout of 30 milliseconds.

-Dai




