Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck

Chuck Lever <chuck.lever@xxxxxxxxxx> · Mon, 18 Dec 2023 14:10:57 -0500

On Mon, Dec 18, 2023 at 10:17:49AM -0800, dai.ngo@xxxxxxxxxx wrote:
> 
> On 12/18/23 8:02 AM, Chuck Lever wrote:
> > On Sat, Dec 16, 2023 at 02:44:59PM -0800, dai.ngo@xxxxxxxxxx wrote:
> > > On 12/15/23 7:57 PM, Chuck Lever wrote:
> > What we don't know is why the callback was lost.
> > 
> > - It could be that queue_work() returned false because of a bug.
> >    Note that there is a WARN_ON_ONCE() that fires in this case: if
> >    it fired several days before the hang, then we won't see any
> >    log messages for more recent misqueued work items.
> 
> The WARN_ON_ONCE came from nfsd_break_one_deleg, which is a delegation
> recall, and not from nfs4_cb_getattr. I suspect this is caused by a
> possible bug in __break_lease; see my question for Jeff above.

OK, so there's no indication at all if nfsd4_run_cb() fails when
NFSD queues CB_GETATTR? No wonder it's a silent failure.
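
For illustration only, here is a minimal userspace sketch of what
"not silent" could look like: the dispatcher checks whether the enqueue
succeeded and reports the failure to its caller. This is not NFSD code.
struct work_item, model_queue_work() and dispatch_callback() are
hypothetical names, and the single "pending" flag only loosely models
the way queue_work() returns false when the work item is already queued.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a work item; not the kernel's work_struct. */
struct work_item {
	bool pending;	/* true while the item sits on a queue */
};

/*
 * Model of an enqueue that refuses an item that is already pending.
 * Single-threaded sketch: a real implementation needs an atomic
 * test-and-set here.
 */
static bool model_queue_work(struct work_item *w)
{
	if (w->pending)
		return false;
	w->pending = true;
	return true;
}

/*
 * Dispatch a callback and surface a failed enqueue to the caller
 * instead of dropping it silently.
 * Returns 0 on success, -EBUSY if the item could not be queued.
 */
static int dispatch_callback(struct work_item *w)
{
	if (!model_queue_work(w)) {
		fprintf(stderr, "callback not queued: work item already pending\n");
		return -EBUSY;
	}
	return 0;
}
```

With a return value like this, the caller can fail the operation or
retry rather than wait for a callback that was never queued.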

> > - It could be that nfsd4_run_cb_work() marked the backchannel down
> >    but somehow did not wake up any in-flight callback requests.
> > 
> > Let's get more details about what's going on.
> > 
> > 
> > > > I can add patches to nfsd-fixes to revert CB_GETATTR and let that
> > > > sit for a few days while we decide how to move forward.
> > > The simplest solution for this particular problem is to use wait with
> > > timeout.
> > The hard hang was due to an uninterruptible wait, which has now been
> > reverted.
> > 
> > Going forward, if there's no wait, there can be no timeout. The
> > only approach is to handle errors properly when dispatching a
> > callback.
> 
> Not even a 30ms wait for a well-behaved client, same as nfsd_wait_for_delegreturn?

30 milliseconds is acceptable. It's very brief and can never result
in a shutdown hang. I just don't want a long timeout.

-- 
Chuck Lever