On Mon, Dec 18, 2023 at 10:17:49AM -0800, dai.ngo@xxxxxxxxxx wrote:
>
> On 12/18/23 8:02 AM, Chuck Lever wrote:
> > On Sat, Dec 16, 2023 at 02:44:59PM -0800, dai.ngo@xxxxxxxxxx wrote:
> > > On 12/15/23 7:57 PM, Chuck Lever wrote:
> > What we don't know is why the callback was lost.
> >
> > - It could be that queue_work() returned false because of a bug.
> >   Note that there is a WARN_ON_ONCE() that fires in this case: if
> >   it fired several days before the hang, then we won't see any
> >   log messages for more recent misqueued work items.
>
> The WARN_ON_ONCE came from nfsd_break_one_deleg which is a delegation
> recall and not from nfs4_cb_getattr. I suspect this is because of a
> possible bug in __break_lease as question for Jeff above.

OK, so there's no indication at all if nfsd4_run_cb() fails when
NFSD queues CB_GETATTR? No wonder it's a silent failure.

> > - It could be that nfsd4_run_cb_work() marked the backchannel down
> >   but somehow did not wake up any in-flight callback requests.
> >
> > Let's get more details about what's going on.
> >
> >
> > > > I can add patches to nfsd-fixes to revert CB_GETATTR and let that
> > > > sit for a few days while we decide how to move forward.
> > > The simplest solution for this particular problem is to use wait with
> > > timeout.
> > The hard hang was due to an uninterruptible wait, which has now been
> > reverted.
> >
> > Going forward, if there's no wait, there can be no timeout. The
> > only approach is to handle errors properly when dispatching a
> > callback.
>
> not even wait for 30ms for well behave client, same as nfsd_wait_for_delegreturn?

30 milliseconds is acceptable. It's very brief and can never result
in a shutdown hang. I just don't want a long timeout.

--
Chuck Lever