Re: [External] : Re: sysfs interface to force power off

Keith Busch <kbusch@xxxxxxxxxx> · Tue, 8 Nov 2022 13:37:38 -0700

On Tue, Nov 08, 2022 at 09:16:53PM +0100, Lukas Wunner wrote:
> On Tue, Nov 08, 2022 at 09:12:44AM -0700, Keith Busch wrote:
> > On Mon, Nov 07, 2022 at 04:14:54PM -0500, James Puthukattukaran wrote:
> > > 
> > > There is a path to disable the controller and that code ran but did
> > > not help. I checked with the nvme folks and Keith mentioned that there
> > > might be an issue with the nvme queue management. Unfortunately, we
> > > can't try newer kernels in the field. So, looking for a way to just
> > > "shut off the device" when we have scenarios like this where we can't
> > > untangle the mess. 
> > 
> > Well, I didn't request you try new kernels in the field. I asked if you
> > could experiment with a newer one on a development machine to confirm if
> > the bug was fixed by some of the significant changes in this path so
> > that we could confirm a reason to port to stable. You're going to have
> > to change your kernel to fix this observation, so it would be worth the
> > effort to know if the changes being considered actually address the
> > problem.
> 
> Current mainline still contains this problematic sequence:
> 
>   nvme_reset_work()
>     nvme_wait_freeze()
>       blk_mq_freeze_queue_wait()
> 
> So I'm inclined to believe that the issue still persists, but I agree

Yeah, that sequence exists, but there are some subtle changes with how
the workqueues account for unquiescing hardware queues that can affect
how a freeze can make forward progress.

> I think nvme_reset_work() is overly optimistic that resetting the drive
> succeeded.  It just freezes and unfreezes the I/O queue without checking
> for errors.

I'm not sure what you mean. An nvme reset is a CC.EN 0->1 transition,
and we definitely confirm that succeeds.
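
To illustrate what "confirm that succeeds" means here, a minimal userspace
sketch follows: set CC.EN, then poll CSTS.RDY until it follows or a poll
budget runs out. This is not the driver code. struct fake_ctrl,
fake_read_csts(), fake_wait_ready(), fake_enable() and the poll-count
timeout are all hypothetical stand-ins; only the CC.EN and CSTS.RDY bit
positions (bit 0 of each register) come from the NVMe spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NVME_CC_ENABLE  (1u << 0)   /* CC.EN, bit 0 of the CC register */
#define NVME_CSTS_RDY   (1u << 0)   /* CSTS.RDY, bit 0 of the CSTS register */

/* Hypothetical simulated controller, not a kernel structure. */
struct fake_ctrl {
	uint32_t cc;
	uint32_t csts;
	int ready_after;   /* polls before RDY follows EN; negative = never */
};

/* Simulated register read: RDY catches up with EN after ready_after polls. */
static uint32_t fake_read_csts(struct fake_ctrl *c)
{
	bool en = c->cc & NVME_CC_ENABLE;
	bool rdy = c->csts & NVME_CSTS_RDY;

	if (en != rdy && c->ready_after >= 0) {
		if (c->ready_after == 0) {
			if (en)
				c->csts |= NVME_CSTS_RDY;
			else
				c->csts &= ~NVME_CSTS_RDY;
		} else {
			c->ready_after--;
		}
	}
	return c->csts;
}

/* Poll until CSTS.RDY equals want; 0 on success, -1 if the budget runs out. */
static int fake_wait_ready(struct fake_ctrl *c, bool want, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		bool rdy = fake_read_csts(c) & NVME_CSTS_RDY;

		if (rdy == want)
			return 0;
	}
	return -1;
}

/* The 0->1 transition: set CC.EN, then confirm the device reports ready. */
static int fake_enable(struct fake_ctrl *c, int max_polls)
{
	c->cc |= NVME_CC_ENABLE;
	return fake_wait_ready(c, true, max_polls);
}
```

With ready_after = 2 and a budget of 5 polls, fake_enable() returns 0; with
ready_after = -1 (a device that never comes ready) it returns -1, which is
the failure the reset path has to notice.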

If you're referring to the 1->0 transition, that has to happen after the
initial freeze/quiesce steps, but whether or not that succeeds shouldn't
be relevant to the rest of the sequence: we're about to disable the
device at the PCI level.

> In particular, nvme_wait_freeze() should call the _timeout variant of
> blk_mq_freeze_queue_wait() and cope with failure of freezing.

That would indicate we have a mismatched freeze depth or an unbalanced
quiesce problem, so the timeout freeze would just mask the underlying
issue.
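
A toy userspace model of that argument, assuming a queue reduced to a
freeze depth, a quiesce flag and an in-flight count. None of this is
blk-mq code: struct toy_queue and every toy_* function are hypothetical,
and "ticks" stand in for time. It only shows that a timed wait returns an
error while the unbalanced state that caused it is still there.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy queue, not struct request_queue. */
struct toy_queue {
	int freeze_depth;   /* nested freeze count; I/O enters only at 0 */
	int inflight;       /* requests that must drain before a freeze completes */
	bool quiesced;      /* a quiesced queue dispatches nothing */
};

static void toy_start_freeze(struct toy_queue *q)
{
	q->freeze_depth++;
}

static void toy_unfreeze(struct toy_queue *q)
{
	if (q->freeze_depth > 0)
		q->freeze_depth--;
}

static bool toy_accepts_io(const struct toy_queue *q)
{
	return q->freeze_depth == 0;
}

/* One tick: a non-quiesced queue completes one in-flight request. */
static void toy_tick(struct toy_queue *q)
{
	if (!q->quiesced && q->inflight > 0)
		q->inflight--;
}

/* Timed wait for drain: 0 once drained, -1 if max_ticks elapse first. */
static int toy_freeze_wait_timeout(struct toy_queue *q, int max_ticks)
{
	for (int i = 0; i < max_ticks && q->inflight > 0; i++)
		toy_tick(q);
	return q->inflight == 0 ? 0 : -1;
}
```

In this model, a queue left quiesced with three requests in flight times
out and still has three requests in flight afterwards; the wait only
completes once the missing unquiesce happens. Likewise, two
toy_start_freeze() calls followed by a single toy_unfreeze() leave
toy_accepts_io() false no matter how the wait returned.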