Re: [PATCHv3 2/2] nvme: cancel requests for real

Ming Lei <tom.leiming@xxxxxxxxx> · Sat, 30 May 2020 06:23:08 +0800

On Fri, May 29, 2020 at 9:22 PM Keith Busch <kbusch@xxxxxxxxxx> wrote:
>
> On Fri, May 29, 2020 at 11:39:46AM +0800, Ming Lei wrote:
> > On Fri, May 29, 2020 at 4:19 AM Alan Adamson <alan.adamson@xxxxxxxxxx> wrote:
> > That said NVMe's error handling becomes better after applying the
> > patches of '[PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler'.
>
> I think that's a bit debatable. Alan is synthesizing a truly broken
> controller. The current code will abandon this controller after about 30

I'm not sure it can be considered a truly broken controller. While we
wait in nvme_wait_freeze() during reset, the controller is already back
in a normal state. A timeout can still be triggered by any occasional
event, just like any other timeout, and that isn't special enough to
justify killing the controller.

> seconds. Your series will reset that broken controller indefinitely.
> Which of those options is better?

Removing the controller is very bad, because it basically becomes a
brick, and data is lost along with it. We should retry enough times
before killing the controller.

My series doesn't reset indefinitely, since every request is retried
only a limited number of times (the default is 5). At the very least,
IO requests should get the chance to be retried the claimed number of
times.
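
To illustrate the bounded-retry point, here is a minimal sketch in plain
C. It is not the driver code: MAX_RETRIES, struct io_req and both helper
names are made up for illustration, and only the default of 5 comes from
the text above. A request that keeps failing is retried only until its
count reaches the limit, so it cannot keep the reset loop going forever.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the retry limit (default 5, as stated above). */
#define MAX_RETRIES 5

/* Hypothetical per-request state; only the retry counter matters here. */
struct io_req {
	unsigned int retries;	/* times this request has been retried */
};

/* A request may be retried only while it is below the limit. */
static bool req_may_retry(const struct io_req *req)
{
	return req->retries < MAX_RETRIES;
}

/*
 * Model a request that fails on every attempt: retry it until the limit
 * is hit, then give up. Returns the number of retries actually spent.
 */
static unsigned int simulate_always_failing(struct io_req *req)
{
	while (req_may_retry(req))
		req->retries++;
	return req->retries;
}
```

With these assumptions, an always-failing request is given up on after
exactly MAX_RETRIES retries, which is the bound I am referring to.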

>
> I think degrading to an admin-only mode at some point would be preferable.

If the timeout event is occasional, this approach gives up too early
and doesn't retry the claimed number of times, and then people may
complain about data loss.

Thanks,
Ming Lei