On Thu, Jun 22, 2023 at 09:19:19AM -0600, Keith Busch wrote:
> On Thu, Jun 22, 2023 at 10:53:05PM +0800, Ming Lei wrote:
> > On Thu, Jun 22, 2023 at 08:35:49AM -0600, Keith Busch wrote:
> > > On Thu, Jun 22, 2023 at 09:51:12PM +0800, Ming Lei wrote:
> > > > On Wed, Jun 21, 2023 at 09:48:49AM -0600, Keith Busch wrote:
> > > > > The point was to contain requests from entering while the
> > > > > hctxs are being reconfigured. If you're going to pair up the
> > > > > freezes as you've suggested, we might as well just not call
> > > > > freeze at all.
> > > >
> > > > blk_mq_update_nr_hw_queues() requires the queue to be frozen.
> > >
> > > It's too late at that point. Let's work through a real example.
> > > You'll need a system that has more CPUs than your nvme has IO
> > > queues.
> > >
> > > Boot without any special nvme parameters. Every possible nvme IO
> > > queue will be assigned the "default" hctx type. Now start IO to
> > > every queue, then run:
> > >
> > >   # echo 8 > /sys/module/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller
> > >
> > > Today, we freeze prior to tearing down the "default" IO queues,
> > > so nothing enters them while the driver reconfigures the queues.
> >
> > nvme_start_freeze() just prevents new IO from being queued; old
> > requests may still be entering the block layer queue. What matters
> > here is actually quiesce, which prevents new IO from being queued
> > to the driver/hardware.
> >
> > > What you're suggesting will allow IO to queue up in a quiesced
> > > "default" queue, which will become "polled" without an interrupt
> > > handler on the other side of the reset. The application doesn't
> > > know that, so the IO you're allowing to queue up will time out.
> >
> > A timeout only happens after the request has been queued to the
> > driver/hardware, i.e. after blk_mq_start_request() is called in
> > nvme_queue_rq(). Quiesce prevents new IO from being dispatched to
> > the driver or queued via .queue_rq(), and meanwhile the old
> > requests have been canceled, so no request can time out.
>
> Quiesce doesn't prevent requests from entering an hctx, and you can't
> back it out to put on another hctx later. It doesn't matter that you
> haven't dispatched it to hardware yet. The request's queue was set the
> moment it was allocated, so after you unquiesce and freeze for the new
> queue mapping, the requests previously blocked on quiesce will time out
> in the scenario I've described.
>
> There are certainly gaps in the existing code where error'ed requests
> can be requeued or stuck elsewhere and hit the exact same problem, but
> the current way at least tries to contain it.

Yeah, but you can't remove that gap at all with start_freeze; that is
to say, the current code already has to live with a new queue mapping
coexisting with old requests that were allocated under the old mapping.

Actually I considered handling this kind of situation before. One
approach is to reuse the bio-steal logic taken from nvme mpath:

1) for FS IO, re-submit the bios and free the request

2) for a PT (passthrough) request, simply fail it

It could be a bit violent for 2), even though REQ_FAILFAST_DRIVER is
always set for PT requests, but I don't see any better approach for
handling PT requests. Something along the lines of the sketch below
(untested, not against any particular tree; the helper name is made
up, while blk_steal_bios() and blk_mq_end_request() are the existing
APIs the mpath failover path already uses):
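
/*
 * Rough sketch: steal the bios from a request that was allocated
 * under the old queue mapping so they can be resubmitted under the
 * new one, and fail PT requests outright since their caller context
 * can't be remapped.
 */
static void blk_mq_resubmit_or_fail_rq(struct request *rq,
				       struct bio_list *requeue_list)
{
	if (blk_rq_is_passthrough(rq)) {
		/* 2) PT request: simply fail it */
		blk_mq_end_request(rq, BLK_STS_IOERR);
		return;
	}

	/*
	 * 1) FS IO: move the bios onto a requeue list and free the
	 * request; the caller resubmits them (e.g. from a work item
	 * via submit_bio_noacct()) once the new mapping is live,
	 * similar to what nvme_failover_req() does.
	 */
	blk_steal_bios(requeue_list, rq);
	blk_mq_end_request(rq, BLK_STS_OK);
}

Thanks,
Ming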