Re: [PATCH V2 0/4] nvme: fix two kinds of IO hang from removing NSs

Sagi Grimberg <sagi@xxxxxxxxxxx> · Sun, 25 Jun 2023 11:09:45 +0300

The point was to contain requests from entering while the hctx's are
being reconfigured. If you're going to pair up the freezes as you've
suggested, we might as well just not call freeze at all.

blk_mq_update_nr_hw_queues() requires queue to be frozen.

It's too late at that point. Let's work through a real example. You'll
need a system that has more CPU's than your nvme has IO queues.

Boot without any special nvme parameters. Every possible nvme IO queue
will be assigned "default" hctx type. Now start IO to every queue, then
run:

   # echo 8 > /sys/modules/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller

Today, we freeze prior to tearing down the "default" IO queues, so
there's nothing entered into them while the driver reconfigures the
queues.

nvme_start_freeze() just prevents new IO from being queued, and old ones
may still be entering block layer queue, and what matters here is
actually quiesce, which prevents new IO from being queued to
driver/hardware.

What you're suggesting will allow IO to queue up in a queisced "default"
queue, which will become "polled" without an interrupt hanlder on the
other side of the reset. The application doesn't know that, so the IO
you're allowing to queue up will time out.

time out only happens after the request is queued to driver/hardware, or after
blk_mq_start_request() is called in nvme_queue_rq(), but quiesce actually
prevents new IOs from being dispatched to driver or be queued via .queue_rq(),
meantime old requests have been canceled, so no any request can be
timed out.

Quiesce doesn't prevent requests from entering an hctx, and you can't
back it out to put on another hctx later. It doesn't matter that you
haven't dispatched it to hardware yet. The request's queue was set the
moment it was allocated, so after you unquiesce and freeze for the new
queue mapping, the requests previously blocked on quiesce will time out
in the scenario I've described.

There are certainly gaps in the existing code where error'ed requests
can be requeued or stuck elsewhere and hit the exact same problem, but
the current way at least tries to contain it.

Yeah, but you can't remove the gap at all with start_freeze, that said
the current code has to live with the situation of new mapping change
and old request with old mapping.

Actually I considered to handle this kind of situation before, one approach
is to reuse the bio steal logic taken in nvme mpath:

1) for FS IO, re-submit bios, meantime free request

2) for PT request, simply fail it

It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
always set for PT request, but not see any better approach for handling
PT request.

Ming,

I suggest to submit patches for tcp/rdma and continue the discussion on
the pci driver.