On Thu, Jun 22, 2023 at 09:19:19AM -0600, Keith Busch wrote:
> On Thu, Jun 22, 2023 at 10:53:05PM +0800, Ming Lei wrote:
> > On Thu, Jun 22, 2023 at 08:35:49AM -0600, Keith Busch wrote:
> > > On Thu, Jun 22, 2023 at 09:51:12PM +0800, Ming Lei wrote:
> > > > On Wed, Jun 21, 2023 at 09:48:49AM -0600, Keith Busch wrote:
> > > > > The point was to contain requests from entering while the
> > > > > hctxs are being reconfigured. If you're going to pair up the
> > > > > freezes as you've suggested, we might as well just not call
> > > > > freeze at all.
> > > >
> > > > blk_mq_update_nr_hw_queues() requires the queue to be frozen.
> > >
> > > It's too late at that point. Let's work through a real example.
> > > You'll need a system that has more CPUs than your nvme has IO
> > > queues.
> > >
> > > Boot without any special nvme parameters. Every possible nvme IO
> > > queue will be assigned the "default" hctx type. Now start IO to
> > > every queue, then run:
> > >
> > >   # echo 8 > /sys/module/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller
> > >
> > > Today, we freeze prior to tearing down the "default" IO queues,
> > > so nothing enters them while the driver reconfigures the queues.
> >
> > nvme_start_freeze() just prevents new IO from being queued; old
> > requests may still be entering the block layer queue. What matters
> > here is actually quiesce, which prevents new IO from being queued
> > to the driver/hardware.
> >
> > > What you're suggesting will allow IO to queue up in a quiesced
> > > "default" queue, which will become "polled" without an interrupt
> > > handler on the other side of the reset. The application doesn't
> > > know that, so the IO you're allowing to queue up will time out.
> >
> > A timeout only happens after the request has been queued to the
> > driver/hardware, i.e. after blk_mq_start_request() is called in
> > nvme_queue_rq(). Quiesce prevents new IO from being dispatched to
> > the driver or queued via .queue_rq(), and meanwhile the old
> > requests have been canceled, so no request can time out.
>
> Quiesce doesn't prevent requests from entering an hctx, and you can't
> back it out to put on another hctx later. It doesn't matter that you
> haven't dispatched it to hardware yet. The request's queue was set the
> moment it was allocated, so after you unquiesce and freeze for the new
> queue mapping, the requests previously blocked on quiesce will time out
> in the scenario I've described.
>
> There are certainly gaps in the existing code where error'ed requests
> can be requeued or stuck elsewhere and hit the exact same problem, but
> the current way at least tries to contain it.

Yeah, but you can't remove that gap at all with start_freeze; that is
to say, the current code already has to live with a new queue mapping
coexisting with old requests that were allocated under the old mapping.

Actually I considered handling this kind of situation before. One
approach is to reuse the bio-steal logic taken from nvme mpath:

1) for FS IO, re-submit the bios and free the request

2) for a PT (passthrough) request, simply fail it

It could be a bit violent for 2), even though REQ_FAILFAST_DRIVER is
always set for PT requests, but I don't see any better approach for
handling PT requests. Something along the lines of the sketch below
(untested, not against any particular tree; the helper name is made
up, while blk_steal_bios() and blk_mq_end_request() are the existing
APIs the mpath failover path already uses):
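
/*
 * Rough sketch: steal the bios from a request that was allocated
 * under the old queue mapping so they can be resubmitted under the
 * new one, and fail PT requests outright since their caller context
 * can't be remapped.
 */
static void blk_mq_resubmit_or_fail_rq(struct request *rq,
				       struct bio_list *requeue_list)
{
	if (blk_rq_is_passthrough(rq)) {
		/* 2) PT request: simply fail it */
		blk_mq_end_request(rq, BLK_STS_IOERR);
		return;
	}

	/*
	 * 1) FS IO: move the bios onto a requeue list and free the
	 * request; the caller resubmits them (e.g. from a work item
	 * via submit_bio_noacct()) once the new mapping is live,
	 * similar to what nvme_failover_req() does.
	 */
	blk_steal_bios(requeue_list, rq);
	blk_mq_end_request(rq, BLK_STS_OK);
}

Thanks,
Ming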