On Thu, Sep 22, 2022 at 08:25:17AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 20, 2022 at 10:17:24AM +0800, Ming Lei wrote:
> > For avoiding to trigger io timeout when one hctx becomes inactive, we
> > drain IOs when all CPUs of one hctx are offline. However, driver's
> > timeout handler may require cpus_read_lock, such as nvme-pci,
> > pci_alloc_irq_vectors_affinity() is called in nvme-pci reset context,
> > and irq_build_affinity_masks() needs cpus_read_lock().
> > 
> > Meantime when blk-mq's cpuhp offline handler is called, cpus_write_lock
> > is held, so deadlock is caused.
> > 
> > Fixes the issue by breaking the wait loop if enough long time elapses,
> > and these in-flight not drained IO still can be handled by timeout
> > handler.
> 
> I'm not sure that this actually is a good idea on its own, and it kinda
> defeats the cpu hotplug processing.
> 
> So if I understand your log above correctly the problem is that
> we have commands that would time out, and we exacalate to a
> controller reset that is racing with the CPU unplug.

Yes.

blk_mq_hctx_notify_offline() waits for the in-flight requests while
cpus_write_lock() is held, since it runs in the cpuhp code path.

Meanwhile the nvme reset grabs dev->shutdown_lock and then calls
pci_alloc_irq_vectors_affinity() -> irq_build_affinity_masks(), which
waits for cpus_read_lock().

Meanwhile nvme_dev_disable() can't move on to handle any IO timeout
because dev->shutdown_lock is held by the nvme reset, so the in-flight
IO can't be drained by blk_mq_hctx_notify_offline().

That is a real IO deadlock between cpuhp and nvme reset.

Thanks,
Ming
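
For illustration, a minimal sketch of the bounded drain wait that the
patch description refers to (hypothetical code, not the posted patch;
it follows the existing 5ms polling loop in blk_mq_hctx_notify_offline()
in block/blk-mq.c, and the 30s cap is an assumed value):

	/*
	 * Hypothetical sketch: bound the drain wait so requests that
	 * cannot complete (because the driver's timeout handling is
	 * blocked behind cpus_write_lock) are left to the timeout
	 * handler instead of hanging the cpuhp offline callback.
	 */
	unsigned long deadline = jiffies + 30 * HZ;	/* assumed cap */

	while (blk_mq_hctx_has_requests(hctx)) {
		if (time_after(jiffies, deadline)) {
			/* give up draining; remaining requests are
			 * handled by the driver's timeout handler */
			break;
		}
		msleep(5);
	}

Any requests still in flight once the cap expires would then be
completed or failed by the driver's timeout handler after the reset can
make progress again.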