On Mon, 20 Nov 2023 12:00:59 -0800 Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Mon, 20 Nov 2023 16:35:59 +0800 Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>
> > group_cpus_evenly() could be part of storage driver's error handler,
> > such as nvme driver, when may happen during CPU hotplug, in which
> > storage queue has to drain its pending IOs because all CPUs associated
> > with the queue are offline and the queue is becoming inactive. And
> > handling IO needs error handler to provide forward progress.
> >
> > Then dead lock is caused:
> >
> > 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
> > handler is waiting for inflight IO
> >
> > 2) error handler is waiting for CPU hotplug lock
> >
> > 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because
> > error handling can't provide forward progress.
> >
> > Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
> > in which two stage spreads are taken: 1) the 1st stage is over all present
> > CPUs; 2) the end stage is over all other CPUs.
> >
> > Turns out the two stage spread just needs consistent 'cpu_present_mask', and
> > remove the CPU hotplug lock by storing it into one local cache. This way
> > doesn't change correctness, because all CPUs are still covered.
>
> I'm not sure what is the intended merge path for this, but I can do lib/.
>
> Do you think that a -stable backport is needed? It sounds that way.
>
> If so, are we able to identify a suitable Fixes: target? That would
> predate f7b3ea8cf72f3 ("genirq/affinity: Move group_cpus_evenly() into
> lib/"). No?

I think this predates 428e211641ed8 ("genirq/affinity: Replace
deprecated CPU-hotplug functions.") also.

I'll slap a cc:stable on it and I'll let you and the -stable
maintainers figure it out.