Hi,

On 2023/8/18 09:52, Ming Lei wrote:
> group_cpus_evenly() could be part of storage driver's error handler,
> such as nvme driver, when may happen during CPU hotplug, in which
> storage queue has to drain its pending IOs because all CPUs associated
> with the queue are offline and the queue is becoming inactive. And
> handling IO needs error handler to provide forward progress.
>
> Then dead lock is caused:
>
> 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
> handler is waiting for inflight IO
>
> 2) error handler is waiting for CPU hotplug lock
>
> 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because
> error handling can't provide forward progress.
>
> Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
> in which two stage spreads are taken: 1) the 1st stage is over all present
> CPUs; 2) the end stage is over all other CPUs.
>
> Turns out the two stage spread just needs consistent 'cpu_present_mask', and
> remove the CPU hotplug lock by storing it into one local cache. This way
> doesn't change correctness, because all CPUs are still covered.
>
> Cc: Keith Busch <kbusch@xxxxxxxxxx>
> Cc: linux-nvme@xxxxxxxxxxxxxxxxxxx
> Cc: linux-block@xxxxxxxxxxxxxxx
> Reported-by: Yi Zhang <yi.zhang@xxxxxxxxxx>
> Reported-by: Guangwu Zhang <guazhang@xxxxxxxxxx>
> Tested-by: Guangwu Zhang <guazhang@xxxxxxxxxx>
> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> ---
> V2:
> 	- fix "Cc: block list"
> 	- add tested-by tag
>
>  lib/group_cpus.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index aa3f6815bb12..15006e79196f 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -348,6 +348,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  {
>  	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
>  	cpumask_var_t *node_to_cpumask;
> +	cpumask_var_t local_cpu_present_mask;
>  	cpumask_var_t nmsk, npresmsk;
>  	int ret = -ENOMEM;
>  	struct cpumask *masks = NULL;
> @@ -355,6 +356,16 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
>  		return NULL;
>
> +	if (!zalloc_cpumask_var(&local_cpu_present_mask, GFP_KERNEL))
> +		goto fail_local_pres_mask;
> +
> +	/*
> +	 * Make a local cache of 'cpu_present_mask', so the two stages
> +	 * spread can observe consistent 'cpu_present_mask' without holding
> +	 * cpu hotplug lock.
> +	 */
> +	cpumask_copy(local_cpu_present_mask, cpu_present_mask);
> +

Maybe we can reuse npresmsk instead of allocating another cpumask?

In the first stage:  npresmsk = cpu_present_mask
In the second stage: npresmsk = cpu_possible_mask & ~npresmsk
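
Roughly something like the below: snapshot cpu_present_mask into npresmsk
once, then flip it in place for the second stage. Just an untested sketch
on top of your patch (I may be missing something), everything else left
as-is:

	/* snapshot cpu_present_mask so both stages see a consistent view */
	cpumask_copy(npresmsk, cpu_present_mask);

	build_node_to_cpumask(node_to_cpumask);

	/* grouping present CPUs first */
	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
				  npresmsk, nmsk, masks);
	if (ret < 0)
		goto fail_build_affinity;
	nr_present = ret;

	...

	/* npresmsk now flips to the possible-but-not-present CPUs */
	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
				  npresmsk, nmsk, masks);

That way neither the extra allocation nor the new fail_local_pres_mask
label would be needed.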
>  	if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
>  		goto fail_nmsk;
>
> @@ -366,13 +377,11 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  	if (!masks)
>  		goto fail_node_to_cpumask;
>
> -	/* Stabilize the cpumasks */
> -	cpus_read_lock();
>  	build_node_to_cpumask(node_to_cpumask);
>
>  	/* grouping present CPUs first */
>  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
> -				  cpu_present_mask, nmsk, masks);
> +				  local_cpu_present_mask, nmsk, masks);
>  	if (ret < 0)
>  		goto fail_build_affinity;
>  	nr_present = ret;
> @@ -387,15 +396,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  		curgrp = 0;
>  	else
>  		curgrp = nr_present;
> -	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
> +	cpumask_andnot(npresmsk, cpu_possible_mask, local_cpu_present_mask);
>  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
>  				  npresmsk, nmsk, masks);
>  	if (ret >= 0)
>  		nr_others = ret;
>
>  fail_build_affinity:
> -	cpus_read_unlock();
> -
>  	if (ret >= 0)
>  		WARN_ON(nr_present + nr_others < numgrps);

This fail_build_affinity tag seems unneeded anymore.

The patch looks good to me:

Reviewed-by: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>

Thanks.

>
> @@ -406,6 +413,9 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  	free_cpumask_var(npresmsk);
>
>  fail_nmsk:
> +	free_cpumask_var(local_cpu_present_mask);
> +
> + fail_local_pres_mask:
>  	free_cpumask_var(nmsk);
>  	if (ret < 0) {
>  		kfree(masks);