The logic in blk_mq_complete_need_ipi() assumes SMP systems where all CPUs have equal capacities and only LLC cache can make a different on perceived performance. But this assumption falls apart on HMP systems where LLC is shared, but the CPUs have different capacities. Staying local then can have a big performance impact if the IO request was done from a CPU with higher capacity but the interrupt is serviced on a lower capacity CPU. Introduce new cpus_gte_capacity() function to enable do the additional check. Without the patch I see the BLOCK softirq always running on little cores (where the hardirq is serviced). With it I can see it running on all cores. This was noticed after the topology change [1] where now on a big.LITTLE we truly get that the LLC is shared between all cores where as in the past it was being misrepresented for historical reasons. The logic exposed a missing dependency on capacities for such systems where there can be a big performance difference between the CPUs. This of course introduced a noticeable change in behavior depending on how the topology is presented. Leading to regressions in some workloads as the performance of the BLOCK softirq on littles can be noticeably worse. [1] https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf Signed-off-by: Qais Yousef (Google) <qyousef@xxxxxxxxxxx> --- block/blk-mq.c | 5 +++-- include/linux/sched/topology.h | 6 ++++++ kernel/sched/core.c | 8 ++++++++ 3 files changed, 17 insertions(+), 2 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index ac18f802c027..9b2d278a7ae7 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1163,10 +1163,11 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq) if (force_irqthreads()) return false; - /* same CPU or cache domain? Complete locally */ + /* same CPU or cache domain and capacity? Complete locally */ if (cpu == rq->mq_ctx->cpu || (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) && - cpus_share_cache(cpu, rq->mq_ctx->cpu))) + cpus_share_cache(cpu, rq->mq_ctx->cpu) && + cpus_gte_capacity(cpu, rq->mq_ctx->cpu))) return false; /* don't try to IPI to an offline CPU */ diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index a6e04b4a21d7..31cef5780ba4 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -176,6 +176,7 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[], cpumask_var_t *alloc_sched_domains(unsigned int ndoms); void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms); +bool cpus_gte_capacity(int this_cpu, int that_cpu); bool cpus_share_cache(int this_cpu, int that_cpu); bool cpus_share_resources(int this_cpu, int that_cpu); @@ -226,6 +227,11 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[], { } +static inline bool cpus_gte_capacity(int this_cpu, int that_cpu) +{ + return true; +} + static inline bool cpus_share_cache(int this_cpu, int that_cpu) { return true; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index db4be4921e7f..db5ab4b3cee7 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3954,6 +3954,14 @@ void wake_up_if_idle(int cpu) } } +bool cpus_gte_capacity(int this_cpu, int that_cpu) +{ + if (this_cpu == that_cpu) + return true; + + return arch_scale_cpu_capacity(this_cpu) >= arch_scale_cpu_capacity(that_cpu); +} + bool cpus_share_cache(int this_cpu, int that_cpu) { if (this_cpu == that_cpu) -- 2.34.1