With some CPU numbering schemes, the function select_idle_cpu() currently has a subtle bias to return the first hyper-thread in a core. As a result work is not evenly balanced across hyper-threads in a core. The difference is often as much as 15 to 20 percentage points -- i.e., the first hyper-thread might run at 45% load while the second hyper-thread runs at only 30% load or less. Two likely CPU numbering schemes make sense with today's typical case of two hyper-threads per core: A. Enumerate all the first hyper-theads in a core, then all the second hyper-threads in a core. If a system has 8 cores with 16 hyper-threads, CPUs #0 and #8 are in the same core, as are CPUs #1 and #9, etc. B. Enumerate all hyper-threads in a core, then all hyper-threads in the next core, etc. Again with 8 cores and 16 hyper-threads, CPUs #0 and #1 are in the same core, as are CPUs #2 and #3, etc. Scheme A is used in most ACPI bare metal systems and in VMs running on KVM. The enumeration order is determined by the order of the processor entries in the ACPI MADT, and the ACPI spec *recommends* that the MADT be set up for scheme A. However, for reasons that pre-date the ACPI recommendation, Hyper-V guests have an ACPI MADT that is set up for scheme B. When using scheme B, the subtle bias is evident in select_idle_cpu(). While having Hyper-V conform to the ACPI spec recommendation would solve the Hyper-V problem, it is also desirable for the fair scheduler code to be independent of the CPU numbering scheme. ACPI is not always the firmware configuration mechanism, and CPU numbering schemes might vary more in architectures other than x86/x64. The bias occurs with scheme B when "has_idle_cpu" is true and select_idle_core() is called in the for_each_cpu_wrap() loop. Regardless of where the loop starts, it will almost immediately encountered a CPU that is the first hyper-thread in a core. If that core is idle, the CPU number of that first hyper-thread is returned. If that core is not idle, both hyper-threads are removed from the cpus mask, and the loop iterates to choose another CPU that is the first hyper-thread in a core. As a result, select_idle_core() almost always returns the first hyper-thread in a core. The bias does not occur with scheme A because half of the CPU numbering space is a series of CPUs that are the second hyper-thread in all the cores. Assuming that the "target" CPU is evenly distributed throughout the CPU numbering space, there's a 50/50 chance of starting in the portion of the CPU numbering space that is all second hyper-threads. If select_idle_core() finds a idle core, it will likely return a CPU that is the second hyper-thread in the core. On average over the time, both the first and second hyper-thread are equally likely to be returned. Fix this bias by determining which hyper-thread in a core the "target" CPU is -- i.e., the "smt index" of that CPU. Then when select_idle_core() finds an idle core, it returns the CPU in the core that has the same smt index. If that CPU is not valid to be chosen, just return the CPU that was passed into select_idle_core() and don't worry about bias. With scheme B, this fix solves the bias problem by making the chosen CPU be roughly equally likely to either hyper-thread. With scheme A there's no real effect as the chosen CPU was already equally likely to be either hyper-thread, and still is. The imbalance in hyper-thread loading was originally observed in a customer workload, and then reproduced with a synthetic workload. The change has been tested with the synthetic workload in a Hyper-V VM running the normal scheme B CPU numbering, and then with the MADT replaced with a scheme A version using Linux's ability to override ACPI tables. The testing showed no change hyper-thread loading balance with the scheme A CPU numbering, but the imbalance is corrected if the CPU numbering is scheme B. Signed-off-by: Michael Kelley <mikelley@xxxxxxxxxxxxx> --- I haven't previously worked in Linux scheduler code, so I'm posting this as an RFC to point out the observed problem, and to suggest a solution. There may well be considerations in the design of a solution that I'm not aware of, so please educate me or suggest an alternative. It's also not completely clear whether an imbalance in hyper-thread loading is actually a problem. It looks weird, and causes customer concern when it is observed consistently across all cores in some production workload. The fair scheduler strives to balance load evenly, so I'm treating it as a problem that should be fixed, if for no other reason than general goodness. But again, I'm sure reviewers will feel free to tell me otherwise. :-) The fix takes relatively few CPU cycles, but it's still a non-zero cost. FWIW, the same imbalance has been observed with kernels as far back as 5.4, and the root cause in the code is essentially the same. So it's not a recently introduced issue. I haven't tried anything earlier than 5.4. kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++------ 1 file changed, 30 insertions(+), 6 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 373ff5f..8b56e9d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6832,6 +6832,19 @@ static inline bool test_idle_cores(int cpu) return false; } +static inline int get_smt_index(int core) +{ + int cpu, n = 0; + + for_each_cpu(cpu, cpu_smt_mask(core)) { + if (cpu == core) + return n; + n++; + } + /* If get here, cpu_smt_mask is set up incorrectly */ + return 0; +} + /* * Scans the local SMT mask to see if the entire core is idle, and records this * information in sd_llc_shared->has_idle_cores. @@ -6866,10 +6879,11 @@ void __update_idle_core(struct rq *rq) * there are no idle cores left in the system; tracked through * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above. */ -static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu) +static int select_idle_core(struct task_struct *p, int core, int smt_index, + struct cpumask *cpus, int *idle_cpu) { bool idle = true; - int cpu; + int cpu, index_cpu, n = 0; for_each_cpu(cpu, cpu_smt_mask(core)) { if (!available_idle_cpu(cpu)) { @@ -6885,10 +6899,13 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu } if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr)) *idle_cpu = cpu; + + if (n++ == smt_index) + index_cpu = cpu; } if (idle) - return core; + return cpumask_test_cpu(index_cpu, p->cpus_ptr) ? index_cpu : core; cpumask_andnot(cpus, cpus, cpu_smt_mask(core)); return -1; @@ -6922,7 +6939,13 @@ static inline bool test_idle_cores(int cpu) return false; } -static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu) +static inline int get_smt_index(int core) +{ + return 0; +} + +static inline int select_idle_core(struct task_struct *p, int core, int smt_index, + struct cpumask *cpus, int *idle_cpu) { return __select_idle_cpu(core, p); } @@ -6942,7 +6965,7 @@ static inline int select_idle_smt(struct task_struct *p, int target) static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target) { struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask); - int i, cpu, idle_cpu = -1, nr = INT_MAX; + int i, cpu, smt_index, idle_cpu = -1, nr = INT_MAX; struct sched_domain_shared *sd_share; struct rq *this_rq = this_rq(); int this = smp_processor_id(); @@ -6994,9 +7017,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool } } + smt_index = get_smt_index(target); for_each_cpu_wrap(cpu, cpus, target + 1) { if (has_idle_core) { - i = select_idle_core(p, cpu, cpus, &idle_cpu); + i = select_idle_core(p, cpu, smt_index, cpus, &idle_cpu); if ((unsigned int)i < nr_cpumask_bits) return i; -- 1.8.3.1