Hi Michael, On Wed, 31 May 2023 at 19:55, Michael Kelley <mikelley@xxxxxxxxxxxxx> wrote: > > With some CPU numbering schemes, the function select_idle_cpu() currently > has a subtle bias to return the first hyper-thread in a core. As a result > work is not evenly balanced across hyper-threads in a core. The difference > is often as much as 15 to 20 percentage points -- i.e., the first > hyper-thread might run at 45% load while the second hyper-thread runs at > only 30% load or less. > > Two likely CPU numbering schemes make sense with today's typical case > of two hyper-threads per core: > > A. Enumerate all the first hyper-theads in a core, then all the second > hyper-threads in a core. If a system has 8 cores with 16 hyper-threads, > CPUs #0 and #8 are in the same core, as are CPUs #1 and #9, etc. > > B. Enumerate all hyper-threads in a core, then all hyper-threads in the > next core, etc. Again with 8 cores and 16 hyper-threads, CPUs #0 and > #1 are in the same core, as are CPUs #2 and #3, etc. > > Scheme A is used in most ACPI bare metal systems and in VMs running on > KVM. The enumeration order is determined by the order of the processor > entries in the ACPI MADT, and the ACPI spec *recommends* that the MADT > be set up for scheme A. > > However, for reasons that pre-date the ACPI recommendation, Hyper-V > guests have an ACPI MADT that is set up for scheme B. When using scheme B, > the subtle bias is evident in select_idle_cpu(). While having Hyper-V > conform to the ACPI spec recommendation would solve the Hyper-V problem, > it is also desirable for the fair scheduler code to be independent of the > CPU numbering scheme. ACPI is not always the firmware configuration > mechanism, and CPU numbering schemes might vary more in architectures > other than x86/x64. > > The bias occurs with scheme B when "has_idle_cpu" is true and I assume that you mean has_idle_core as I can't find has_idle_cpu in the code > select_idle_core() is called in the for_each_cpu_wrap() loop. Regardless > of where the loop starts, it will almost immediately encountered a CPU > that is the first hyper-thread in a core. If that core is idle, the CPU > number of that first hyper-thread is returned. If that core is not idle, > both hyper-threads are removed from the cpus mask, and the loop iterates > to choose another CPU that is the first hyper-thread in a core. As a > result, select_idle_core() almost always returns the first hyper-thread > in a core. > > The bias does not occur with scheme A because half of the CPU numbering > space is a series of CPUs that are the second hyper-thread in all the > cores. Assuming that the "target" CPU is evenly distributed throughout > the CPU numbering space, there's a 50/50 chance of starting in the portion > of the CPU numbering space that is all second hyper-threads. If > select_idle_core() finds a idle core, it will likely return a CPU that > is the second hyper-thread in the core. On average over the time, > both the first and second hyper-thread are equally likely to be > returned. > > Fix this bias by determining which hyper-thread in a core the "target" > CPU is -- i.e., the "smt index" of that CPU. Then when select_idle_core() > finds an idle core, it returns the CPU in the core that has the same > smt index. If that CPU is not valid to be chosen, just return the CPU > that was passed into select_idle_core() and don't worry about bias. > > With scheme B, this fix solves the bias problem by making the chosen > CPU be roughly equally likely to either hyper-thread. With scheme A > there's no real effect as the chosen CPU was already equally likely > to be either hyper-thread, and still is. > > The imbalance in hyper-thread loading was originally observed in a > customer workload, and then reproduced with a synthetic workload. > The change has been tested with the synthetic workload in a Hyper-V VM > running the normal scheme B CPU numbering, and then with the MADT > replaced with a scheme A version using Linux's ability to override > ACPI tables. The testing showed no change hyper-thread loading > balance with the scheme A CPU numbering, but the imbalance is > corrected if the CPU numbering is scheme B. You failed to explain why it's important to evenly select 1st or 2nd hyper-threads on the system. I don't see any performance figures. What would be the benefit ? > > Signed-off-by: Michael Kelley <mikelley@xxxxxxxxxxxxx> > --- > > I haven't previously worked in Linux scheduler code, so I'm posting this > as an RFC to point out the observed problem, and to suggest a solution. > There may well be considerations in the design of a solution that I'm not > aware of, so please educate me or suggest an alternative. > > It's also not completely clear whether an imbalance in hyper-thread > loading is actually a problem. It looks weird, and causes customer > concern when it is observed consistently across all cores in some > production workload. The fair scheduler strives to balance load evenly, so > I'm treating it as a problem that should be fixed, if for no other reason > than general goodness. But again, I'm sure reviewers will feel free to > tell me otherwise. :-) The fix takes relatively few CPU cycles, but it's > still a non-zero cost. > > FWIW, the same imbalance has been observed with kernels as far back as > 5.4, and the root cause in the code is essentially the same. So it's not > a recently introduced issue. I haven't tried anything earlier than 5.4. > > kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++------ > 1 file changed, 30 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 373ff5f..8b56e9d 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6832,6 +6832,19 @@ static inline bool test_idle_cores(int cpu) > return false; > } > > +static inline int get_smt_index(int core) > +{ > + int cpu, n = 0; > + > + for_each_cpu(cpu, cpu_smt_mask(core)) { > + if (cpu == core) > + return n; > + n++; > + } > + /* If get here, cpu_smt_mask is set up incorrectly */ > + return 0; > +} > + > /* > * Scans the local SMT mask to see if the entire core is idle, and records this > * information in sd_llc_shared->has_idle_cores. > @@ -6866,10 +6879,11 @@ void __update_idle_core(struct rq *rq) > * there are no idle cores left in the system; tracked through > * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above. > */ > -static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu) > +static int select_idle_core(struct task_struct *p, int core, int smt_index, > + struct cpumask *cpus, int *idle_cpu) > { > bool idle = true; > - int cpu; > + int cpu, index_cpu, n = 0; > > for_each_cpu(cpu, cpu_smt_mask(core)) { > if (!available_idle_cpu(cpu)) { > @@ -6885,10 +6899,13 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu > } > if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr)) > *idle_cpu = cpu; > + > + if (n++ == smt_index) > + index_cpu = cpu; > } > > if (idle) > - return core; > + return cpumask_test_cpu(index_cpu, p->cpus_ptr) ? index_cpu : core; > > cpumask_andnot(cpus, cpus, cpu_smt_mask(core)); > return -1; > @@ -6922,7 +6939,13 @@ static inline bool test_idle_cores(int cpu) > return false; > } > > -static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu) > +static inline int get_smt_index(int core) > +{ > + return 0; > +} > + > +static inline int select_idle_core(struct task_struct *p, int core, int smt_index, > + struct cpumask *cpus, int *idle_cpu) > { > return __select_idle_cpu(core, p); > } > @@ -6942,7 +6965,7 @@ static inline int select_idle_smt(struct task_struct *p, int target) > static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target) > { > struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask); > - int i, cpu, idle_cpu = -1, nr = INT_MAX; > + int i, cpu, smt_index, idle_cpu = -1, nr = INT_MAX; > struct sched_domain_shared *sd_share; > struct rq *this_rq = this_rq(); > int this = smp_processor_id(); > @@ -6994,9 +7017,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool > } > } > > + smt_index = get_smt_index(target); > for_each_cpu_wrap(cpu, cpus, target + 1) { > if (has_idle_core) { > - i = select_idle_core(p, cpu, cpus, &idle_cpu); > + i = select_idle_core(p, cpu, smt_index, cpus, &idle_cpu); > if ((unsigned int)i < nr_cpumask_bits) > return i; > > -- > 1.8.3.1 >