From: Vincent Guittot <vincent.guittot@xxxxxxxxxx> Sent: Monday, June 5, 2023 2:31 AM
>
> Hi Michael,
>
> On Wed, 31 May 2023 at 19:55, Michael Kelley <mikelley@xxxxxxxxxxxxx> wrote:
> >
> > With some CPU numbering schemes, the function select_idle_cpu() currently
> > has a subtle bias to return the first hyper-thread in a core. As a result,
> > work is not evenly balanced across the hyper-threads in a core. The difference
> > is often as much as 15 to 20 percentage points -- i.e., the first
> > hyper-thread might run at 45% load while the second hyper-thread runs at
> > only 30% load or less.
> >
> > Two likely CPU numbering schemes make sense with today's typical case
> > of two hyper-threads per core:
> >
> > A. Enumerate all the first hyper-threads in a core, then all the second
> >    hyper-threads in a core. If a system has 8 cores with 16 hyper-threads,
> >    CPUs #0 and #8 are in the same core, as are CPUs #1 and #9, etc.
> >
> > B. Enumerate all hyper-threads in a core, then all hyper-threads in the
> >    next core, etc. Again with 8 cores and 16 hyper-threads, CPUs #0 and
> >    #1 are in the same core, as are CPUs #2 and #3, etc.
> >
> > Scheme A is used in most ACPI bare metal systems and in VMs running on
> > KVM. The enumeration order is determined by the order of the processor
> > entries in the ACPI MADT, and the ACPI spec *recommends* that the MADT
> > be set up for scheme A.
> >
> > However, for reasons that pre-date the ACPI recommendation, Hyper-V
> > guests have an ACPI MADT that is set up for scheme B. When using scheme B,
> > the subtle bias is evident in select_idle_cpu(). While having Hyper-V
> > conform to the ACPI spec recommendation would solve the Hyper-V problem,
> > it is also desirable for the fair scheduler code to be independent of the
> > CPU numbering scheme. ACPI is not always the firmware configuration
> > mechanism, and CPU numbering schemes might vary more in architectures
> > other than x86/x64.
> >
> > The bias occurs with scheme B when "has_idle_cpu" is true and
>
> I assume that you mean has_idle_core as I can't find has_idle_cpu in the code

Yes.  You are right.

> > select_idle_core() is called in the for_each_cpu_wrap() loop. Regardless
> > of where the loop starts, it will almost immediately encounter a CPU
> > that is the first hyper-thread in a core. If that core is idle, the CPU
> > number of that first hyper-thread is returned. If that core is not idle,
> > both hyper-threads are removed from the cpus mask, and the loop iterates
> > to choose another CPU that is the first hyper-thread in a core. As a
> > result, select_idle_core() almost always returns the first hyper-thread
> > in a core.
> >
> > The bias does not occur with scheme A because half of the CPU numbering
> > space is a series of CPUs that are the second hyper-thread in all the
> > cores. Assuming that the "target" CPU is evenly distributed throughout
> > the CPU numbering space, there's a 50/50 chance of starting in the portion
> > of the CPU numbering space that is all second hyper-threads. If
> > select_idle_core() finds an idle core, it will likely return a CPU that
> > is the second hyper-thread in the core. On average over time, both the
> > first and second hyper-thread are equally likely to be returned.
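
As an aside on the scheme A / scheme B distinction above: a quick way to
see which scheme a particular machine uses is to dump core_id for each
CPU from sysfs.  A rough user-space sketch follows -- purely illustrative
and not part of the patch; it assumes the standard
/sys/devices/system/cpu/cpuN/topology/core_id files and that all CPUs are
online.  With scheme B, cpu0 and cpu1 report the same core_id; with
scheme A they report different core_ids.

/* Illustrative only: print the core_id of each CPU in numbering order. */
#include <stdio.h>

int main(void)
{
	char path[128];
	int cpu, core_id;
	FILE *f;

	for (cpu = 0; cpu < 1024; cpu++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
		f = fopen(path, "r");
		if (!f)
			break;		/* assume no more CPUs */
		if (fscanf(f, "%d", &core_id) == 1)
			printf("cpu%d: core_id %d\n", cpu, core_id);
		fclose(f);
	}
	return 0;
}
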
> >
> > Fix this bias by determining which hyper-thread in a core the "target"
> > CPU is -- i.e., the "smt index" of that CPU. Then when select_idle_core()
> > finds an idle core, it returns the CPU in the core that has the same
> > smt index. If that CPU is not valid to be chosen, just return the CPU
> > that was passed into select_idle_core() and don't worry about bias.
> >
> > With scheme B, this fix solves the bias problem by making the chosen
> > CPU roughly equally likely to be either hyper-thread. With scheme A
> > there's no real effect, as the chosen CPU was already equally likely
> > to be either hyper-thread, and still is.
> >
> > The imbalance in hyper-thread loading was originally observed in a
> > customer workload, and then reproduced with a synthetic workload.
> > The change has been tested with the synthetic workload in a Hyper-V VM
> > running the normal scheme B CPU numbering, and then with the MADT
> > replaced with a scheme A version using Linux's ability to override
> > ACPI tables. The testing showed no change in hyper-thread loading
> > balance with the scheme A CPU numbering, but the imbalance is
> > corrected if the CPU numbering is scheme B.
>
> You failed to explain why it's important to evenly select 1st or 2nd
> hyper-threads on the system. I don't see any performance figures.
> What would be the benefit ?

As I noted below, it's not completely clear to me whether this is a
problem.  I don't have a specific workload where overall runtime is
longer due to the SMT-level imbalance.  I'm not a scheduler expert, and
wanted input from those who are.  Linux generally does a good job of
balancing load, and given the code in fair.c that is devoted to
balancing, I inferred at least some importance.  But maybe balancing is
more important for the higher-level domains, and less so for the SMT
domain.

From a slightly different perspective, sysadmins are accustomed to
'htop' or similar tools showing balanced load.  This is a case where the
load is persistently unbalanced, and as such it drew attention.  And at
a slightly theoretical level, it *is* interesting that the CPU numbering
scheme affects the outcome of scheduler decisions in a noticeable way.

But if the sense is that an SMT-level imbalance really doesn't affect
anything, we'll live with CPU numbering scheme B being a bit of an
anomaly.  When necessary, we can explain that it doesn't have any
negative effects.

Michael

> >
> > Signed-off-by: Michael Kelley <mikelley@xxxxxxxxxxxxx>
> > ---
> >
> > I haven't previously worked in Linux scheduler code, so I'm posting this
> > as an RFC to point out the observed problem, and to suggest a solution.
> > There may well be considerations in the design of a solution that I'm not
> > aware of, so please educate me or suggest an alternative.
> >
> > It's also not completely clear whether an imbalance in hyper-thread
> > loading is actually a problem. It looks weird, and causes customer
> > concern when it is observed consistently across all cores in some
> > production workload. The fair scheduler strives to balance load evenly, so
> > I'm treating it as a problem that should be fixed, if for no other reason
> > than general goodness. But again, I'm sure reviewers will feel free to
> > tell me otherwise. :-)  The fix takes relatively few CPU cycles, but it's
> > still a non-zero cost.
> >
> > FWIW, the same imbalance has been observed with kernels as far back as
> > 5.4, and the root cause in the code is essentially the same. So it's not
> > a recently introduced issue. I haven't tried anything earlier than 5.4.
> >
> >  kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++------
> >  1 file changed, 30 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 373ff5f..8b56e9d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6832,6 +6832,19 @@ static inline bool test_idle_cores(int cpu)
> >         return false;
> >  }
> >
> > +static inline int get_smt_index(int core)
> > +{
> > +       int cpu, n = 0;
> > +
> > +       for_each_cpu(cpu, cpu_smt_mask(core)) {
> > +               if (cpu == core)
> > +                       return n;
> > +               n++;
> > +       }
> > +       /* If get here, cpu_smt_mask is set up incorrectly */
> > +       return 0;
> > +}
> > +
> >  /*
> >   * Scans the local SMT mask to see if the entire core is idle, and records this
> >   * information in sd_llc_shared->has_idle_cores.
> > @@ -6866,10 +6879,11 @@ void __update_idle_core(struct rq *rq)
> >   * there are no idle cores left in the system; tracked through
> >   * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
> >   */
> > -static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> > +static int select_idle_core(struct task_struct *p, int core, int smt_index,
> > +                           struct cpumask *cpus, int *idle_cpu)
> >  {
> >         bool idle = true;
> > -       int cpu;
> > +       int cpu, index_cpu, n = 0;
> >
> >         for_each_cpu(cpu, cpu_smt_mask(core)) {
> >                 if (!available_idle_cpu(cpu)) {
> > @@ -6885,10 +6899,13 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
> >                 }
> >                 if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
> >                         *idle_cpu = cpu;
> > +
> > +               if (n++ == smt_index)
> > +                       index_cpu = cpu;
> >         }
> >
> >         if (idle)
> > -               return core;
> > +               return cpumask_test_cpu(index_cpu, p->cpus_ptr) ? index_cpu : core;
> >
> >         cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
> >         return -1;
> > @@ -6922,7 +6939,13 @@ static inline bool test_idle_cores(int cpu)
> >         return false;
> >  }
> >
> > -static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> > +static inline int get_smt_index(int core)
> > +{
> > +       return 0;
> > +}
> > +
> > +static inline int select_idle_core(struct task_struct *p, int core, int smt_index,
> > +                                  struct cpumask *cpus, int *idle_cpu)
> >  {
> >         return __select_idle_cpu(core, p);
> >  }
> > @@ -6942,7 +6965,7 @@ static inline int select_idle_smt(struct task_struct *p, int target)
> >  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
> >  {
> >         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> > -       int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > +       int i, cpu, smt_index, idle_cpu = -1, nr = INT_MAX;
> >         struct sched_domain_shared *sd_share;
> >         struct rq *this_rq = this_rq();
> >         int this = smp_processor_id();
> > @@ -6994,9 +7017,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >                 }
> >         }
> >
> > +       smt_index = get_smt_index(target);
> >         for_each_cpu_wrap(cpu, cpus, target + 1) {
> >                 if (has_idle_core) {
> > -                       i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > +                       i = select_idle_core(p, cpu, smt_index, cpus, &idle_cpu);
> >                         if ((unsigned int)i < nr_cpumask_bits)
> >                                 return i;
> >
> > --
> > 1.8.3.1
> >
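
For anyone who wants to see the effect without setting up a VM, below is
a small user-space toy model of the scan done by select_idle_cpu() /
select_idle_core().  It is illustrative only, not kernel code: it assumes
8 cores with 2 threads each, scheme B numbering, cores that are
independently idle with a fixed probability, and it ignores cpus_ptr /
affinity.  With the existing policy the returned CPU is overwhelmingly a
first hyper-thread; with the same-smt-index policy the split comes out
roughly even.

#include <stdio.h>
#include <stdlib.h>

#define NR_CORES	8
#define NR_CPUS		(NR_CORES * 2)
#define TRIALS		1000000

static int core_of(int cpu)	{ return cpu / 2; }	/* scheme B numbering */
static int smt_index(int cpu)	{ return cpu % 2; }

/* Model the for_each_cpu_wrap() scan starting at target + 1. */
static int pick(int target, const int *core_idle, int patched)
{
	int removed[NR_CORES] = { 0 };
	int i;

	for (i = 0; i < NR_CPUS; i++) {
		int cpu = (target + 1 + i) % NR_CPUS;
		int core = core_of(cpu);

		if (removed[core])
			continue;		/* whole core already ruled out */
		if (core_idle[core])
			return patched ? core * 2 + smt_index(target) : cpu;
		removed[core] = 1;		/* busy core: drop both siblings */
	}
	return target;				/* no idle core found */
}

int main(void)
{
	long first[2] = { 0 }, second[2] = { 0 };
	int core_idle[NR_CORES];
	int t, c, patched, target, cpu;

	srand(1);
	for (t = 0; t < TRIALS; t++) {
		for (c = 0; c < NR_CORES; c++)
			core_idle[c] = (rand() % 100) < 30;	/* ~30% of cores idle */
		target = rand() % NR_CPUS;

		for (patched = 0; patched < 2; patched++) {
			cpu = pick(target, core_idle, patched);
			if (smt_index(cpu) == 0)
				first[patched]++;
			else
				second[patched]++;
		}
	}
	printf("existing: first-thread %ld  second-thread %ld\n", first[0], second[0]);
	printf("patched:  first-thread %ld  second-thread %ld\n", first[1], second[1]);
	return 0;
}
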