Hi Prateek,

On 2023-07-13 at 09:13:29 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> >
> > Tested on Sapphire Rapids, which has 2 x 56C/112T and 224 CPUs in total. C-states
> > deeper than C1E are disabled. Turbo is disabled. CPU frequency governor is performance.
> >
> > The baseline is v6.4-rc1 tip:sched/core, on top of
> > commit 637c9509f3db ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")
> >
> > patch0: this SD_IDLE_SIBLING patch with above change to TOPOLOGY_SD_FLAGS
> > patch1: hack patch to split 1 LLC domain into 4 smaller LLC domains (with some fixes on top of
> >         https://lore.kernel.org/lkml/ZJKjvx%2FNxooM5z1Y@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/;
> >         the test data in the above link is invalid due to bugs in the hack patch, fixed in this version)
> >
> >
> > Baseline vs Baseline+patch0:
> > There is not much difference between the two, which is expected because Sapphire Rapids
> > does not have multiple LLC domains within 1 NUMA node (also considering run-to-run variation):
>
> [snip]
>
> >
> > Baseline+patch1 vs Baseline+patch0+patch1:
> >
> > With multiple LLC domains in 1 NUMA node, SD_IDLE_SIBLING brings improvement
> > to hackbench/schbench, while it degrades netperf/tbench. This is aligned
> > with what was observed previously: if the waker and wakee wake each other
> > up frequently, they should be put together for cache locality, while for
> > tasks that do not share resources, always choosing an idle CPU is better.
> > Maybe in the future we can look back at SIS_SHORT and terminate the scan in
> > select_idle_node() if the waker and wakee have a close relationship with
> > each other.
>
> Gautham and I were discussing this and realized that when calling
> ttwu_queue_wakelist(), in a simulated split-LLC case, ttwu_queue_cond()
> will recommend using the wakelist and send an IPI despite the
> groups of the DIE domain sharing the cache in your case.
>
> Can you check if the following change helps the regression?
> (Note: Completely untested and there may be other such cases lurking
> around that we've not yet considered)
>

Good point. There are quite a few cpus_share_cache() call sites in the
code, and they could behave differently when the simulated split-LLC is
enabled. For example, the chance of choosing the previous CPU or a
recent_used_cpu in select_idle_sibling() is lower, because the span
covered by cpus_share_cache() shrinks (see the sketch after the quoted
diff below).

I launched netperf (224 threads) and hackbench (2 groups) with the below
patch applied; there was not much difference (considering the run-to-run
variation).

patch2: the cpus_share_cache() change below.

Baseline+patch1 vs Baseline+patch0+patch1+patch2:

netperf
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  224-threads      1.00 (  2.36)   -0.19 (  2.30)

hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            2-groups         1.00 (  4.78)   -6.28 (  9.42)

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a68d1276bab0..a8cab1c81aca 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3929,7 +3929,7 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
>  	 * If the CPU does not share cache, then queue the task on the
>  	 * remote rqs wakelist to avoid accessing remote data.
>  	 */
> -	if (!cpus_share_cache(smp_processor_id(), cpu))
> +	if (cpu_to_node(smp_processor_id()) != cpu_to_node(cpu))
>  		return true;
> 
>  	if (cpu == smp_processor_id())
> --
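To expand on the select_idle_sibling() point above: the prev-CPU and
recent_used_cpu fast paths are gated on cpus_share_cache() roughly as
below. This is a paraphrased sketch of kernel/sched/fair.c, not the
verbatim code; it omits the asym_fits_cpu()/capacity checks that recent
kernels also apply. With patch1 shrinking the LLC span, these early
returns are taken less often:

	/* Prefer the previous CPU if it is cache-affine with the target and idle. */
	if (prev != target && cpus_share_cache(prev, target) &&
	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
		return prev;

	/* Likewise for the per-task recent_used_cpu hint. */
	recent_used_cpu = p->recent_used_cpu;
	if (recent_used_cpu != prev && recent_used_cpu != target &&
	    cpus_share_cache(recent_used_cpu, target) &&
	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
	    cpumask_test_cpu(recent_used_cpu, p->cpus_ptr))
		return recent_used_cpu;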
Then I did a hack patch3 in select_idle_node(), to put client/server
(C/S) 1:1 wakeup workloads together. For netperf it is a 1:1 waker/wakee
relationship; for hackbench it is 1:16 waker/wakee by default (verified
by bpftrace).

patch3:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5904da690f59..3bdfbd546f14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7161,6 +7161,11 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
 	if (!parent || parent->flags & SD_NUMA)
 		return -1;
 
+	/* Task pairs should be put on the local LLC as much as possible. */
+	if (current->last_wakee == p && p->last_wakee == current &&
+	    !current->wakee_flips && !p->wakee_flips)
+		return -1;
+
 	sg = parent->groups;
 	do {
 		int cpu = cpumask_first(sched_group_span(sg));
-- 
2.25.1

Baseline+patch1 vs Baseline+patch0+patch1+patch3:

netperf
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  224-threads      1.00 (  2.36)  +804.31 (  2.88)

hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            2-groups         1.00 (  4.78)   -6.28 (  6.69)

This brings the performance of netperf back, while more or less keeping
the improvement of hackbench (considering the run-to-run variance).
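For reference, on why the !wakee_flips condition in patch3 identifies a
stable 1:1 pair: record_wakee() only bumps wakee_flips when a task wakes
a different wakee than last time, and periodically decays the count.
Below is a paraphrased sketch of record_wakee() in kernel/sched/fair.c
(the field names are from struct task_struct; details may differ between
kernel versions):

	static void record_wakee(struct task_struct *p)
	{
		/* Decay the flip count so stale flips age out over time. */
		if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
			current->wakee_flips >>= 1;
			current->wakee_flip_decay_ts = jiffies;
		}

		/* Only switching to a different wakee counts as a flip. */
		if (current->last_wakee != p) {
			current->last_wakee = p;
			current->wakee_flips++;
		}
	}

So two tasks that only ever wake each other keep wakee_flips at 0, which
is what patch3 relies on.

thanks,
Chenyu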