> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Friday, March 19, 2021 5:16 PM
> To: tim.c.chen@xxxxxxxxxxxxxxx; catalin.marinas@xxxxxxx; will@xxxxxxxxxx;
> rjw@xxxxxxxxxxxxx; vincent.guittot@xxxxxxxxxx; bp@xxxxxxxxx;
> tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx; lenb@xxxxxxxxxx; peterz@xxxxxxxxxxxxx;
> dietmar.eggemann@xxxxxxx; rostedt@xxxxxxxxxxx; bsegall@xxxxxxxxxx;
> mgorman@xxxxxxx
> Cc: msys.mizuma@xxxxxxxxx; valentin.schneider@xxxxxxx;
> gregkh@xxxxxxxxxxxxxxxxxxx; Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>;
> juri.lelli@xxxxxxxxxx; mark.rutland@xxxxxxx; sudeep.holla@xxxxxxx;
> aubrey.li@xxxxxxxxxxxxxxx; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; linux-acpi@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx;
> xuwei (O) <xuwei5@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>;
> guodong.xu@xxxxxxxxxx; yangyicong <yangyicong@xxxxxxxxxx>; Liguozhu (Kenneth)
> <liguozhu@xxxxxxxxxxxxx>; linuxarm@xxxxxxxxxxxxx; hpa@xxxxxxxxx; Song Bao Hua
> (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> Subject: [RFC PATCH v5 3/4] scheduler: scan idle cpu in cluster before scanning
> the whole llc
>
> On kunpeng920, cpus within one cluster can communicate with each other
> much faster than cpus across different clusters. A simple hackbench
> can prove that.
> hackbench running on 4 cpus in one single cluster and on 4 cpus across
> different clusters shows a large contrast:
> (1) within a cluster:
> root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 20000 -g 1
> Running in threaded mode with 1 groups using 40 file descriptors each
> (== 40 tasks)
> Each sender will pass 20000 messages of 100 bytes
> Time: 4.285
>
> (2) across clusters:
> root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 20000 -g 1
> Running in threaded mode with 1 groups using 40 file descriptors each
> (== 40 tasks)
> Each sender will pass 20000 messages of 100 bytes
> Time: 5.524
>
> This inspires us to change the wake_affine path to scan the cluster before
> scanning the whole LLC, to try to gather related tasks in one cluster,
> which is done by this patch.
>
> To evaluate the performance impact on related tasks talking with each
> other, we run the below hackbench with the -g parameter varying from 2
> to 14; for each g, we run the command 10 times and take the average time:
> $ numactl -N 0 hackbench -p -T -l 20000 -g $1
>
> hackbench reports the time needed to complete a certain number of message
> transmissions between a certain number of tasks, for example:
> $ numactl -N 0 hackbench -p -T -l 20000 -g 10
> Running in threaded mode with 10 groups using 40 file descriptors each
> (== 400 tasks)
> Each sender will pass 20000 messages of 100 bytes
>
> Below is the result of hackbench w/ and w/o the cluster patch:
> g=        2      4      6      8     10      12      14
> w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
> w/ : 1.7881 3.7371 5.3301 6.9747 8.6909  9.9235 11.2608
>
> Obviously some recent commits have improved hackbench, so the change in
> the wake_affine path brings a smaller gain on hackbench than what we got
> in RFC v4.
> It is also obviously much trickier to leverage wake_affine than to leverage
> the spreading of tasks in the previous patch, as load balance might pull
> apart tasks which have been packed into a cluster, so alternative
> suggestions are welcome.
>
> In order to figure out how many times a cpu is picked from within the
> cluster and how many times a cpu is picked from outside the cluster, a
> tracepoint for debugging purposes is added in this patch.
> And here is a userspace bcc script to print a histogram of the results of
> select_idle_cpu():
> #!/usr/bin/python
> #
> # selectidlecpu.py   select idle cpu histogram.
> #
> # A Ctrl-C will print the gathered histogram then exit.
> #
> # 18-March-2021   Barry Song   Created this.
>
> from __future__ import print_function
> from bcc import BPF
> from time import sleep
>
> # load BPF program
> b = BPF(text="""
>
> BPF_HISTOGRAM(dist);
>
> TRACEPOINT_PROBE(sched, sched_select_idle_cpu)
> {
>         u32 e;
>         if (args->idle / 4 == args->target / 4)
>                 e = 0; /* idle cpu from same cluster */

Oops here: in C, integer division truncates toward zero, so
-1/4 == 0/4 == 1/4 == 2/4 == 3/4 == 0. When target sits in cluster 0, part
of the -1 (no idle cpu) results is therefore put into the "local cluster"
bucket incorrectly.

>         else if (args->idle != -1)
>                 e = 1; /* idle cpu from different clusters */
>         else
>                 e = 2; /* no idle cpu */
>
>         dist.increment(e);
>         return 0;
> }
> """)

Fixed it to:

TRACEPOINT_PROBE(sched, sched_select_idle_cpu)
{
        u32 e;

        if (args->idle == -1)
                e = 2; /* no idle cpu */
        else if (args->idle / 4 == args->target / 4)
                e = 0; /* idle cpu from same cluster */
        else
                e = 1; /* idle cpu from different clusters */

        dist.increment(e);
        return 0;
}

>
> # header
> print("Tracing... Hit Ctrl-C to end.")
>
> # trace until Ctrl-C
> try:
>         sleep(99999999)
> except KeyboardInterrupt:
>         print()
>
> # output
>
> print("\nlinear histogram")
> print("~~~~~~~~~~~~~~~~")
> b["dist"].print_linear_hist("idle")
>
> Even with g=14, when the system is quite busy, we can see there is still
> some chance that the idle cpu is picked from the local cluster:
> linear histogram
> ~~~~~~~~~~~~~~~~
>      idle      : count      distribution
>         0      : 15234281   |***********                             |
>         1      : 18494      |                                        |
>         2      : 53066152   |****************************************|
>
> 0: local cluster
> 1: out of the cluster
> 2: select_idle_cpu() returns -1

After fixing the script, the new histogram looks like:

linear histogram
~~~~~~~~~~~~~~~~
     idle      : count      distribution
        0      : 2765930    |*                                       |
        1      : 68934      |                                        |
        2      : 77667475   |****************************************|

We get:
Local cluster:                    3.4358%
Out of cluster:                   0.0856%
-1 (no idle before nr becomes 0): 96.4785%

>
> Signed-off-by: Barry Song <song.bao.hua@xxxxxxxxxxxxx>
> ---
>  include/trace/events/sched.h | 22 ++++++++++++++++++++++
>  kernel/sched/fair.c          | 32 +++++++++++++++++++++++++++++++-
>  2 files changed, 53 insertions(+), 1 deletion(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index cbe3e15..86608cf 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -136,6 +136,28 @@
>  );
>
>  /*
> + * Tracepoint for select_idle_cpu:
> + */
> +TRACE_EVENT(sched_select_idle_cpu,
> +
> +        TP_PROTO(int target, int idle),
> +
> +        TP_ARGS(target, idle),
> +
> +        TP_STRUCT__entry(
> +                __field( int, target )
> +                __field( int, idle )
> +        ),
> +
> +        TP_fast_assign(
> +                __entry->target = target;
> +                __entry->idle   = idle;
> +        ),
> +
> +        TP_printk("target=%d idle=%d", __entry->target, __entry->idle)
> +);
> +
> +/*
>   * Tracepoint for waking up a task:
>   */
>  DECLARE_EVENT_CLASS(sched_wakeup_template,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c92ad9f2..3892d42 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6150,7 +6150,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>          if (!this_sd)
>                  return -1;
>
> -        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +        if (!sched_cluster_active())
> +                cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +#ifdef CONFIG_SCHED_CLUSTER
> +        if (sched_cluster_active())
> +                cpumask_and(cpus, cpu_cluster_mask(target), p->cpus_ptr);
> +#endif
>
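To make the intent of the hunk above easier to read: ignoring the #ifdef
plumbing, what it boils down to is the sketch below (not the posted code;
sched_cluster_active() and cpu_cluster_mask() are the helpers introduced by
the earlier patches of this series):

        /* pick the mask for the first scan pass of select_idle_cpu() */
        if (sched_cluster_active())
                /* cluster scheduling active: scan only target's cluster first */
                cpumask_and(cpus, cpu_cluster_mask(target), p->cpus_ptr);
        else
                /* no clusters: scan the whole LLC as before */
                cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
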
>          if (sched_feat(SIS_PROP) && !smt) {
>                  u64 avg_cost, avg_idle, span_avg;
> @@ -6171,6 +6176,29 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>                  time = cpu_clock(this);
>          }
>
> +#ifdef CONFIG_SCHED_CLUSTER
> +        if (sched_cluster_active()) {
> +                for_each_cpu_wrap(cpu, cpus, target) {
> +                        if (smt) {
> +                                i = select_idle_core(p, cpu, cpus, &idle_cpu);
> +                                if ((unsigned int)i < nr_cpumask_bits)
> +                                        return i;
> +
> +                        } else {
> +                                if (!--nr)
> +                                        return -1;
> +                                idle_cpu = __select_idle_cpu(cpu);
> +                                if ((unsigned int)idle_cpu < nr_cpumask_bits) {
> +                                        goto done;
> +                                }
> +                        }

BTW, if I return -1 directly here and don't fall back to the LLC, I get an
even better benchmark:

g=            2      4      6      8     10      12      14
w/o:       1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
w/ :       1.7881 3.7371 5.3301 6.9747 8.6909  9.9235 11.2608
return -1: 1.8324 3.6140 5.1029 6.5016 8.1867  9.7559 10.7716

So it seems it is much easier to get a real, positive impact from this
wake-up path change than from the previous 2/4 patch, in which we are only
spreading tasks. (A rough sketch of the "return -1" variant is appended at
the end of this mail.)

> +                }
> +
> +                cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +                cpumask_andnot(cpus, cpus, cpu_cluster_mask(target));
> +        }
> +#endif
> +
>          for_each_cpu_wrap(cpu, cpus, target) {
>                  if (smt) {
>                          i = select_idle_core(p, cpu, cpus, &idle_cpu);
> @@ -6186,6 +6214,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>                  }
>          }
>
> +done:
>          if (smt)
>                  set_idle_cores(this, false);
>
> @@ -6324,6 +6353,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>                  return target;
>
>          i = select_idle_cpu(p, sd, target);
> +        trace_sched_select_idle_cpu(target, i);
>          if ((unsigned)i < nr_cpumask_bits)
>                  return i;
>
> --
> 1.8.3.1

Thanks
Barry
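
For reference, the "return -1" variant mentioned inline above is roughly the
following (sketch only, on top of this patch, not something that has been
posted): once the cluster has been scanned without success, give up instead
of widening the search to the rest of the LLC:

        if (sched_cluster_active()) {
                for_each_cpu_wrap(cpu, cpus, target) {
                        /* scan the target's cluster exactly as in the patch above */
                }
                /* don't fall back to the remaining LLC cpus, just give up */
                return -1;
        }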