Hello Peter,

Sharing some initial benchmark results with the patch below.

tl;dr

- Hackbench starts off well but performance drops as the number of
  groups increases.

- schbench (old), tbench, netperf see improvement but there is a band
  of outlier results when the system is fully loaded or slightly
  overloaded.

- Stream and ycsb-mongodb don't mind the extra search.

- SPECjbb (with default scheduler tunables) and DeathStarBench are not
  very happy.

On 5/31/2023 5:34 PM, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID:     c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Gitweb:        https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Author:        Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> AuthorDate:    Tue, 30 May 2023 13:20:46 +02:00
> Committer:     Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> CommitterDate: Tue, 30 May 2023 22:46:27 +02:00
>
> sched/fair: Multi-LLC select_idle_sibling()
>
> Tejun reported that when he targets workqueues towards a specific LLC
> on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
> significant idle time.
>
> This is, of course, because of how select_idle_sibling() will not
> consider anything outside of the local LLC, and since all these tasks
> are short running the periodic idle load balancer is ineffective.
>
> And while it is good to keep work cache local, it is better to not
> have significant idle time. Therefore, have select_idle_sibling() try
> other LLCs inside the same node when the local one comes up empty.

Tests were run on a dual socket 3rd Generation EPYC server
(2 x 64C/128T) running in NPS1 mode. Following is the simplified
machine topology:

    NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    DIE0: 0-63, 128-191
        MC0: 0-7, 128-135
            SMT0: 0,128
            SMT1: 1,129
            ...
            SMT7: 7,135
        MC1: 8-15, 136-143
            SMT8: 8,136
            SMT9: 9,137
            ...
            SMT15: 15,143
        ...
        MC7: 56-63, 184-191
            SMT56: 56,184
            SMT57: 57,185
            ...
            SMT63: 63,191
    DIE1: 64-127, 192-255
        MC8: 64-71, 192-199
            SMT64: 64,192
            SMT65: 65,193
            ...
            SMT71: 71,199
        MC9: 72-79, 200-207
            SMT72: 72,200
            SMT73: 73,201
            ...
            SMT79: 79,207
        ...
        MC15: 120-127, 248-255
            SMT120: 120,248
            SMT121: 121,249
            ...
            SMT127: 127,255

Since the patch extends the idle CPU search to one domain above MC in
case of an unsuccessful search, for the above topology, the DIE domain
becomes the wake domain with potentially 128 CPUs to be searched.
Following are the benchmark results:

o Kernel Versions

- tip              - tip:sched/core at commit e2a1f85bf9f5
                     ("sched/psi: Avoid resetting the min update period
                     when it is unnecessary")
- peter-next-level - tip:sched/core + this patch

o Benchmark Results

Note: Benchmarks were run with boost enabled and C2 disabled to
minimize other external factors.

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test:             tip                   peter-next-level
 1-groups:       3.92 (0.00 pct)         4.05 (-3.31 pct)
 2-groups:       4.58 (0.00 pct)         3.84 (16.15 pct)
 4-groups:       4.99 (0.00 pct)         3.98 (20.24 pct)
 8-groups:       5.67 (0.00 pct)         6.05 (-6.70 pct)    * Overloaded
16-groups:       7.88 (0.00 pct)        10.56 (-34.01 pct)   * Overloaded

~~~~~~~~~~~~~~~~~~
~ schbench (Old) ~
~~~~~~~~~~~~~~~~~~

o NPS1

#workers:         tip                   peter-next-level
  1:            26.00 (0.00 pct)        24.00 (7.69 pct)
  2:            27.00 (0.00 pct)        24.00 (11.11 pct)
  4:            31.00 (0.00 pct)        28.00 (9.67 pct)
  8:            36.00 (0.00 pct)        33.00 (8.33 pct)
 16:            49.00 (0.00 pct)        47.00 (4.08 pct)
 32:            80.00 (0.00 pct)        81.00 (-1.25 pct)
 64:           169.00 (0.00 pct)       169.00 (0.00 pct)
128:           343.00 (0.00 pct)       365.00 (-6.41 pct)    * Fully Loaded
256:         42048.00 (0.00 pct)     35392.00 (15.82 pct)
512:         95104.00 (0.00 pct)     88704.00 (6.72 pct)

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients:          tip                   peter-next-level
   1           452.49 (0.00 pct)       457.94 (1.20 pct)
   2           862.44 (0.00 pct)       879.99 (2.03 pct)
   4          1604.27 (0.00 pct)      1618.87 (0.91 pct)
   8          2966.77 (0.00 pct)      3040.90 (2.49 pct)
  16          5176.70 (0.00 pct)      5292.29 (2.23 pct)
  32          8205.24 (0.00 pct)      8949.12 (9.06 pct)
  64         13956.71 (0.00 pct)     14461.42 (3.61 pct)
 128         24005.50 (0.00 pct)     26052.75 (8.52 pct)
 256         32457.61 (0.00 pct)     21999.41 (-32.22 pct)   * Overloaded
 512         34345.24 (0.00 pct)     41166.39 (19.86 pct)
1024         33432.92 (0.00 pct)     40900.84 (22.33 pct)

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

- 10 Runs:

Test:             tip                   peter-next-level
 Copy:      271317.35 (0.00 pct)    292440.22 (7.78 pct)
Scale:      205533.77 (0.00 pct)    203362.60 (-1.05 pct)
  Add:      221624.62 (0.00 pct)    225850.83 (1.90 pct)
Triad:      228500.68 (0.00 pct)    225885.25 (-1.14 pct)

- 100 Runs:

Test:             tip                   peter-next-level
 Copy:      317381.65 (0.00 pct)    318827.08 (0.45 pct)
Scale:      214145.00 (0.00 pct)    206213.69 (-3.70 pct)
  Add:      239243.29 (0.00 pct)    229791.67 (-3.95 pct)
Triad:      249477.76 (0.00 pct)    236843.06 (-5.06 pct)

~~~~~~~~~~~~~~~~~~~~
~ netperf - TCP_RR ~
~~~~~~~~~~~~~~~~~~~~

o NPS1

Test:             tip                   peter-next-level
  1-clients:   102839.97 (0.00 pct)    103540.33 (0.68 pct)
  2-clients:    98428.08 (0.00 pct)    100431.67 (2.03 pct)
  4-clients:    92298.45 (0.00 pct)     94800.51 (2.71 pct)
  8-clients:    85618.41 (0.00 pct)     89130.14 (4.10 pct)
 16-clients:    78722.18 (0.00 pct)     79715.38 (1.26 pct)
 32-clients:    73610.75 (0.00 pct)     72801.41 (-1.09 pct)
 64-clients:    55285.07 (0.00 pct)     56184.38 (1.62 pct)
128-clients:    31176.92 (0.00 pct)     32830.06 (5.30 pct)
256-clients:    20011.44 (0.00 pct)     15135.39 (-24.36 pct)   * Overloaded

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

                                         tip                  peter-next-level
Hmean     unixbench-dhry2reg-1      41322625.19 (   0.00%)    41224388.33 (  -0.24%)
Hmean     unixbench-dhry2reg-512  6252491108.60 (   0.00%)  6240160851.68 (  -0.20%)
Amean     unixbench-syscall-1        2501398.27 (   0.00%)     2577323.43 *  -3.04%*
Amean     unixbench-syscall-512      8120524.00 (   0.00%)     7512955.87 *   7.48%*
Hmean     unixbench-pipe-1           2359346.02 (   0.00%)     2392308.62 *   1.40%*
Hmean     unixbench-pipe-512       338790322.61 (   0.00%)   337711432.92 (  -0.32%)
Hmean     unixbench-spawn-1             4261.52 (   0.00%)        4164.90 (  -2.27%)
Hmean     unixbench-spawn-512          64328.93 (   0.00%)       62257.64 *  -3.22%*
Hmean     unixbench-execl-1             3677.73 (   0.00%)        3652.08 (  -0.70%)
Hmean     unixbench-execl-512          11984.83 (   0.00%)       13585.65 *  13.36%*
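
For readers comparing the "pct" columns in the hackbench, schbench, tbench, stream and netperf tables above: the figures appear consistent with a relative change against tip in which a positive value always means the patched kernel did better, regardless of whether the metric is lower-is-better (hackbench, schbench) or higher-is-better (tbench, stream, netperf). A minimal sketch of that arithmetic follows. The helper name `pct_gain` and its `lower_is_better` flag are hypothetical and are not part of any benchmark harness; the convention itself is inferred from the numbers, not stated in the report.

```c
/*
 * Hypothetical helper: relative change of "patched" against "base",
 * signed so that a positive result means the patched kernel did better.
 * The sign convention is an assumption inferred from the tables.
 */
static double pct_gain(double base, double patched, int lower_is_better)
{
	/* For time/latency metrics a smaller patched value is a gain. */
	double delta = lower_is_better ? base - patched : patched - base;

	return delta / base * 100.0;
}
```

For example, `pct_gain(7.88, 10.56, 1)` reproduces the hackbench 16-groups regression of roughly -34 pct, and `pct_gain(452.49, 457.94, 0)` reproduces the tbench 1-client gain of roughly 1.2 pct.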

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1

tip:                    131070.33 (var: 2.84%)
peter-next-level:       131070.33 (var: 2.84%) (0.00%)

~~~~~~~~~~~~~~~~~~~~~~~
~ SPECjbb - Multi-JVM ~
~~~~~~~~~~~~~~~~~~~~~~~

o NPS1

- Default Scheduler Tunables

kernel                  max-jOPS                critical-jOPS
tip                     100.00%                 100.00%
peter-next-level         94.45% (-5.55%)         98.25% (-1.75%)

- Modified Scheduler Tunables

kernel                  max-jOPS                critical-jOPS
tip                     100.00%                 100.00%
peter-next-level        100.00% (0.00%)         102.41% (2.41%)

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

Pinning   Scaling   tip        peter-next-level
1 CCD     1         100.00%    100.30% (%diff: 0.30%)
2 CCD     2         100.00%    100.17% (%diff: 0.17%)
4 CCD     4         100.00%     99.60% (%diff: -0.40%)
8 CCD     8         100.00%     92.05% (%diff: -7.95%) *

---

Based on the above data, the results seem to be mostly positive for
the microbenchmarks but not so much for SPECjbb and DeathStarBench,
which have high utilization. There is also a band of outliers when the
system is fully loaded or overloaded (~2 tasks per rq) for some of
the microbenchmarks.

I wonder if extending SIS_UTIL for SIS_NODE would help some of these
cases but I've not tried tinkering with it yet. I'll continue
testing on other NPS modes which would decrease the search scope.
I'll also try running the same bunch of workloads on an even larger
4th Generation EPYC server to see if the behavior there is similar.

Let me know if you need any data from my test system for any
specific workload.
I'll be more than happy to get them for you :)

>
> Reported-by: Tejun Heo <tj@xxxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> ---
>  kernel/sched/fair.c     | 38 ++++++++++++++++++++++++++++++++++++++
>  kernel/sched/features.h |  1 +
>  2 files changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0c..0172458 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  }
>
>  /*
> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> + * local LLC comes up empty.
> + */
> +static int
> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> +	struct sched_domain *parent = sd->parent;
> +	struct sched_group *sg;
> +
> +	/* Make sure to not cross nodes. */
> +	if (!parent || parent->flags & SD_NUMA)
> +		return -1;
> +
> +	sg = parent->groups;
> +	do {
> +		int cpu = cpumask_first(sched_group_span(sg));
> +		struct sched_domain *sd_child;
> +
> +		sd_child = per_cpu(sd_llc, cpu);
> +		if (sd_child != sd) {
> +			int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> +			if ((unsigned)i < nr_cpumask_bits)
> +				return i;
> +		}
> +
> +		sg = sg->next;
> +	} while (sg != parent->groups);
> +
> +	return -1;
> +}
> +
> +/*
>   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
>   * the task fits. If no CPU is big enough, but there are idle ones, try to
>   * maximize capacity.
> @@ -7199,6 +7231,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	if ((unsigned)i < nr_cpumask_bits)
>  		return i;
>
> +	if (sched_feat(SIS_NODE)) {
> +		i = select_idle_node(p, sd, target);
> +		if ((unsigned)i < nr_cpumask_bits)
> +			return i;
> +	}
> +
>  	return target;
>  }
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c..9e390eb 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   */
>  SCHED_FEAT(SIS_PROP, false)
>  SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_NODE, true)
>
>  /*
>   * Issue a WARN when we do multiple update_rq_clock() calls

--
Thanks and Regards,
Prateek
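
As an illustration of why the search scope grows to 128 CPUs on the test machine, here is a userspace toy model of the ring walk that select_idle_node() performs over the parent domain's groups. It is only a sketch under stated assumptions: every name in it (`toy_llc`, `toy_node_init`, `toy_scan_llc`, `toy_select_idle_node`) is hypothetical, the CPU ids are a simplified contiguous numbering rather than the real 0-7,128-135 layout, and the scan is a plain linear pass that ignores the limits the real select_idle_cpu() applies (SIS_UTIL, idle-core tracking, cpumask restrictions).

```c
#include <stdbool.h>

#define TOY_CPUS_PER_LLC  16	/* 8 cores x 2 SMT threads per MC domain */
#define TOY_LLCS_PER_NODE 8	/* MC0..MC7 under one DIE domain */

/* Hypothetical stand-in for one LLC; "next" mimics the circular group list. */
struct toy_llc {
	struct toy_llc *next;
	int first_id;			/* toy CPU id of the first CPU in this LLC */
	bool idle[TOY_CPUS_PER_LLC];
};

/* Link the LLCs of one node into a ring and mark every CPU busy. */
static void toy_node_init(struct toy_llc node[TOY_LLCS_PER_NODE])
{
	for (int l = 0; l < TOY_LLCS_PER_NODE; l++) {
		node[l].next = &node[(l + 1) % TOY_LLCS_PER_NODE];
		node[l].first_id = l * TOY_CPUS_PER_LLC;
		for (int c = 0; c < TOY_CPUS_PER_LLC; c++)
			node[l].idle[c] = false;
	}
}

/* Linear scan of one LLC; *scanned counts every CPU that was examined. */
static int toy_scan_llc(const struct toy_llc *llc, int *scanned)
{
	for (int c = 0; c < TOY_CPUS_PER_LLC; c++) {
		(*scanned)++;
		if (llc->idle[c])
			return llc->first_id + c;
	}
	return -1;
}

/* Walk the ring once, skipping the local LLC, as the patch does with sd. */
static int toy_select_idle_node(const struct toy_llc *local, int *scanned)
{
	const struct toy_llc *llc = local;

	do {
		if (llc != local) {
			int id = toy_scan_llc(llc, scanned);

			if (id >= 0)
				return id;
		}
		llc = llc->next;
	} while (llc != local);

	return -1;
}
```

In this model, a wakeup on a fully busy node examines 16 CPUs in the local LLC and then 112 more across the seven sibling LLCs before giving up, which is the 128-CPU worst case described for the DIE domain. The real cost depends on how far select_idle_cpu() is allowed to scan in each LLC.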