On Thu, Jun 01, 2023 at 03:03:39PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> Sharing some initial benchmark results with the patch below.
>
> tl;dr
>
> - Hackbench starts off well but performance drops as the number of
>   groups increases.
>
> - schbench (old), tbench, netperf see improvement but there is a band
>   of outlier results when the system is fully loaded or slightly
>   overloaded.
>
> - Stream and ycsb-mongodb don't mind the extra search.
>
> - SPECjbb (with default scheduler tunables) and DeathStarBench are not
>   very happy.

Figures :/ Every time something like this is changed someone gets to be
sad..

> Tests were run on a dual socket 3rd Generation EPYC server (2 x 64C/128T)
> running in NPS1 mode. Following is the simplified machine topology:

Right, Zen3, 8 cores / LLC, 64 cores total gives 8 LLCs per node.

> ~~~~~~~~~~~~~~~~~~~~~~~
> ~ SPECjbb - Multi-JVM ~
> ~~~~~~~~~~~~~~~~~~~~~~~
>
> o NPS1
>
> - Default Scheduler Tunables
>
>   kernel              max-jOPS                critical-jOPS
>   tip                 100.00%                 100.00%
>   peter-next-level     94.45% (-5.55%)         98.25% (-1.75%)
>
> - Modified Scheduler Tunables
>
>   kernel              max-jOPS                critical-jOPS
>   tip                 100.00%                 100.00%
>   peter-next-level    100.00% (0.00%)         102.41% (2.41%)

I'm slightly confused: either the default or the tuned case is better.
Given it's counting ops, I'm thinking higher is better, so isn't this
an improvement in the tuned case?

> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
>
>   Pinning   Scaling   tip             peter-next-level
>   1 CCD     1         100.00%         100.30% (%diff: 0.30%)
>   2 CCD     2         100.00%         100.17% (%diff: 0.17%)
>   4 CCD     4         100.00%          99.60% (%diff: -0.40%)
>   8 CCD     8         100.00%          92.05% (%diff: -7.95%) *

Right, so that's a definite loss.

> I wonder if extending SIS_UTIL for SIS_NODE would help some of these
> cases but I've not tried tinkering with it yet. I'll continue testing
> on other NPS modes which would decrease the search scope. I'll also
> try running the same bunch of workloads on an even larger 4th
> Generation EPYC server to see if the behavior there is similar.

> > /*
> > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > + * local LLC comes up empty.
> > + */
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > +        struct sched_domain *parent = sd->parent;
> > +        struct sched_group *sg;
> > +
> > +        /* Make sure to not cross nodes. */
> > +        if (!parent || parent->flags & SD_NUMA)
> > +                return -1;
> > +
> > +        sg = parent->groups;
> > +        do {
> > +                int cpu = cpumask_first(sched_group_span(sg));
> > +                struct sched_domain *sd_child;
> > +
> > +                sd_child = per_cpu(sd_llc, cpu);
> > +                if (sd_child != sd) {
> > +                        int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);

Given how SIS_UTIL is inside select_idle_cpu() it should already be
effective here, no?

> > +                        if ((unsigned)i < nr_cpumask_bits)
> > +                                return i;
> > +                }
> > +
> > +                sg = sg->next;
> > +        } while (sg != parent->groups);
> > +
> > +        return -1;
> > +}

This DeathStarBench thing seems to suggest that scanning up to 4 CCDs
isn't too much of a bother; so perhaps something like so? (on top of
tip/sched/core from just a few hours ago, as I had to 'fix' this patch
and force pushed the thing)

And yeah, random hacks and heuristics here :/ Does there happen to be
additional topology that could aid us here? Does the CCD fabric itself
have a distance metric we can use?
---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22e0a249e0a8..f1d6ed973410 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7036,6 +7036,7 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
 {
         struct sched_domain *parent = sd->parent;
         struct sched_group *sg;
+        int nr = 4;
 
         /* Make sure to not cross nodes. */
         if (!parent || parent->flags & SD_NUMA)
@@ -7050,6 +7051,9 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
                                                 test_idle_cores(cpu), cpu);
                         if ((unsigned)i < nr_cpumask_bits)
                                 return i;
+
+                        if (!--nr)
+                                return -1;
                 }
 
                 sg = sg->next;
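
For readers following along, folding the two hunks above into the
select_idle_node() helper quoted earlier gives roughly the function below.
This is only a consolidated sketch of what is already in this mail: the cap
of 4 sibling-LLC scans is the heuristic from the diff, and all the helpers
(select_idle_cpu(), test_idle_cores(), sd_llc, sched_group_span()) are the
existing ones the quoted patch already uses.

/*
 * Consolidated sketch: the quoted select_idle_node() with the scan limit
 * from the diff above folded in. Illustrative only, not the final patch.
 */
static int
select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
{
        struct sched_domain *parent = sd->parent;
        struct sched_group *sg;
        int nr = 4;        /* probe at most 4 sibling LLCs per wakeup */

        /* Make sure to not cross nodes. */
        if (!parent || parent->flags & SD_NUMA)
                return -1;

        sg = parent->groups;
        do {
                int cpu = cpumask_first(sched_group_span(sg));
                struct sched_domain *sd_child;

                sd_child = per_cpu(sd_llc, cpu);
                if (sd_child != sd) {
                        int i = select_idle_cpu(p, sd_child,
                                                test_idle_cores(cpu), cpu);
                        if ((unsigned)i < nr_cpumask_bits)
                                return i;

                        /* Scanned a sibling LLC and found nothing; burn budget. */
                        if (!--nr)
                                return -1;
                }

                sg = sg->next;
        } while (sg != parent->groups);

        return -1;
}

The effect is that a wakeup probes at most four remote LLCs before giving
up, which lines up with the DeathStarBench numbers above where the loss
only shows up once the workload spans all 8 CCDs.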
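
On the SIS_UTIL question raised above: select_idle_cpu() already applies its
SIS_UTIL scan-depth limit inside whichever LLC it is handed, so every
sd_child scan in select_idle_node() is individually throttled. If "extending
SIS_UTIL for SIS_NODE" instead means deciding up front whether the
sibling-LLC walk is worth starting at all, one hypothetical shape for that
is sketched below; the helper name and the "budget is zero" cutoff are
assumptions made for illustration, not something from the patches in this
thread.

/*
 * Hypothetical sketch only: consult the target LLC's SIS_UTIL scan budget
 * (nr_idle_scan, maintained from the periodic load-balance utilization
 * stats) before bothering with the cross-LLC walk. The helper name and the
 * cutoff are illustrative assumptions, not a tested heuristic. The wakeup
 * path already runs inside rcu_read_lock(), which rcu_dereference() relies
 * on here.
 */
static bool sis_node_worth_scanning(int target)
{
        struct sched_domain_shared *sd_share;

        sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
        if (!sd_share)
                return true;

        /* A zero scan budget means SIS_UTIL considers the LLC overloaded. */
        return READ_ONCE(sd_share->nr_idle_scan) > 0;
}

Whether such a gate buys anything over the per-LLC throttling that already
happens inside select_idle_cpu() is exactly the open question in the
exchange above.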