On Thu, Jun 01, 2023 at 03:03:39PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> Sharing some initial benchmark results with the patch below.
>
> tl;dr
>
> - Hackbench starts off well but performance drops as the number of
>   groups increases.
>
> - schbench (old), tbench, netperf see improvement but there is a band
>   of outlier results when the system is fully loaded or slightly
>   overloaded.
>
> - Stream and ycsb-mongodb don't mind the extra search.
>
> - SPECjbb (with default scheduler tunables) and DeathStarBench are not
>   very happy.

Figures :/ Every time something like this is changed someone gets to be
sad..

> Tests were run on a dual socket 3rd Generation EPYC server (2 x 64C/128T)
> running in NPS1 mode. Following is the simplified machine topology:

Right, Zen3, 8 cores / LLC, 64 cores total gives 8 LLCs per node.

> ~~~~~~~~~~~~~~~~~~~~~~~
> ~ SPECjbb - Multi-JVM ~
> ~~~~~~~~~~~~~~~~~~~~~~~
>
> o NPS1
>
> - Default Scheduler Tunables
>
>   kernel              max-jOPS                critical-jOPS
>   tip                 100.00%                 100.00%
>   peter-next-level     94.45% (-5.55%)         98.25% (-1.75%)
>
> - Modified Scheduler Tunables
>
>   kernel              max-jOPS                critical-jOPS
>   tip                 100.00%                 100.00%
>   peter-next-level    100.00% (0.00%)         102.41% (2.41%)

I'm slightly confused: either the default or the tuned case is better.
Given it's counting ops, I'm thinking higher is better, so isn't this
an improvement in the tuned case?

> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
>
>   Pinning   Scaling   tip             peter-next-level
>   1 CCD     1         100.00%         100.30% (%diff: 0.30%)
>   2 CCD     2         100.00%         100.17% (%diff: 0.17%)
>   4 CCD     4         100.00%          99.60% (%diff: -0.40%)
>   8 CCD     8         100.00%          92.05% (%diff: -7.95%) *

Right, so that's a definite loss.

> I wonder if extending SIS_UTIL for SIS_NODE would help some of these
> cases but I've not tried tinkering with it yet. I'll continue testing
> on other NPS modes which would decrease the search scope. I'll also
> try running the same bunch of workloads on an even larger 4th
> Generation EPYC server to see if the behavior there is similar.

> > /*
> > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > + * local LLC comes up empty.
> > + */
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > +        struct sched_domain *parent = sd->parent;
> > +        struct sched_group *sg;
> > +
> > +        /* Make sure to not cross nodes. */
> > +        if (!parent || parent->flags & SD_NUMA)
> > +                return -1;
> > +
> > +        sg = parent->groups;
> > +        do {
> > +                int cpu = cpumask_first(sched_group_span(sg));
> > +                struct sched_domain *sd_child;
> > +
> > +                sd_child = per_cpu(sd_llc, cpu);
> > +                if (sd_child != sd) {
> > +                        int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);

Given how SIS_UTIL is inside select_idle_cpu() it should already be
effective here, no?

> > +                        if ((unsigned)i < nr_cpumask_bits)
> > +                                return i;
> > +                }
> > +
> > +                sg = sg->next;
> > +        } while (sg != parent->groups);
> > +
> > +        return -1;
> > +}

This DeathStarBench thing seems to suggest that scanning up to 4 CCDs
isn't too much of a bother; so perhaps something like so? (on top of
tip/sched/core from just a few hours ago, as I had to 'fix' this patch
and force pushed the thing)

And yeah, random hacks and heuristics here :/ Does there happen to be
additional topology that could aid us here? Does the CCD fabric itself
have a distance metric we can use?
---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22e0a249e0a8..f1d6ed973410 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7036,6 +7036,7 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
 {
         struct sched_domain *parent = sd->parent;
         struct sched_group *sg;
+        int nr = 4;
 
         /* Make sure to not cross nodes. */
         if (!parent || parent->flags & SD_NUMA)
@@ -7050,6 +7051,9 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
                                                 test_idle_cores(cpu), cpu);
                         if ((unsigned)i < nr_cpumask_bits)
                                 return i;
+
+                        if (!--nr)
+                                return -1;
                 }
 
                 sg = sg->next;
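
For readers following along, folding the two hunks above into the
select_idle_node() helper quoted earlier gives roughly the function below.
This is only a consolidated sketch of what is already in this mail: the cap
of 4 sibling-LLC scans is the heuristic from the diff, and all the helpers
(select_idle_cpu(), test_idle_cores(), sd_llc, sched_group_span()) are the
existing ones the quoted patch already uses.

/*
 * Consolidated sketch: the quoted select_idle_node() with the scan limit
 * from the diff above folded in. Illustrative only, not the final patch.
 */
static int
select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
{
        struct sched_domain *parent = sd->parent;
        struct sched_group *sg;
        int nr = 4;        /* probe at most 4 sibling LLCs per wakeup */

        /* Make sure to not cross nodes. */
        if (!parent || parent->flags & SD_NUMA)
                return -1;

        sg = parent->groups;
        do {
                int cpu = cpumask_first(sched_group_span(sg));
                struct sched_domain *sd_child;

                sd_child = per_cpu(sd_llc, cpu);
                if (sd_child != sd) {
                        int i = select_idle_cpu(p, sd_child,
                                                test_idle_cores(cpu), cpu);
                        if ((unsigned)i < nr_cpumask_bits)
                                return i;

                        /* Scanned a sibling LLC and found nothing; burn budget. */
                        if (!--nr)
                                return -1;
                }

                sg = sg->next;
        } while (sg != parent->groups);

        return -1;
}

The effect is that a wakeup probes at most four remote LLCs before giving
up, which lines up with the DeathStarBench numbers above where the loss
only shows up once the workload spans all 8 CCDs.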
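
On the SIS_UTIL question raised above: select_idle_cpu() already applies its
SIS_UTIL scan-depth limit inside whichever LLC it is handed, so every
sd_child scan in select_idle_node() is individually throttled. If "extending
SIS_UTIL for SIS_NODE" instead means deciding up front whether the
sibling-LLC walk is worth starting at all, one hypothetical shape for that
is sketched below; the helper name and the "budget is zero" cutoff are
assumptions made for illustration, not something from the patches in this
thread.

/*
 * Hypothetical sketch only: consult the target LLC's SIS_UTIL scan budget
 * (nr_idle_scan, maintained from the periodic load-balance utilization
 * stats) before bothering with the cross-LLC walk. The helper name and the
 * cutoff are illustrative assumptions, not a tested heuristic. The wakeup
 * path already runs inside rcu_read_lock(), which rcu_dereference() relies
 * on here.
 */
static bool sis_node_worth_scanning(int target)
{
        struct sched_domain_shared *sd_share;

        sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
        if (!sd_share)
                return true;

        /* A zero scan budget means SIS_UTIL considers the LLC overloaded. */
        return READ_ONCE(sd_share->nr_idle_scan) > 0;
}

Whether such a gate buys anything over the per-LLC throttling that already
happens inside select_idle_cpu() is exactly the open question in the
exchange above.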