Re: [RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks within one LLC

Dietmar Eggemann <dietmar.eggemann@xxxxxxx> · Fri, 30 Apr 2021 12:42:56 +0200

On 29/04/2021 00:41, Song Bao Hua (Barry Song) wrote:
> 
> 
>> -----Original Message-----
>> From: Dietmar Eggemann [mailto:dietmar.eggemann@xxxxxxx]

[...]

>>>>> From: Dietmar Eggemann [mailto:dietmar.eggemann@xxxxxxx]
>>
>> [...]
>>
>>>>> On 20/04/2021 02:18, Barry Song wrote:

[...]

> Though we will never go to the slow path, wake_wide() will affect want_affine,
> and so eventually affect "new_cpu"?

Yes.

> 
> 	for_each_domain(cpu, tmp) {
> 		/*
> 		 * If both 'cpu' and 'prev_cpu' are part of this domain,
> 		 * cpu is a valid SD_WAKE_AFFINE target.
> 		 */
> 		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> 		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> 			if (cpu != prev_cpu)
> 				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
> 
> 			sd = NULL; /* Prefer wake_affine over balance flags */
> 			break;
> 		}
> 
> 		if (tmp->flags & sd_flag)
> 			sd = tmp;
> 		else if (!want_affine)
> 			break;
> 	}
> 
> If want_affine is false, the above won't execute and new_cpu (target) will
> always be "prev_cpu"? So when task size > cluster size in wake_wide(),
> this means we won't pull the wakee to the cluster of the waker? It seems
> sensible.

What is `task size` here?

The criterion to wake wide is `!(slave < factor || master < slave * factor)`,
i.e. `slave >= factor && master >= slave * factor`.
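
For reference, a minimal userspace sketch of that criterion. It is
modelled on wake_wide() in kernel/sched/fair.c but is not the kernel code:
the function name and the explicit `factor` parameter are illustrative
assumptions (the kernel reads the factor from per-CPU data), and the two
flip counts stand in for the waker's and wakee's wakee_flips.

```c
#include <stdbool.h>

/*
 * Sketch only: returns true when the wakeup would be treated as "wide".
 * waker_flips/wakee_flips stand in for the wakee_flips counters of the
 * waker and the wakee; factor stands in for the domain size in use.
 */
bool wake_wide_sketch(unsigned int waker_flips,
		      unsigned int wakee_flips,
		      unsigned int factor)
{
	unsigned int master = waker_flips;
	unsigned int slave = wakee_flips;

	/* master always holds the larger of the two flip counts */
	if (master < slave) {
		unsigned int tmp = master;

		master = slave;
		slave = tmp;
	}

	/* same as !(slave < factor || master < slave * factor) */
	return slave >= factor && master >= slave * factor;
}
```

With factor = 24, wake_wide_sketch(576, 24, 24) is true while
wake_wide_sketch(575, 24, 24) is false.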

I see that since you effectively change the sched domain size from LLC
to CLUSTER (e.g. 24->6) for wakeups with cpu and prev_cpu sharing LLC
(hence the `numactl -N 0` in your workload), wake_wide() has to take
CLUSTER size into consideration.
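
To illustrate how much the factor matters, here is a small sketch that
computes the smallest flip counts satisfying
`slave >= factor && master >= slave * factor`. The struct and function
names are hypothetical, and using the CLUSTER size as the factor is an
assumption about what such a change would look like, not something taken
from the patch.

```c
/* Smallest (slave, master) flip counts for which the criterion holds. */
struct flip_threshold {
	unsigned int slave;
	unsigned int master;
};

/* factor: the domain size used by the criterion, e.g. 24 (LLC) or 6 (CLUSTER) */
struct flip_threshold min_wide_flips(unsigned int factor)
{
	struct flip_threshold t = {
		.slave = factor,		/* slave >= factor */
		.master = factor * factor,	/* master >= slave * factor */
	};

	return t;
}
```

For factor = 24 this gives slave >= 24 and master >= 576; for factor = 6
it gives slave >= 6 and master >= 36, so the wide path would be reached
with far fewer flips.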

I was wondering if you saw wake_wide() returning 1 with your use cases:

numactl -N 0 /usr/lib/lmbench/bin/stream -P [6,12] -M 1024M -N 5