Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello Chenyu,

On 7/12/2023 10:49 PM, Chen Yu wrote:
> On 2023-07-08 at 21:17:10 +0800, Chen Yu wrote:
>> On 2023-07-05 at 13:57:02 +0200, Peter Zijlstra wrote:
>>> On Fri, Jun 16, 2023 at 12:04:48PM +0530, K Prateek Nayak wrote:
>>>
>>> --- a/arch/x86/kernel/smpboot.c
>>> +++ b/arch/x86/kernel/smpboot.c
>>> @@ -596,7 +596,7 @@ static inline int x86_sched_itmt_flags(v
>>>  #ifdef CONFIG_SCHED_MC
>>>  static int x86_core_flags(void)
>>>  {
>>> -	return cpu_core_flags() | x86_sched_itmt_flags();
>>> +	return cpu_core_flags() | x86_sched_itmt_flags() | SD_IDLE_SIBLING;
>>>  }
>> I guess this flag might need to be added into the valid mask:
>>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index d3a3b2646ec4..4a563e9f7b10 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1540,6 +1540,7 @@ static struct cpumask		***sched_domains_numa_masks;
>>  #define TOPOLOGY_SD_FLAGS		\
>>  	(SD_SHARE_CPUCAPACITY	|	\
>>  	 SD_SHARE_PKG_RESOURCES |	\
>> +	 SD_IDLE_SIBLING	|	\
>>  	 SD_NUMA		|	\
>>  	 SD_ASYM_PACKING)
>>>  #endif
>>>  #ifdef CONFIG_SCHED_SMT
>>> --- a/include/linux/sched/sd_flags.h
>>> +++ b/include/linux/sched/sd_flags.h
>>> @@ -161,3 +161,10 @@ SD_FLAG(SD_OVERLAP, SDF_SHARED_PARENT |
>>>   * NEEDS_GROUPS: No point in preserving domain if it has a single group.
>>>   */
>>>  SD_FLAG(SD_NUMA, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
>>> +
>>> +/*
>>> + * Search for idle CPUs in sibling groups
>>> + *
>>> + * NEEDS_GROUPS: Load balancing flag.
>>> + */
>>> +SD_FLAG(SD_IDLE_SIBLING, SDF_NEEDS_GROUPS)
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7046,6 +7046,38 @@ static int select_idle_cpu(struct task_s
>>>  }
>>>  
>>>  /*
>>> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
>>> + * local LLC comes up empty.
>>> + */
>>> +static int
>>> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>>> +{
>>> +	struct sched_domain *parent = sd->parent;
>>> +	struct sched_group *sg;
>>> +
>>> +	/* Make sure to not cross nodes. */
>>> +	if (!parent || parent->flags & SD_NUMA)
>>> +		return -1;
>>> +
>>> +	sg = parent->groups;
>>> +	do {
>>> +		int cpu = cpumask_first(sched_group_span(sg));
>>> +		struct sched_domain *sd_child = per_cpu(sd_llc, cpu);
>>>
>> I wonder if we can use rcu_dereference() in case the cpu hotplug
>> changes the content sd_llc points to. (I'm still thinking of the
>> symptom you described here:)
>> https://lore.kernel.org/lkml/20230605190746.GX83892@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>>
>> I'll launch some tests with this version on Sapphire Rapids(and with/without LLC-split hack patch).
> 
> Tested on Sapphire Rapids, which has 2 x 56C/112T and 224 CPUs in total. C-states
> deeper than C1E are disabled. Turbo is disabled. CPU frequency governor is performance.
> 
> The baseline is v6.4-rc1 tip:sched/core, on top of
> commit 637c9509f3db ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")
> 
> patch0: this SD_IDLE_SIBLING patch with above change to TOPOLOGY_SD_FLAGS
> patch1: hack patch to split 1 LLC domain into 4 smaller LLC domains(with some fixes on top of
>         https://lore.kernel.org/lkml/ZJKjvx%2FNxooM5z1Y@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/)
>         The test data in above link is invalid due to bugs in the hack patch, fixed in this version)
> 
> 
> Baseline vs Baseline+patch0:
> There is no much difference between the two, and it is expected because Sapphire Rapids
> does not have multiple LLC domains within 1 Numa node(also consider the run to run variation):
> 
> hackbench
> =========
> case            	load    	baseline(std%)	compare%( std%)
> process-pipe    	1-groups	 1.00 (  2.66)	+13.84 ( 12.80)
> process-pipe    	2-groups	 1.00 (  3.67)	 -8.37 (  2.33)
> process-pipe    	4-groups	 1.00 (  6.45)	 +4.17 (  6.36)
> process-pipe    	8-groups	 1.00 (  1.69)	 +2.28 (  1.72)
> process-sockets 	1-groups	 1.00 (  1.73)	 +0.61 (  0.69)
> process-sockets 	2-groups	 1.00 (  2.68)	 -2.20 (  0.55)
> process-sockets 	4-groups	 1.00 (  0.03)	 -0.34 (  0.17)
> process-sockets 	8-groups	 1.00 (  0.09)	 -0.28 (  0.09)
> threads-pipe    	1-groups	 1.00 (  2.42)	 +6.95 (  3.86)
> threads-pipe    	2-groups	 1.00 (  2.26)	 +2.68 (  6.56)
> threads-pipe    	4-groups	 1.00 (  5.08)	 +3.57 (  4.61)
> threads-pipe    	8-groups	 1.00 (  7.89)	 -2.52 (  3.45)
> threads-sockets 	1-groups	 1.00 (  1.15)	 +0.87 (  3.13)
> threads-sockets 	2-groups	 1.00 (  0.63)	 -0.02 (  1.27)
> threads-sockets 	4-groups	 1.00 (  0.27)	 +0.29 (  0.17)
> threads-sockets 	8-groups	 1.00 (  0.07)	 -0.42 (  0.40)
> 
> netperf
> =======
> case            	load    	baseline(std%)	compare%( std%)
> TCP_RR          	56-threads	 1.00 (  2.56)	 -0.25 (  3.27)
> TCP_RR          	112-threads	 1.00 (  2.26)	 +0.04 (  2.18)
> TCP_RR          	168-threads	 1.00 (  0.81)	 +0.01 (  0.74)
> TCP_RR          	224-threads	 1.00 (  0.65)	 +0.04 (  0.66)
> TCP_RR          	280-threads	 1.00 ( 64.56)	+69.47 ( 56.78)
> TCP_RR          	336-threads	 1.00 ( 20.39)	 +0.08 ( 19.58)
> TCP_RR          	392-threads	 1.00 ( 31.63)	 +0.17 ( 31.08)
> TCP_RR          	448-threads	 1.00 ( 39.72)	 -0.14 ( 39.14)
> UDP_RR          	56-threads	 1.00 (  8.94)	 -0.71 ( 12.03)
> UDP_RR          	112-threads	 1.00 ( 18.72)	 +0.78 ( 16.71)
> UDP_RR          	168-threads	 1.00 ( 11.39)	 -0.18 (  8.34)
> UDP_RR          	224-threads	 1.00 (  9.02)	 +0.81 ( 11.47)
> UDP_RR          	280-threads	 1.00 ( 15.87)	 -0.12 ( 12.87)
> UDP_RR          	336-threads	 1.00 ( 39.89)	 +2.25 ( 32.35)
> UDP_RR          	392-threads	 1.00 ( 28.17)	 +3.47 ( 25.99)
> UDP_RR          	448-threads	 1.00 ( 58.68)	 +0.35 ( 56.16)
> 
> tbench
> ======
> case            	load    	baseline(std%)	compare%( std%)
> loopback        	56-threads	 1.00 (  0.94)	 +0.24 (  0.69)
> loopback        	112-threads	 1.00 (  0.19)	 +0.18 (  0.25)
> loopback        	168-threads	 1.00 ( 52.17)	 -1.42 ( 50.95)
> loopback        	224-threads	 1.00 (  0.86)	 -0.38 (  0.19)
> loopback        	280-threads	 1.00 (  0.12)	 -0.28 (  0.17)
> loopback        	336-threads	 1.00 (  0.10)	 -0.33 (  0.19)
> loopback        	392-threads	 1.00 (  0.27)	 -0.49 (  0.26)
> loopback        	448-threads	 1.00 (  0.06)	 -0.88 (  0.59)
> 
> schbench
> ========
> case            	load    	baseline(std%)	compare%( std%)
> normal          	1-mthreads	 1.00 (  0.72)	 -1.47 (  0.41)
> normal          	2-mthreads	 1.00 (  1.66)	 +1.18 (  2.63)
> normal          	4-mthreads	 1.00 (  1.12)	 +1.20 (  4.52)
> normal          	8-mthreads	 1.00 ( 11.03)	 -3.87 (  5.14)
> 
> 
> Baseline+patch1    vs    Baseline+patch0+patch1:
> 
> With multiple LLC domains in 1 Numa node, SD_IDLE_SIBLING brings improvement
> to hackbench/schbench, while brings downgrading to netperf/tbench. This is aligned
> with what was observed previously, if the waker and wakee wakes up each other
> frequently, they would like to be put together for cache locality. While for
> other tasks do not have shared resource, always choosing an idle CPU is better.
> Maybe in the future we can look back at SIS_SHORT and terminates scan in
> select_idle_node() if the waker and wakee have close relationship with
> each other.

Gautham and I were discussing this and realized that when calling
ttwu_queue_wakelist(), in a simulated split-LLC case, ttwu_queue_cond()
will recommend using the wakelist and send an IPI despite the
groups of the DIE domain sharing the cache in your case.

Can you check if the following change helps the regression?
(Note: Completely untested and there may be other such cases lurking
around that we've not yet considered)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..a8cab1c81aca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3929,7 +3929,7 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
 	 * If the CPU does not share cache, then queue the task on the
 	 * remote rqs wakelist to avoid accessing remote data.
 	 */
-	if (!cpus_share_cache(smp_processor_id(), cpu))
+	if (cpu_to_node(smp_processor_id()) !=  cpu_to_node(cpu))
 		return true;
 
 	if (cpu == smp_processor_id())
--

> 
> 
> hackbench
> =========
> case            	load    	baseline(std%)	compare%( std%)
> process-pipe    	1-groups	 1.00 (  0.25)	+31.65 (  6.77)
> process-pipe    	2-groups	 1.00 (  0.28)	+29.50 (  5.35)
> process-pipe    	4-groups	 1.00 (  0.08)	+16.77 (  1.30)
> process-pipe    	8-groups	 1.00 (  0.20)	 -5.18 (  0.04)
> process-sockets 	1-groups	 1.00 (  0.23)	+13.68 (  1.28)
> process-sockets 	2-groups	 1.00 (  0.16)	+11.18 (  1.87)
> process-sockets 	4-groups	 1.00 (  0.23)	 -0.06 (  0.21)
> process-sockets 	8-groups	 1.00 (  0.36)	 +2.34 (  0.15)
> threads-pipe    	1-groups	 1.00 (  5.23)	+16.38 ( 12.10)
> threads-pipe    	2-groups	 1.00 (  1.63)	+28.52 (  5.17)
> threads-pipe    	4-groups	 1.00 (  0.77)	+23.28 (  2.42)
> threads-pipe    	8-groups	 1.00 (  2.27)	 +2.35 (  5.75)
> threads-sockets 	1-groups	 1.00 (  2.31)	 +0.42 (  1.68)
> threads-sockets 	2-groups	 1.00 (  0.56)	 +3.98 (  0.65)
> threads-sockets 	4-groups	 1.00 (  0.12)	 +0.29 (  0.32)
> threads-sockets 	8-groups	 1.00 (  0.86)	 +1.92 (  0.27)
> 
> netperf
> =======
> case            	load    	baseline(std%)	compare%( std%)
> TCP_RR          	56-threads	 1.00 ( 12.46)	 -1.62 ( 12.14)
> TCP_RR          	112-threads	 1.00 (  1.34)	 -0.16 (  1.42)
> TCP_RR          	168-threads	 1.00 (  6.26)	 -0.88 (  6.08)
> TCP_RR          	224-threads	 1.00 (  2.19)	-90.18 (  6.12)
> TCP_RR          	280-threads	 1.00 ( 12.27)	-63.81 ( 74.25)
> TCP_RR          	336-threads	 1.00 ( 29.28)	 -6.21 ( 18.48)
> TCP_RR          	392-threads	 1.00 ( 39.39)	 -3.87 ( 26.63)
> TCP_RR          	448-threads	 1.00 ( 47.45)	 -2.34 ( 32.37)
> UDP_RR          	56-threads	 1.00 (  3.28)	 -0.31 (  2.81)
> UDP_RR          	112-threads	 1.00 (  7.03)	 +0.55 (  7.03)
> UDP_RR          	168-threads	 1.00 ( 17.42)	 -0.51 ( 15.63)
> UDP_RR          	224-threads	 1.00 ( 20.79)	-68.28 ( 14.32)
> UDP_RR          	280-threads	 1.00 ( 26.23)	-68.58 ( 18.60)
> UDP_RR          	336-threads	 1.00 ( 38.99)	 -0.55 ( 21.19)
> UDP_RR          	392-threads	 1.00 ( 44.22)	 -1.91 ( 27.44)
> UDP_RR          	448-threads	 1.00 ( 55.11)	 -2.74 ( 38.55)
> 
> tbench
> ======
> case            	load    	baseline(std%)	compare%( std%)
> loopback        	56-threads	 1.00 (  2.69)	 -2.30 (  2.69)
> loopback        	112-threads	 1.00 (  1.92)	 +0.62 (  1.46)
> loopback        	168-threads	 1.00 (  0.97)	-67.69 (  0.06)
> loopback        	224-threads	 1.00 (  0.24)	 -6.79 (  8.81)
> loopback        	280-threads	 1.00 (  0.10)	 +0.47 (  0.62)
> loopback        	336-threads	 1.00 (  0.85)	 -0.05 (  0.05)
> loopback        	392-threads	 1.00 (  0.62)	 +0.77 (  0.50)
> loopback        	448-threads	 1.00 (  0.36)	 +0.77 (  0.77)
> 
> schbench
> ========
> case            	load    	baseline(std%)	compare%( std%)
> normal          	1-mthreads	 1.00 (  0.82)	 +1.44 (  1.24)
> normal          	2-mthreads	 1.00 (  2.13)	 +1.16 (  0.41)
> normal          	4-mthreads	 1.00 (  3.82)	 -0.30 (  1.48)
> normal          	8-mthreads	 1.00 (  4.80)	+22.43 ( 13.03)
> 
> But since the multiple LLC is just a simulation on Intel platform for now,
> the patch is ok and:
> 
> Tested-by: Chen Yu <yu.c.chen@xxxxxxxxx>
> 
> thanks,
> Chenyu

--
Thanks and Regards,
Prateek



[Index of Archives]     [Linux Stable Commits]     [Linux Stable Kernel]     [Linux Kernel]     [Linux USB Devel]     [Linux Video &Media]     [Linux Audio Users]     [Yosemite News]     [Linux SCSI]

  Powered by Linux