On 2023-06-14 at 17:13:48 +0200, Peter Zijlstra wrote:
> On Wed, Jun 14, 2023 at 10:58:20PM +0800, Chen Yu wrote:
> > On 2023-06-14 at 10:17:57 +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 13, 2023 at 04:00:39PM +0530, K Prateek Nayak wrote:
> > > >
> > > > >> - SIS_NODE_TOPOEXT - tip:sched/core + this patch
> > > > >>   + new sched domain (Multi-Multi-Core or MMC)
> > > > >>   (https://lore.kernel.org/all/20230601153522.GB559993@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/)
> > > > >>   MMC domain groups 2 nearby CCX.
> > > > >
> > > > > OK, so you managed to get the NPS4 topology in NPS1 mode?
> > > >
> > > > Yup! But it is a hack. I'll leave the patch at the end.
> > >
> > > Chen Yu, could we do the reverse? Instead of building a bigger LLC
> > > domain, can we split our LLC based on SNC (sub-numa-cluster) topologies?
> > >
> > Hi Peter,
> > Do you mean with SNC enabled, if the LLC domain gets smaller?
> > According to the test, the answer seems to be yes.
>
> No, I mean to build smaller LLC domains even with SNC disabled, as-if
> SNC were active.

According to lstopo, Sapphire Rapids has 4 memory controllers within 1
package, and even with SNC disabled the LLCs have slightly different
distances to these 4 memory controllers. Unfortunately there is no
interface for the OS to query this partitioning, so I used a hack to
split the LLC into 4 smaller ones with SNC disabled, following the SNC4
topology (the patch at the end of this mail; it splits match_llc()
along APIC-id boundaries given by a sub_llc_nr= boot parameter).

Then I tested this platform with and without the LLC split, in both
cases with SIS_NODE enabled and with this issue fixed[1], using
something like the following to skip the target's own group when
iterating the groups in select_idle_node():

	if (cpumask_test_cpu(target, sched_group_span(sg)))
		continue;

SIS_NODE should have no impact on the non-split version on Sapphire
Rapids, because there the LLC already spans the whole node, so the
baseline is vanilla+SIS_NODE.
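For context, a minimal sketch of where that check sits. This
paraphrases the SIS_NODE group walk from memory, so the names and
surrounding details are illustrative rather than the exact patch:

	/*
	 * Illustrative only: walk the sched groups of the parent
	 * domain and scan each group's LLC for an idle CPU. The
	 * group holding @target is skipped because select_idle_cpu()
	 * has already scanned the target's own LLC before we get here.
	 */
	static int select_idle_node(struct task_struct *p,
				    struct sched_domain *sd, int target)
	{
		struct sched_domain *parent = sd->parent;
		struct sched_group *sg;

		/* Make sure not to cross node boundaries. */
		if (!parent || parent->flags & SD_NUMA)
			return -1;

		sg = parent->groups;
		do {
			if (!cpumask_test_cpu(target, sched_group_span(sg))) {
				int cpu = cpumask_first(sched_group_span(sg));
				int i = select_idle_cpu(p, per_cpu(sd_llc, cpu),
							test_idle_cores(cpu), cpu);

				if ((unsigned int)i < nr_cpumask_bits)
					return i;
			}

			sg = sg->next;
		} while (sg != parent->groups);

		return -1;
	}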
In summary, netperf shows a huge improvement, but hackbench and
schbench regress when the system is under load. I'll collect some
schedstats to check the scan depth in the problematic cases.

With SNC disabled and with the hack llc-split patch applied, a new DIE
domain is generated and the LLC is divided into 4 sub-LLC groups:

grep . domain*/{name,flags}
domain0/name:SMT
domain1/name:MC
domain2/name:DIE
domain3/name:NUMA
domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
domain3/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA

cat /proc/schedstat | grep cpu0 -A 4
cpu0 0 0 0 0 0 0 15968391465 3630455022 18084
domain0 00000000,00000000,00000000,00010000,00000000,00000000,00000001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,3fff0000,00000000,00000000,00003fff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,000000ff,ffffffff,ffff0000,00000000,00ffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
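Decoding those masks (assuming the usual linear mapping of mask bits to
CPU numbers on this 224-CPU box) confirms the split:

	domain0 (SMT):    2 CPUs (cpu0 plus its SMT sibling cpu112)
	domain1 (MC):    28 CPUs (0x3fff twice: 14 cores * 2 threads, 1/4 package)
	domain2 (DIE):  112 CPUs (the whole package, i.e. the former single LLC)
	domain3 (NUMA): 224 CPUs (the whole system)

so the wakeup fast path now scans a 28-CPU MC domain, and SIS_NODE can
walk the 4 sub-LLC groups inside the DIE domain.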
hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            1-groups         1.00 (  3.81)  -100.18 (  0.19)
process-pipe            2-groups         1.00 ( 10.74)   -59.21 (  0.91)
process-pipe            4-groups         1.00 (  5.37)   -56.37 (  0.56)
process-pipe            8-groups         1.00 (  0.36)   +17.11 (  0.82)
process-sockets         1-groups         1.00 (  0.09)   -26.53 (  1.45)
process-sockets         2-groups         1.00 (  0.82)   -26.45 (  0.40)
process-sockets         4-groups         1.00 (  0.21)    -4.09 (  0.19)
process-sockets         8-groups         1.00 (  0.13)    -5.31 (  0.36)
threads-pipe            1-groups         1.00 (  2.14)   -62.87 (  1.11)
threads-pipe            2-groups         1.00 (  3.18)   -55.82 (  1.14)
threads-pipe            4-groups         1.00 (  4.68)   -54.92 (  0.34)
threads-pipe            8-groups         1.00 (  5.08)   +15.81 (  3.08)
threads-sockets         1-groups         1.00 (  2.60)   -18.28 (  6.03)
threads-sockets         2-groups         1.00 (  0.83)   -30.17 (  0.60)
threads-sockets         4-groups         1.00 (  0.16)    -4.15 (  0.27)
threads-sockets         8-groups         1.00 (  0.36)    -5.92 (  0.94)

The 1-group, 2-group and 4-group cases suffered.

netperf
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  56-threads       1.00 (  2.75)   +10.49 ( 10.88)
TCP_RR                  112-threads      1.00 (  2.39)    -1.88 (  2.82)
TCP_RR                  168-threads      1.00 (  2.05)    +8.31 (  9.73)
TCP_RR                  224-threads      1.00 (  2.32)  +788.25 (  1.94)
TCP_RR                  280-threads      1.00 ( 59.77)   +83.07 ( 12.38)
TCP_RR                  336-threads      1.00 ( 21.61)    -0.22 ( 28.72)
TCP_RR                  392-threads      1.00 ( 31.26)    -0.13 ( 36.11)
TCP_RR                  448-threads      1.00 ( 39.93)    -0.14 ( 45.71)
UDP_RR                  56-threads       1.00 (  5.57)    +2.38 (  7.41)
UDP_RR                  112-threads      1.00 ( 24.53)    +1.51 (  8.43)
UDP_RR                  168-threads      1.00 ( 11.83)    +7.34 ( 20.20)
UDP_RR                  224-threads      1.00 ( 10.55)  +163.81 ( 20.64)
UDP_RR                  280-threads      1.00 ( 11.32)  +176.04 ( 21.83)
UDP_RR                  336-threads      1.00 ( 31.79)   +12.87 ( 37.23)
UDP_RR                  392-threads      1.00 ( 34.06)   +15.64 ( 44.62)
UDP_RR                  448-threads      1.00 ( 59.09)   +14.00 ( 52.93)

The 224-thread and 280-thread cases show good improvement.

tbench
======
case                    load            baseline(std%)  compare%( std%)
loopback                56-threads       1.00 (  0.83)    +1.38 (  1.56)
loopback                112-threads      1.00 (  0.19)    -4.25 (  0.90)
loopback                168-threads      1.00 ( 56.43)   -31.12 (  0.37)
loopback                224-threads      1.00 (  0.28)    -2.50 (  0.44)
loopback                280-threads      1.00 (  0.10)    -1.64 (  0.81)
loopback                336-threads      1.00 (  0.19)    -2.10 (  0.10)
loopback                392-threads      1.00 (  0.13)    -2.15 (  0.39)
loopback                448-threads      1.00 (  0.45)    -2.14 (  0.43)

There might be no impact on tbench (the 168-thread result is unstable
and can be ignored).

schbench
========
case                    load            baseline(std%)  compare%( std%)
normal                  1-mthreads       1.00 (  0.42)    -0.59 (  0.72)
normal                  2-mthreads       1.00 (  2.72)    +1.76 (  0.42)
normal                  4-mthreads       1.00 (  0.75)    -1.22 (  1.86)
normal                  8-mthreads       1.00 (  6.44)   -14.56 (  5.64)

The 8-mthreads case is not good for schbench.

The llc-split hack mentioned above:

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 352f0ce1ece4..ffc44639447e 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -511,6 +511,30 @@ static const struct x86_cpu_id intel_cod_cpu[] = {
 	{}
 };
 
+static unsigned int sub_llc_nr;
+
+static int __init parse_sub_llc(char *str)
+{
+	get_option(&str, &sub_llc_nr);
+
+	return 0;
+}
+early_param("sub_llc_nr", parse_sub_llc);
+
+static bool
+topology_same_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+	int idx1, idx2;
+
+	if (!sub_llc_nr)
+		return true;
+
+	idx1 = c->apicid / sub_llc_nr;
+	idx2 = o->apicid / sub_llc_nr;
+
+	return idx1 == idx2;
+}
+
 static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
 	const struct x86_cpu_id *id = x86_match_cpu(intel_cod_cpu);
@@ -530,7 +554,7 @@ static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 	 * means 'c' does not share the LLC of 'o'. This will be
 	 * reflected to userspace.
 	 */
-	if (match_pkg(c, o) && !topology_same_node(c, o) && intel_snc)
+	if (match_pkg(c, o) && (!topology_same_node(c, o) || !topology_same_llc(c, o)) && intel_snc)
 		return false;
 
 	return topology_sane(c, o, "llc");
--
2.25.1

[1] https://lore.kernel.org/lkml/5903fc0a-787e-9471-0256-77ff66f0bdef@xxxxxxxxxxxxx/