On AMD Family17h-based (EPYC) systems, a NUMA node can contain up to 8 cores
(16 threads) with the following topology.

             ----------------------------
         C0  | T0 T1 |    ||    | T0 T1 | C4
             --------|    ||    |--------
         C1  | T0 T1 | L3 || L3 | T0 T1 | C5
             --------|    ||    |--------
         C2  | T0 T1 | #0 || #1 | T0 T1 | C6
             --------|    ||    |--------
         C3  | T0 T1 |    ||    | T0 T1 | C7
             ----------------------------

Here, there are 2 last-level (L3) caches per NUMA node. A socket can
contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
With the full system configuration, the current scheduler creates 4 sched
domains:

  domain0 SMT  (span a core)
  domain1 MC   (span a last-level-cache)
  domain2 NUMA (span a socket: 4 nodes)
  domain3 NUMA (span a system: 8 nodes)

Note that there is no domain to represent the cpus spanning a NUMA node.
With this hierarchy of sched domains, the scheduler does not balance
properly in the following cases:

Case1:
When running 8 tasks, a properly balanced system should schedule a
task per NUMA node. This is not the case for the current scheduler.

Case2:
Sometimes, threads are scheduled on the same cpu while other cpus
are idle. This results in run-to-run inconsistency. For example:

  taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
                          --cpu-max-prime=100000 run

Total execution time ranges from 25.1s to 33.5s depending on thread
placement, where 25.1s is when all 8 threads are balanced properly
across 8 cpus.

Introduce a NUMA identity node sched domain, which is based on how the
SRAT/SLIT tables define a NUMA node. This results in the following
hierarchy of sched domains on the same system described above.

  domain0 SMT  (span a core)
  domain1 MC   (span a last-level-cache)
  domain2 NODE (span a NUMA node)
  domain3 NUMA (span a socket: 4 nodes)
  domain4 NUMA (span a system: 8 nodes)

This fixes the improper load balancing cases mentioned above.

Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@xxxxxxx>
---
Changes from V1 (https://lkml.org/lkml/2017/8/10/540)
  * Update commit message to include performance number.
  * Change from NUMA_IDEN to NODE.
  * Fix code styling and update comments.

 kernel/sched/topology.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 79895ae..2dd5b11 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1335,6 +1335,10 @@ void sched_init_numa(void)
 	if (!sched_domains_numa_distance)
 		return;
 
+	/* Includes NUMA identity node at level 0. */
+	sched_domains_numa_distance[level++] = curr_distance;
+	sched_domains_numa_levels = level;
+
 	/*
 	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
 	 * unique distances in the node_distance() table.
@@ -1382,8 +1386,7 @@ void sched_init_numa(void)
 		return;
 
 	/*
-	 * 'level' contains the number of unique distances, excluding the
-	 * identity distance node_distance(i,i).
+	 * 'level' contains the number of unique distances
 	 *
 	 * The sched_domains_numa_distance[] array includes the actual distance
 	 * numbers.
@@ -1445,9 +1448,26 @@ void sched_init_numa(void)
 		tl[i] = sched_domain_topology[i];
 
 	/*
+	 * Do not setup NUMA node level if it has the same cpumask
+	 * as sched domain at previous level:
+	 * This is the case for system with:
+	 *   - LLC == NODE : LLC (MC) sched domain span a NUMA node.
+	 *   - DIE == NODE : DIE sched domain span a NUMA node.
+	 *
+	 * Assume all NUMA nodes are identical, so only check node 0.
+	 */
+	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
+		tl[i++] = (struct sched_domain_topology_level){
+			.mask = sd_numa_mask,
+			.numa_level = 0,
+			SD_INIT_NAME(NODE)
+		};
+	}
+
+	/*
 	 * .. and append 'j' levels of NUMA goodness.
 	 */
-	for (j = 0; j < level; i++, j++) {
+	for (j = 1; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
 			.mask = sd_numa_mask,
 			.sd_flags = cpu_numa_flags,
-- 
2.7.4
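
For illustration only, the level-selection logic in this patch can be
modelled in userspace roughly as below. This is a sketch, not kernel code:
struct level_model and build_levels() are hypothetical names, and each span
is reduced to a CPU count as seen from CPU 0 instead of a cpumask (which
assumes the levels are strictly nested, so equal counts imply equal masks).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one topology level; not the kernel struct. */
struct level_model {
	const char *name;	/* "SMT", "MC", "NODE" or "NUMA" */
	unsigned int cpus;	/* CPUs spanned by this level, seen from CPU 0 */
};

/*
 * Copy the non-NUMA base levels, add a NODE level only when node 0 spans
 * something different from the last base level, then append one NUMA
 * level per remaining distance. Returns the number of levels written.
 */
static size_t build_levels(const struct level_model *base, size_t nr_base,
			   unsigned int node_cpus,
			   const unsigned int *numa_spans, size_t nr_numa,
			   struct level_model *out, size_t cap)
{
	size_t i, j;

	assert(nr_base > 0 && nr_base + 1 + nr_numa <= cap);

	for (i = 0; i < nr_base; i++)
		out[i] = base[i];

	/* Skip NODE when it would duplicate the previous level (LLC == NODE). */
	if (node_cpus != out[i - 1].cpus)
		out[i++] = (struct level_model){ .name = "NODE", .cpus = node_cpus };

	/* Remaining NUMA levels: socket, system, ... */
	for (j = 0; j < nr_numa; i++, j++)
		out[i] = (struct level_model){ .name = "NUMA", .cpus = numa_spans[j] };

	return i;
}
```

With the EPYC numbers from the commit message (SMT = 2, MC = 8, node = 16,
socket = 64, system = 128 cpus) the model yields five levels with NODE at
index 2. If MC already spanned the whole node (16 cpus), the NODE level
would be skipped and four levels would remain.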