There are reports [1][2] of severe lock contention on slub's per-node
'list_lock' in the 'hackbench' test on server systems. Similar contention
is also seen when running the 'mmap1' case of will-it-scale on big
systems. As the trend is for one processor (socket) to have more and more
CPUs (100+, 200+), the contention could become much more severe and turn
into a scalability issue.

One way to help reduce the contention is to double the per-cpu partial
number for large systems. Below is some performance data, which shows a
big improvement in the will-it-scale/mmap1 case, but no obvious change
for the 'hackbench' test. The patch itself only makes the per-cpu partial
number 2X; for better analysis, the 4X case is also profiled.

will-it-scale/mmap1
-------------------

Run the will-it-scale benchmark's 'mmap1' test case on a 2-socket
Sapphire Rapids server (112 cores / 224 threads) with 256 GB DRAM, in 3
configurations with parallel test threads at 25%, 50% and 100% of the
number of CPUs. The data is (base is the vanilla v6.5 kernel):

                   base         base + 2X patch      base + 4X patch
wis-mmap1-25     223670     +12.7%     251999     +34.9%     301749    per_process_ops
wis-mmap1-50     186020     +28.0%     238067     +55.6%     289521    per_process_ops
wis-mmap1-100     89200     +40.7%     125478     +62.4%     144858    per_process_ops

Taking the perf-profile comparison of the 50% test case, the lock
contention is greatly reduced:

      43.80      -11.5      32.27      -27.9      15.91    pp.self.native_queued_spin_lock_slowpath

hackbench
---------

Run the same hackbench test case mentioned in [1], using the same HW/SW
as will-it-scale:

                   base         base + 2X patch      base + 4X patch
hackbench        759951      +0.2%     761506      +0.5%     763972    hackbench.throughput

[1]. https://lore.kernel.org/all/202307172140.3b34825a-oliver.sang@xxxxxxxxx/
[2]. https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/

Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
---
 mm/slub.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index f7940048138c..51ca6dbaad09 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4361,6 +4361,13 @@ static void set_cpu_partial(struct kmem_cache *s)
 	else
 		nr_objects = 120;
 
+	/*
+	 * Give larger systems more per-cpu partial slabs to reduce/postpone
+	 * contention on the per-node partial list.
+	 */
+	if (num_online_cpus() >= 32)
+		nr_objects *= 2;
+
 	slub_set_cpu_partial(s, nr_objects);
 #endif
 }
-- 
2.27.0
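
For readers who want to see the heuristic end to end, here is a minimal
userspace sketch (not part of the patch) of how the sizing behaves with
the change applied. sysconf(_SC_NPROCESSORS_ONLN) stands in for the
kernel's num_online_cpus(), PAGE_SIZE is assumed to be 4096, and the
object-size thresholds are paraphrased from set_cpu_partial() in v6.5
mm/slub.c:

/*
 * Userspace sketch of the per-cpu partial sizing heuristic with this
 * patch applied; values and thresholds mirror v6.5 mm/slub.c.
 */
#include <stdio.h>
#include <unistd.h>

#define SKETCH_PAGE_SIZE 4096	/* assumption: 4K pages */

static unsigned int nr_partial_objects(unsigned int size, long ncpus)
{
	unsigned int nr_objects;

	/* Same object-size buckets as set_cpu_partial() in mm/slub.c */
	if (size >= SKETCH_PAGE_SIZE)
		nr_objects = 6;
	else if (size >= 1024)
		nr_objects = 24;
	else if (size >= 256)
		nr_objects = 52;
	else
		nr_objects = 120;

	/* The change proposed here: double the budget on large systems */
	if (ncpus >= 32)
		nr_objects *= 2;

	return nr_objects;
}

int main(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	unsigned int sizes[] = { 64, 256, 1024, 4096 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("object size %4u -> cpu_partial objects %u (%ld CPUs)\n",
		       sizes[i], nr_partial_objects(sizes[i], ncpus), ncpus);
	return 0;
}

On a 224-thread machine like the Sapphire Rapids server above, every size
bucket gets 2X the objects, which is what postpones the fallback to the
contended per-node partial list.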