On Tue, Jul 25, 2023 at 05:20:01PM +0800, Tang, Feng wrote:
> On Tue, Jul 25, 2023 at 12:13:56PM +0900, Hyeonggon Yoo wrote:
> [...]
> > > I ran the reproduce command in a local 2-socket box:
> > >
> > >   "/usr/bin/hackbench" "-g" "128" "-f" "20" "--process" "-l" "30000" "-s" "100"
> > >
> > > And found 2 kmem_caches being heavily exercised: 'kmalloc-cg-512' and
> > > 'skbuff_head_cache'. Only the order of 'kmalloc-cg-512' was reduced
> > > from 3 to 2 with the patch, while its 'cpu_partial_slabs' was bumped
> > > from 2 to 4. The setting of 'skbuff_head_cache' was kept unchanged.
> > >
> > > And this is consistent with the perf-profile info from 0Day's report,
> > > which shows the 'list_lock' contention is increased with the patch:
> > >
> > >   13.71%  13.70%  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath  -  -
> > >   5.80% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;__unfreeze_partials;skb_release_data;consume_skb;unix_stream_read_generic;unix_stream_recvmsg;sock_recvmsg;sock_read_iter;vfs_read;ksys_read;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_read
> > >   5.56% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;get_partial_node.part.0;___slab_alloc.constprop.0;__kmem_cache_alloc_node;__kmalloc_node_track_caller;kmalloc_reserve;__alloc_skb;alloc_skb_with_frags;sock_alloc_send_pskb;unix_stream_sendmsg;sock_write_iter;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_write
> >
> > Oh... neither of the assumptions was true. AFAICS it's a case where
> > decreasing the slab order increases lock contention.
> >
> > The number of cached objects per CPU is mostly the same (not exactly
> > the same, because the cpu slab is not accounted for),
>
> Yes, this makes sense!
>
> > but the lower order increases the number of slabs to process when
> > taking slabs (get_partial_node()) and when flushing the current cpu
> > partial list (put_cpu_partial() -> __unfreeze_partials()).
> >
> > Can we do better in this situation? Improve __unfreeze_partials()?
>
> We can check that. IMHO, the current MIN_PARTIAL and MAX_PARTIAL are
> too small as global parameters, especially for server platforms with
> hundreds of GBs or TBs of memory.
>
> As for 'list_lock', I'm thinking of bumping the number of per-cpu
> objects in set_cpu_partial(), or at least giving users an option to do
> that for server platforms with a huge amount of memory. Will do some
> tests around it, and let 0Day's performance testing framework monitor
> for any regression.

Before this performance regression of 'hackbench', I had already noticed
other cases where the per-node 'list_lock' is contended. As a single
processor (socket/node) can now have more and more CPUs (100+ or 200+),
the scalability problem could get much worse, so we may need to tackle
it sooner or later. We probably also need to separate the handling for
large platforms, which suffer from the scalability issue, from that for
small platforms, which care more about memory footprint.

To address the scalability issue on large systems with many CPUs and a
large amount of memory, I tried 3 hacky patches for a quick measurement
(a rough sketch of the changes is below):

1) increase MIN_PARTIAL and MAX_PARTIAL so that each node can keep more
   (up to 64) partial slabs
2) increase the order of each slab (including raising the max slub
   order to 4)
3) increase the number of per-cpu partial slabs

These patches are mostly independent of each other.
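To make the tweaks concrete, they are all small changes in mm/slub.c,
roughly like the sketch below. Note this is not the exact patch set I
measured: the new MIN_PARTIAL value and the doubled set_cpu_partial()
targets are just illustrative picks (only the "64 partial slabs per
node" cap matches what I described above), and patch-2 is left out of
the diff since it is mainly about letting calculate_order() pick bigger
slabs, e.g. by raising slub_max_order from its PAGE_ALLOC_COSTLY_ORDER
(3) default to 4:

--- a/mm/slub.c
+++ b/mm/slub.c
@@
-#define MIN_PARTIAL 5
+#define MIN_PARTIAL 20		/* patch-1: illustrative pick, keep more partial slabs per node */
@@
-#define MAX_PARTIAL 10
+#define MAX_PARTIAL 64		/* patch-1: up to 64 partial slabs per node */
@@ static void set_cpu_partial(struct kmem_cache *s)
 	else if (s->size >= 256)
-		nr_objects = 52;
+		nr_objects = 104;	/* patch-3: illustrative, doubled */
 	else
-		nr_objects = 120;
+		nr_objects = 240;	/* patch-3: illustrative, doubled */

Hard-coding bigger numbers like this obviously trades memory footprint
for less 'list_lock' traffic, which is why it had better be made
conditional on the system size.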
I ran the will-it-scale benchmark's 'mmap1' test case on a 2-socket
Sapphire Rapids server (112 cores, 224 threads) with 256 GB DRAM, in 3
configurations with the number of parallel test threads at 25%, 50% and
100% of the number of CPUs. The data is (base is the vanilla v6.5
kernel):

                   base     base + patch-1    base + patch-1,2  base + patch-1,2,3

config-25%       223670    -0.0%    223641   +24.2%    277734   +37.7%    307991   per_process_ops
config-50%       186172   +12.9%    210108   +42.4%    265028   +59.8%    297495   per_process_ops
config-100%       89289   +11.3%     99363   +47.4%    131571   +78.1%    158991   per_process_ops

And from the perf-profile data, the spinlock contention has been greatly
reduced:

  43.65    -5.8   37.81   -25.9   17.78   -34.4    9.24   self.native_queued_spin_lock_slowpath

Some more perf backtrace stack changes are:

  50.86    -4.7   46.16    -9.2   41.65   -16.3   34.57   bt.mmap_region.do_mmap.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
  52.99    -4.4   48.55    -8.1   44.93   -14.6   38.35   bt.do_mmap.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
  53.79    -4.4   49.44    -7.6   46.17   -14.0   39.75   bt.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
  54.11    -4.3   49.78    -7.5   46.65   -13.8   40.33   bt.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
  54.21    -4.3   49.89    -7.4   46.81   -13.7   40.50   bt.entry_SYSCALL_64_after_hwframe.__mmap
  55.21    -4.2   51.00    -6.8   48.40   -13.0   42.23   bt.__mmap
  19.59    -4.1   15.44   -10.3    9.30   -12.6    7.00   bt.___slab_alloc.__kmem_cache_alloc_bulk.kmem_cache_alloc_bulk.mas_alloc_nodes.mas_preallocate
  20.25    -4.1   16.16    -9.8   10.40   -12.1    8.15   bt.__kmem_cache_alloc_bulk.kmem_cache_alloc_bulk.mas_alloc_nodes.mas_preallocate.mmap_region
  20.52    -4.1   16.46    -9.7   10.80   -11.9    8.60   bt.kmem_cache_alloc_bulk.mas_alloc_nodes.mas_preallocate.mmap_region.do_mmap
  21.27    -4.0   17.25    -9.4   11.87   -11.4    9.83   bt.mas_alloc_nodes.mas_preallocate.mmap_region.do_mmap.vm_mmap_pgoff
  21.34    -4.0   17.33    -9.4   11.97   -11.4    9.95   bt.mas_preallocate.mmap_region.do_mmap.vm_mmap_pgoff.do_syscall_64
   2.60    -2.6    0.00    -2.6    0.00    -2.6    0.00   bt.get_partial_node.get_any_partial.___slab_alloc.__kmem_cache_alloc_bulk.kmem_cache_alloc_bulk
   2.77    -2.4    0.35 ± 70%   -2.8    0.00   -2.8    0.00   bt.get_any_partial.___slab_alloc.__kmem_cache_alloc_bulk.kmem_cache_alloc_bulk.mas_alloc_nodes

Yu Chen also saw similar slub lock contention in a scheduler-related
'hackbench' test, and with these debug patches the contention was also
reduced:
https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/

I'll think about how to apply the changes only to big systems (one
rough idea is sketched in the PS below) and post them as RFC patches.

Thanks,
Feng
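PS: to make the "only apply this to big systems" idea a bit more
concrete, one possible direction (just a rough sketch; the helper below
is hypothetical and not in mainline) is to scale the per-node partial
cap with the CPU count instead of hard-coding MAX_PARTIAL:

	/*
	 * Hypothetical helper for mm/slub.c: small machines keep the
	 * current MAX_PARTIAL (10), while machines with hundreds of
	 * CPUs are allowed up to 64 partial slabs per node.
	 */
	static unsigned int slub_max_partial(void)
	{
		return clamp_t(unsigned int, nr_cpu_ids / 4, MAX_PARTIAL, 64);
	}

The set_cpu_partial() bump could be gated in a similar way on
nr_cpu_ids and/or totalram_pages().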