On Tue, Jul 25, 2023 at 05:20:01PM +0800, Tang, Feng wrote:
> On Tue, Jul 25, 2023 at 12:13:56PM +0900, Hyeonggon Yoo wrote:
> [...]
> > > I ran the reproduce command in a local 2-socket box:
> > >
> > >   "/usr/bin/hackbench" "-g" "128" "-f" "20" "--process" "-l" "30000" "-s" "100"
> > >
> > > And found 2 kmem_caches being heavily exercised: 'kmalloc-cg-512' and
> > > 'skbuff_head_cache'. Only the order of 'kmalloc-cg-512' was reduced
> > > from 3 to 2 with the patch, while its 'cpu_partial_slabs' was bumped
> > > from 2 to 4. The setting of 'skbuff_head_cache' was kept unchanged.
> > >
> > > And this is consistent with the perf-profile info from 0Day's report,
> > > which shows the 'list_lock' contention is increased with the patch:
> > >
> > >   13.71%  13.70%  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath  -  -
> > >   5.80% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;__unfreeze_partials;skb_release_data;consume_skb;unix_stream_read_generic;unix_stream_recvmsg;sock_recvmsg;sock_read_iter;vfs_read;ksys_read;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_read
> > >   5.56% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;get_partial_node.part.0;___slab_alloc.constprop.0;__kmem_cache_alloc_node;__kmalloc_node_track_caller;kmalloc_reserve;__alloc_skb;alloc_skb_with_frags;sock_alloc_send_pskb;unix_stream_sendmsg;sock_write_iter;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_write
> >
> > Oh... neither of the assumptions was true. AFAICS it's a case where
> > decreasing the slab order increases lock contention.
> >
> > The number of cached objects per CPU is mostly the same (not exactly
> > the same, because the cpu slab is not accounted for),
>
> Yes, this makes sense!
>
> > but the lower order increases the number of slabs to process when
> > taking slabs (get_partial_node()) and when flushing the current cpu
> > partial list (put_cpu_partial() -> __unfreeze_partials()).
> >
> > Can we do better in this situation? Improve __unfreeze_partials()?
>
> We can check that. IMHO, the current MIN_PARTIAL and MAX_PARTIAL are
> too small as global parameters, especially for server platforms with
> hundreds of GBs or TBs of memory.
>
> As for 'list_lock', I'm thinking of bumping the number of per-cpu
> objects in set_cpu_partial(), or at least giving users an option to do
> that for server platforms with a huge amount of memory. Will do some
> tests around it, and let 0Day's performance testing framework monitor
> for any regression.

Before this performance regression of 'hackbench', I had already noticed
other cases where the per-node 'list_lock' is contended. As a single
processor (socket/node) can now have more and more CPUs (100+ or 200+),
the scalability problem could get much worse, so we may need to tackle
it sooner or later. We probably also need to separate the handling for
large platforms, which suffer from the scalability issue, from that for
small platforms, which care more about memory footprint.

To address the scalability issue on large systems with many CPUs and a
large amount of memory, I tried 3 hacky patches for a quick measurement
(a rough sketch of the changes is below):

1) increase MIN_PARTIAL and MAX_PARTIAL so that each node can keep more
   (up to 64) partial slabs
2) increase the order of each slab (including raising the max slub
   order to 4)
3) increase the number of per-cpu partial slabs

These patches are mostly independent of each other.
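To make the tweaks concrete, they are all small changes in mm/slub.c,
roughly like the sketch below. Note this is not the exact patch set I
measured: the new MIN_PARTIAL value and the doubled set_cpu_partial()
targets are just illustrative picks (only the "64 partial slabs per
node" cap matches what I described above), and patch-2 is left out of
the diff since it is mainly about letting calculate_order() pick bigger
slabs, e.g. by raising slub_max_order from its PAGE_ALLOC_COSTLY_ORDER
(3) default to 4:

--- a/mm/slub.c
+++ b/mm/slub.c
@@
-#define MIN_PARTIAL 5
+#define MIN_PARTIAL 20		/* patch-1: illustrative pick, keep more partial slabs per node */
@@
-#define MAX_PARTIAL 10
+#define MAX_PARTIAL 64		/* patch-1: up to 64 partial slabs per node */
@@ static void set_cpu_partial(struct kmem_cache *s)
 	else if (s->size >= 256)
-		nr_objects = 52;
+		nr_objects = 104;	/* patch-3: illustrative, doubled */
 	else
-		nr_objects = 120;
+		nr_objects = 240;	/* patch-3: illustrative, doubled */

Hard-coding bigger numbers like this obviously trades memory footprint
for less 'list_lock' traffic, which is why it had better be made
conditional on the system size.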
I ran the will-it-scale benchmark's 'mmap1' test case on a 2-socket
Sapphire Rapids server (112 cores, 224 threads) with 256 GB DRAM, in 3
configurations with the number of parallel test threads at 25%, 50% and
100% of the number of CPUs. The data is (base is the vanilla v6.5
kernel):

                   base     base + patch-1    base + patch-1,2  base + patch-1,2,3

config-25%       223670    -0.0%    223641   +24.2%    277734   +37.7%    307991   per_process_ops
config-50%       186172   +12.9%    210108   +42.4%    265028   +59.8%    297495   per_process_ops
config-100%       89289   +11.3%     99363   +47.4%    131571   +78.1%    158991   per_process_ops

And from the perf-profile data, the spinlock contention has been greatly
reduced:

  43.65    -5.8   37.81   -25.9   17.78   -34.4    9.24   self.native_queued_spin_lock_slowpath

Some more perf backtrace stack changes are:

  50.86    -4.7   46.16    -9.2   41.65   -16.3   34.57   bt.mmap_region.do_mmap.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
  52.99    -4.4   48.55    -8.1   44.93   -14.6   38.35   bt.do_mmap.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
  53.79    -4.4   49.44    -7.6   46.17   -14.0   39.75   bt.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
  54.11    -4.3   49.78    -7.5   46.65   -13.8   40.33   bt.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
  54.21    -4.3   49.89    -7.4   46.81   -13.7   40.50   bt.entry_SYSCALL_64_after_hwframe.__mmap
  55.21    -4.2   51.00    -6.8   48.40   -13.0   42.23   bt.__mmap
  19.59    -4.1   15.44   -10.3    9.30   -12.6    7.00   bt.___slab_alloc.__kmem_cache_alloc_bulk.kmem_cache_alloc_bulk.mas_alloc_nodes.mas_preallocate
  20.25    -4.1   16.16    -9.8   10.40   -12.1    8.15   bt.__kmem_cache_alloc_bulk.kmem_cache_alloc_bulk.mas_alloc_nodes.mas_preallocate.mmap_region
  20.52    -4.1   16.46    -9.7   10.80   -11.9    8.60   bt.kmem_cache_alloc_bulk.mas_alloc_nodes.mas_preallocate.mmap_region.do_mmap
  21.27    -4.0   17.25    -9.4   11.87   -11.4    9.83   bt.mas_alloc_nodes.mas_preallocate.mmap_region.do_mmap.vm_mmap_pgoff
  21.34    -4.0   17.33    -9.4   11.97   -11.4    9.95   bt.mas_preallocate.mmap_region.do_mmap.vm_mmap_pgoff.do_syscall_64
   2.60    -2.6    0.00    -2.6    0.00    -2.6    0.00   bt.get_partial_node.get_any_partial.___slab_alloc.__kmem_cache_alloc_bulk.kmem_cache_alloc_bulk
   2.77    -2.4    0.35 ± 70%   -2.8    0.00   -2.8    0.00   bt.get_any_partial.___slab_alloc.__kmem_cache_alloc_bulk.kmem_cache_alloc_bulk.mas_alloc_nodes

Yu Chen also saw similar slub lock contention in a scheduler-related
'hackbench' test, and with these debug patches the contention was also
reduced:
https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/

I'll think about how to apply the changes only to big systems (one
rough idea is sketched in the PS below) and post them as RFC patches.

Thanks,
Feng
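PS: to make the "only apply this to big systems" idea a bit more
concrete, one possible direction (just a rough sketch; the helper below
is hypothetical and not in mainline) is to scale the per-node partial
cap with the CPU count instead of hard-coding MAX_PARTIAL:

	/*
	 * Hypothetical helper for mm/slub.c: small machines keep the
	 * current MAX_PARTIAL (10), while machines with hundreds of
	 * CPUs are allowed up to 64 partial slabs per node.
	 */
	static unsigned int slub_max_partial(void)
	{
		return clamp_t(unsigned int, nr_cpu_ids / 4, MAX_PARTIAL, 64);
	}

The set_cpu_partial() bump could be gated in a similar way on
nr_cpu_ids and/or totalram_pages().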