On Tue, Jul 25, 2023 at 12:13:56PM +0900, Hyeonggon Yoo wrote:
[...]
> >
> > I ran the reproduce command on a local 2-socket box:
> >
> >   "/usr/bin/hackbench" "-g" "128" "-f" "20" "--process" "-l" "30000" "-s" "100"
> >
> > and found 2 kmem_caches were boosted: 'kmalloc-cg-512' and
> > 'skbuff_head_cache'. Only the order of 'kmalloc-cg-512' was reduced
> > from 3 to 2 with the patch, while its 'cpu_partial_slabs' was bumped
> > from 2 to 4. The setting of 'skbuff_head_cache' was kept unchanged.
> >
> > And this is consistent with the perf-profile info from 0Day's report,
> > that the 'list_lock' contention is increased with the patch:
> >
> >  13.71%  13.70%  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath  -  -
> >   5.80%  native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;__unfreeze_partials;skb_release_data;consume_skb;unix_stream_read_generic;unix_stream_recvmsg;sock_recvmsg;sock_read_iter;vfs_read;ksys_read;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_read
> >   5.56%  native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;get_partial_node.part.0;___slab_alloc.constprop.0;__kmem_cache_alloc_node;__kmalloc_node_track_caller;kmalloc_reserve;__alloc_skb;alloc_skb_with_frags;sock_alloc_send_pskb;unix_stream_sendmsg;sock_write_iter;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_write
>
> Oh... so neither of the assumptions was true.
> AFAICS it's a case where decreasing the slab order increases lock
> contention.
>
> The number of cached objects per CPU is mostly the same (not exactly
> the same, because the cpu slab is not accounted for),

Yes, this makes sense!

> but it only increases the number of slabs to process while taking
> slabs (get_partial_node()), and while flushing the current cpu partial
> list (put_cpu_partial() -> __unfreeze_partials()).
>
> Can we do better in this situation? improve __unfreeze_partials()?

We can check that. IMHO, the current MIN_PARTIAL and MAX_PARTIAL are
too small as global parameters, especially for server platforms with
hundreds of GBs or even TBs of memory.

As for the 'list_lock' contention, I'm thinking of bumping the number
of per-cpu partial objects in set_cpu_partial(), or at least giving
users an option to do that for server platforms with a huge amount of
memory. I will do some tests around it, and let 0Day's performance
testing framework monitor for any regressions.

Thanks,
Feng

> > Also I tried to restore the slub_max_order to 3, and the regression
> > was gone.
> >
> >  static unsigned int slub_max_order =
> > -	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 2;
> > +	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 3;
> >  static unsigned int slub_min_objects;
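
To make the set_cpu_partial() idea above a bit more concrete, below is
a rough sketch of the direction I'm considering, written against the
current set_cpu_partial() in mm/slub.c (where MIN_PARTIAL is 5 and
MAX_PARTIAL is 10). The 64 GB threshold and the 4x factor are made-up
placeholders for illustration, not proposed values:

static void set_cpu_partial(struct kmem_cache *s)
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
	unsigned int nr_objects;

	/*
	 * cpu_partial determines the maximum number of objects kept
	 * on the per-cpu partial lists of a processor.
	 */
	if (!kmem_cache_has_cpu_partial(s))
		nr_objects = 0;
	else if (s->size >= PAGE_SIZE)
		nr_objects = 6;
	else if (s->size >= 1024)
		nr_objects = 24;
	else if (s->size >= 256)
		nr_objects = 52;
	else
		nr_objects = 120;

	/*
	 * Sketch only: scale up the per-cpu partial target on
	 * large-memory machines, so that a lower slab order does not
	 * turn into more trips through get_partial_node() and
	 * __unfreeze_partials() under the node's list_lock. The
	 * threshold and factor here are arbitrary placeholders.
	 */
	if (totalram_pages() > (64UL << (30 - PAGE_SHIFT)))	/* > 64 GB */
		nr_objects *= 4;

	slub_set_cpu_partial(s, nr_objects);
#endif
}

The multiplier could also be a boot parameter rather than a hardcoded
heuristic, so admins of big servers can opt in without affecting small
systems.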