On Wed, 2023-07-26 at 12:06 +0200, Vlastimil Babka wrote:
> On 7/25/23 05:13, Hyeonggon Yoo wrote:
> > On Mon, Jul 24, 2023 at 11:43 PM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> > > On Thu, Jul 20, 2023 at 11:05:17PM +0800, Hyeonggon Yoo wrote:
> > > > > > > let me introduce our test process.
> > > > > > >
> > > > > > > we make sure the tests upon a commit and its parent have exactly
> > > > > > > the same environment except the kernel difference, and we also
> > > > > > > make sure the config to build the commit and its parent are
> > > > > > > identical.
> > > > > > >
> > > > > > > we run tests for one commit at least 6 times to make sure the
> > > > > > > data is stable.
> > > > > > >
> > > > > > > such as for this case, we rebuilt the commit and its parent's
> > > > > > > kernel, the config is attached FYI.
> > > > > >
> > > > > > Hello Oliver,
> > > > > >
> > > > > > Thank you for confirming the testing environment is totally fine.
> > > > > > and I'm sorry. I didn't mean to imply that your tests were bad.
> > > > > >
> > > > > > It was more like "oh, the data totally doesn't make sense to me"
> > > > > > and I blamed the tests rather than my poor understanding of the data ;)
> > > > > >
> > > > > > Anyway,
> > > > > > as the data shows a repeatable regression,
> > > > > > let's think more about the possible scenario:
> > > > > >
> > > > > > I can't stop thinking that the patch must've affected the system's
> > > > > > reclamation behavior in some way.
> > > > > > (I think more active anon pages with a similar total number of anon
> > > > > > pages implies the kernel scanned more pages)
> > > > > >
> > > > > > It might be because kswapd was more frequently woken up (possible if
> > > > > > skbs were allocated with GFP_ATOMIC)
> > > > > > But the data provided is not enough to support this argument.
> > > > > >
> > > > > > >      2.43 ±  7%  +4.5   6.90 ± 11%  perf-profile.children.cycles-pp.get_partial_node
> > > > > > >      3.23 ±  5%  +4.5   7.77 ±  9%  perf-profile.children.cycles-pp.___slab_alloc
> > > > > > >      7.51 ±  2%  +4.6  12.11 ±  5%  perf-profile.children.cycles-pp.kmalloc_reserve
> > > > > > >      6.94 ±  2%  +4.7  11.62 ±  6%  perf-profile.children.cycles-pp.__kmalloc_node_track_caller
> > > > > > >      6.46 ±  2%  +4.8  11.22 ±  6%  perf-profile.children.cycles-pp.__kmem_cache_alloc_node
> > > > > > >      8.48 ±  4%  +7.9  16.42 ±  8%  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
> > > > > > >      6.12 ±  6%  +8.6  14.74 ±  9%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> > > > > >
> > > > > > And this increased cycles in the SLUB slowpath implies that the actual
> > > > > > number of objects available in the per cpu partial list has been
> > > > > > decreased, possibly because of inaccuracy in the heuristic?
> > > > > > (cuz the assumption is that slabs cached per cpu are half-filled,
> > > > > > and that the slabs' order is s->oo)
> > > > >
> > > > > From the patch:
> > > > >
> > > > >  static unsigned int slub_max_order =
> > > > > -	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : PAGE_ALLOC_COSTLY_ORDER;
> > > > > +	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 2;
> > > > >
> > > > > Could this be related? that it reduces the order for some slab caches,
> > > > > so each per-cpu slab will have fewer objects, which makes the contention
> > > > > for the per-node spinlock 'list_lock' more severe when slab allocation
> > > > > is under pressure from many concurrent threads.
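
(To make that order arithmetic concrete, here is a rough userspace sketch,
assuming 4K pages and the 512-byte object size of the kmalloc-cg-512 cache
that comes up below, and ignoring per-slab metadata; the numbers are only
illustrative, not measured on the test box.)

#include <stdio.h>

/*
 * Objects that fit in one slab of the given page order, ignoring per-slab
 * metadata, so real kernel numbers can differ slightly.
 */
static unsigned int objs_per_slab(unsigned int order, unsigned int obj_size)
{
	return (4096u << order) / obj_size;
}

int main(void)
{
	/* 512-byte objects: order-3 slab -> 64 objects, order-2 -> 32 */
	printf("order 3: %u objects per slab\n", objs_per_slab(3, 512));
	printf("order 2: %u objects per slab\n", objs_per_slab(2, 512));
	return 0;
}

So at order 2 each slab taken from, or returned to, the node partial list
under 'list_lock' carries half as many 512-byte objects as at order 3.
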
> > > > hackbench uses skbuff_head_cache intensively. So we need to check
> > > > whether skbuff_head_cache's order was increased or decreased. On my
> > > > desktop skbuff_head_cache's order is 1 and I roughly guessed it was
> > > > increased, (but it's still worth checking in the testing env)
> > > >
> > > > But a decreased slab order does not necessarily mean a decreased number
> > > > of cached objects per CPU, because when oo_order(s->oo) is smaller,
> > > > it caches more slabs into the per cpu partial list.
> > > >
> > > > I think the more problematic situation is when oo_order(s->oo) is
> > > > higher, because the heuristic in SLUB assumes that each slab has order
> > > > oo_order(s->oo) and is half-filled. if it allocates slabs with order
> > > > lower than oo_order(s->oo), the number of cached objects per CPU
> > > > decreases drastically due to the inaccurate assumption.
> > > >
> > > > So yeah, a decreased number of cached objects per CPU could be the
> > > > cause of the regression due to the heuristic.
> > > >
> > > > And I have another theory: it allocated high order slabs from a remote
> > > > node even if there are slabs with lower order in the local node.
> > > >
> > > > ofc we need further experiment, but I think both improving the accuracy
> > > > of the heuristic and avoiding allocating high order slabs from remote
> > > > nodes would make SLUB more robust.
> > >
> > > I ran the reproduce command in a local 2-socket box:
> > >
> > > "/usr/bin/hackbench" "-g" "128" "-f" "20" "--process" "-l" "30000" "-s" "100"
> > >
> > > And found 2 kmem_caches have been boosted: 'kmalloc-cg-512' and
> > > 'skbuff_head_cache'. Only the order of 'kmalloc-cg-512' was reduced
> > > from 3 to 2 with the patch, while its 'cpu_partial_slabs' was bumped
> > > from 2 to 4. The setting of 'skbuff_head_cache' was kept unchanged.
> > >
> > > And this is consistent with the perf-profile info from 0Day's report,
> > > that the 'list_lock' contention is increased with the patch:
> > >
> > >   13.71%  13.70%  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath  -  -
> > >   5.80%  native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;__unfreeze_partials;skb_release_data;consume_skb;unix_stream_read_generic;unix_stream_recvmsg;sock_recvmsg;sock_read_iter;vfs_read;ksys_read;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_read
> > >   5.56%  native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;get_partial_node.part.0;___slab_alloc.constprop.0;__kmem_cache_alloc_node;__kmalloc_node_track_caller;kmalloc_reserve;__alloc_skb;alloc_skb_with_frags;sock_alloc_send_pskb;unix_stream_sendmsg;sock_write_iter;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_write
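
(The 2 -> 4 jump in 'cpu_partial_slabs' matches how SLUB sizes the per-cpu
partial list, roughly as in the sketch below; the object budget and rounding
are from my memory of current mainline and may differ between versions, so
treat this as an approximation rather than the exact mm/slub.c code.)

#include <stdio.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/*
 * Approximation of SLUB's per-cpu partial list sizing: the cache gets a
 * budget of objects to keep on the list, and that budget is converted into
 * a slab count by assuming every cached slab has oo_objects(s->oo) objects
 * and is half full.
 */
static unsigned int cpu_partial_slabs(unsigned int obj_budget,
				      unsigned int objs_per_slab)
{
	return DIV_ROUND_UP(obj_budget * 2, objs_per_slab);
}

int main(void)
{
	/* budget of ~52 objects for object sizes in the [256, 1024) range
	 * (value from memory, version dependent) */
	unsigned int budget = 52;

	/* kmalloc-cg-512, 4K pages: order 3 -> 64 objs/slab, order 2 -> 32 */
	printf("order 3: cpu_partial_slabs = %u\n", cpu_partial_slabs(budget, 64));
	printf("order 2: cpu_partial_slabs = %u\n", cpu_partial_slabs(budget, 32));
	return 0;
}

So the per-CPU object budget stays roughly the same, but it is spread over
twice as many slabs, i.e. twice as many 'list_lock' round trips in
get_partial_node() / __unfreeze_partials(), which seems to be exactly what
the call stacks above show.
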
> > Oh... neither of the assumptions was true.
> > AFAICS it's a case where decreasing the slab order increases lock
> > contention,
>
> Oh good, that would be the least surprising result, at least :) Yeah I've
> pointed out in my reply to this v2 that this patch should not result in
> decreasing slab order, at least for 4k pages.
>
> The v3/v4 is indeed different in that it only affects 64k pages. But the
> initial goal from v1 to increase the order for 4k is also no longer there.
> Maybe that's fine as there are two things to consider here IMHO. 1) the
> order could be increased for 4k pages for some cache sizes to minimize
> waste (that's what v1 did, but also for 64k where it was not an
> improvement) 2) the orders we have might be too large for 64k pages. Now
> v4 addresses 2) AFAICS. We could return also to 1) separately if it shows
> benefits.
>

Yes, V4 currently targets larger page sizes for SLUB memory wastage
reduction, but I will also work on point 1) later on if it shows some
benefits :)

> In any case it means the benchmark results on v2 are no longer applicable,
> so we could move the discussion to v4:
>
> https://lore.kernel.org/all/20230720102337.2069722-1-jaypatel@xxxxxxxxxxxxx/
>

So any reviews/feedback for V4 would be appreciated.

> Now I noticed in v4 there's only M: folks from the MAINTAINERS slab
> section on Cc: but not R: folks. Next time please Cc: also R: (Hyeonggon
> and Roman). Thanks!
>

Sure, next time I will also add the R: folks :)

Thanks,
Jay Patel

> > The number of cached objects per CPU is mostly the same (not exactly the
> > same, because the cpu slab is not accounted for), but a lower order only
> > increases the number of slabs to process while taking slabs
> > (get_partial_node()), and while flushing the current cpu partial list
> > (put_cpu_partial() -> __unfreeze_partials()).
> >
> > Can we do better in this situation? improve __unfreeze_partials()?
> >
> > > Also I tried to restore the slub_max_order to 3, and the regression
> > > was gone.
> > >
> > >  static unsigned int slub_max_order =
> > > -	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 2;
> > > +	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 3;
> > >  static unsigned int slub_min_objects;
> > >
> > > Thanks,
> > > Feng
> > >
> > > > > I don't have direct data to back it up, and I can try some
> > > > > experiment.
> > > >
> > > > Thank you for taking time for the experiment!
> > > >
> > > > Thanks,
> > > > Hyeonggon
> > > >
> > > > > > > then retest on this test machine:
> > > > > > > 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
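
P.S. For anyone who wants to double check which caches change between
slub_max_order 2 and 3 on a live system without redoing the full profile,
a small sketch that reads the standard SLUB sysfs attributes for the two
caches discussed above; depending on the config some caches may be merged
or an attribute absent, in which case the file is just reported as missing.
(And if I'm not mistaken, Feng's restore experiment can also be done
without rebuilding, via the slub_max_order= kernel command line parameter.)

#include <stdio.h>

/* Dump one /sys/kernel/slab/<cache>/<attr> file, or note that it is missing. */
static void dump_attr(const char *cache, const char *attr)
{
	char path[256];
	char buf[64];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/slab/%s/%s", cache, attr);
	f = fopen(path, "r");
	if (!f) {
		printf("%-55s <missing>\n", path);
		return;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("%-55s %s", path, buf);
	fclose(f);
}

int main(void)
{
	const char *caches[] = { "kmalloc-cg-512", "skbuff_head_cache" };
	const char *attrs[] = { "order", "objs_per_slab", "cpu_partial" };

	for (unsigned int i = 0; i < sizeof(caches) / sizeof(caches[0]); i++)
		for (unsigned int j = 0; j < sizeof(attrs) / sizeof(attrs[0]); j++)
			dump_attr(caches[i], attrs[j]);
	return 0;
}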