> -----Original Message-----
> From: Eric Dumazet <edumazet@xxxxxxxxxx>
> Sent: Thursday, May 11, 2023 3:51 PM
> To: Zhang, Cathy <cathy.zhang@xxxxxxxxx>
> Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>; Linux MM <linux-mm@xxxxxxxxx>;
> Cgroups <cgroups@xxxxxxxxxxxxxxx>; Paolo Abeni <pabeni@xxxxxxxxxx>;
> davem@xxxxxxxxxxxxx; kuba@xxxxxxxxxx; Brandeburg, Jesse
> <jesse.brandeburg@xxxxxxxxx>; Srinivas, Suresh
> <suresh.srinivas@xxxxxxxxx>; Chen, Tim C <tim.c.chen@xxxxxxxxx>; You,
> Lizhen <lizhen.you@xxxxxxxxx>; eric.dumazet@xxxxxxxxx;
> netdev@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as a proper size
>
> On Thu, May 11, 2023 at 9:00 AM Zhang, Cathy <cathy.zhang@xxxxxxxxx>
> wrote:
> >
> > > -----Original Message-----
> > > From: Zhang, Cathy
> > > Sent: Thursday, May 11, 2023 8:53 AM
> > > To: Shakeel Butt <shakeelb@xxxxxxxxxx>
> > > Cc: Eric Dumazet <edumazet@xxxxxxxxxx>; Linux MM
> > > <linux-mm@xxxxxxxxx>; Cgroups <cgroups@xxxxxxxxxxxxxxx>; Paolo Abeni
> > > <pabeni@xxxxxxxxxx>; davem@xxxxxxxxxxxxx; kuba@xxxxxxxxxx;
> > > Brandeburg, Jesse <jesse.brandeburg@xxxxxxxxx>; Srinivas, Suresh
> > > <suresh.srinivas@xxxxxxxxx>; Chen, Tim C <tim.c.chen@xxxxxxxxx>;
> > > You, Lizhen <Lizhen.You@xxxxxxxxx>; eric.dumazet@xxxxxxxxx;
> > > netdev@xxxxxxxxxxxxxxx
> > > Subject: RE: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as
> > > a proper size
> > >
> > > > -----Original Message-----
> > > > From: Shakeel Butt <shakeelb@xxxxxxxxxx>
> > > > Sent: Thursday, May 11, 2023 3:00 AM
> > > > To: Zhang, Cathy <cathy.zhang@xxxxxxxxx>
> > > > Cc: Eric Dumazet <edumazet@xxxxxxxxxx>; Linux MM
> > > > <linux-mm@xxxxxxxxx>; Cgroups <cgroups@xxxxxxxxxxxxxxx>; Paolo Abeni
> > > > <pabeni@xxxxxxxxxx>; davem@xxxxxxxxxxxxx; kuba@xxxxxxxxxx;
> > > > Brandeburg, Jesse <jesse.brandeburg@xxxxxxxxx>; Srinivas, Suresh
> > > > <suresh.srinivas@xxxxxxxxx>; Chen, Tim C <tim.c.chen@xxxxxxxxx>;
> > > > You, Lizhen <lizhen.you@xxxxxxxxx>; eric.dumazet@xxxxxxxxx;
> > > > netdev@xxxxxxxxxxxxxxx
> > > > Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc
> > > > as a proper size
> > > >
> > > > On Wed, May 10, 2023 at 9:09 AM Zhang, Cathy
> > > > <cathy.zhang@xxxxxxxxx> wrote:
> > > > >
> > > > > > > [...]
> > > > > > > >
> > > > > > > > Have you tried to increase batch sizes ?
> > > > > > >
> > > > > > > I just picked up 256 and 1024 for a try, but no help, the
> > > > > > > overhead still exists.
> > > > > >
> > > > > > This makes no sense at all.
> > > > >
> > > > > Eric,
> > > > >
> > > > > I added a pr_info in try_charge_memcg() to print nr_pages if
> > > > > nr_pages >= MEMCG_CHARGE_BATCH, except it prints 64 during the
> > > > > initialization of instances, there is no other output during
> > > > > the running. That means nr_pages is not over 64, I guess that
> > > > > might be the reason why to increase MEMCG_CHARGE_BATCH doesn't
> > > > > affect this case.
> > > > >
> > > >
> > > > I am assuming you increased MEMCG_CHARGE_BATCH to 256 and 1024 but
> > > > that did not help. To me that just means there is a different
> > > > bottleneck in the memcg charging codepath. Can you please share
> > > > the perf profile? Please note that memcg charging does a lot of
> > > > other things as well like updating memcg stats and checking (and
> > > > enforcing) memory.high even if you have not set memory.high.
> > >
> > > Thanks Shakeel! I will check more details on what you mentioned. We
> > > use "sudo perf top -p $(docker inspect -f '{{.State.Pid}}'
> > > memcached_2)" to monitor one of those instances, and also use "sudo
> > > perf top" to check the overhead from system wide.
> >
> > Here is the annotate output of perf top for the three memcg hot paths:
> >
> > Showing cycles for page_counter_try_charge
> >   Events  Pcnt (>=5%)
> >  Percent |      Source code & Disassembly of elf for cycles (543288 samples, percent: local period)
> > ---------------------------------------------------------------------------------------------------
> >     0.00 :   ffffffff8141388d:       mov    %r12,%rax
> >    76.82 :   ffffffff81413890:       lock xadd %rax,(%rbx)
> >    22.10 :   ffffffff81413895:       lea    (%r12,%rax,1),%r15
> >
> >
> > Showing cycles for page_counter_cancel
> >   Events  Pcnt (>=5%)
> >  Percent |      Source code & Disassembly of elf for cycles (1004744 samples, percent: local period)
> > ---------------------------------------------------------------------------------------------------
> >          : 160              return i + xadd(&v->counter, i);
> >    77.42 :   ffffffff81413759:       lock xadd %rax,(%rdi)
> >    22.34 :   ffffffff8141375e:       sub    %rsi,%rax
> >
> >
> > Showing cycles for try_charge_memcg
> >   Events  Pcnt (>=5%)
> >  Percent |      Source code & Disassembly of elf for cycles (256531 samples, percent: local period)
> > ---------------------------------------------------------------------------------------------------
> >          : 22               return __READ_ONCE((v)->counter);
> >    77.53 :   ffffffff8141df86:       mov    0x100(%r13),%rdx
> >          : 2826             READ_ONCE(memcg->memory.high);
> >    19.45 :   ffffffff8141df8d:       mov    0x190(%r13),%rcx
>
> This is rephrasing the info you gave earlier ?

Yep, I just wanted to show some details.

>
> 16.77% [kernel] [k] page_counter_try_charge
> 16.56% [kernel] [k] page_counter_cancel
> 15.65% [kernel] [k] try_charge_memcg
>
> What matters here is a call graph.

Thanks for the explanation. I have re-collected it.
>
> perf record -a -g sleep 5   # While the test is running
> perf report --no-children --stdio

Here is the output from the commands you pasted. It was collected
system-wide; I show only the memcached records. It appears to be a
callee -> caller stack trace:

     9.02%  mc-worker  [kernel.vmlinux]  [k] page_counter_try_charge
            |
            --9.00%--page_counter_try_charge
                      |
                      --9.00%--try_charge_memcg
                                mem_cgroup_charge_skmem
                                |
                                --9.00%--__sk_mem_raise_allocated
                                          __sk_mem_schedule
                                          |
                                          |--5.32%--tcp_try_rmem_schedule
                                          |          tcp_data_queue
                                          |          tcp_rcv_established
                                          |          tcp_v4_do_rcv
                                          |          tcp_v4_rcv
                                          |          ip_protocol_deliver_rcu
                                          |          ip_local_deliver_finish
                                          |          ip_local_deliver
                                          |          ip_rcv
                                          |          __netif_receive_skb_one_core
                                          |          __netif_receive_skb
                                          |          process_backlog
                                          |          __napi_poll
                                          |          net_rx_action
                                          |          __do_softirq
                                          |          |
                                          |          --5.32%--do_softirq.part.0
                                          |                    __local_bh_enable_ip
                                          |                    __dev_queue_xmit
                                          |                    ip_finish_output2
                                          |                    __ip_finish_output
                                          |                    ip_finish_output
                                          |                    ip_output
                                          |                    ip_local_out
                                          |                    __ip_queue_xmit
                                          |                    ip_queue_xmit
                                          |                    __tcp_transmit_skb
                                          |                    tcp_write_xmit
                                          |                    __tcp_push_pending_frames
                                          |                    tcp_push
                                          |                    tcp_sendmsg_locked
                                          |                    tcp_sendmsg
                                          |                    inet_sendmsg
                                          |                    sock_sendmsg
                                          |                    ____sys_sendmsg

     8.98%  mc-worker  [kernel.vmlinux]  [k] page_counter_cancel
            |
            --8.97%--page_counter_cancel
                      |
                      --8.97%--page_counter_uncharge
                                drain_stock
                                __refill_stock
                                refill_stock
                                |
                                --8.91%--try_charge_memcg
                                          mem_cgroup_charge_skmem
                                          |
                                          --8.91%--__sk_mem_raise_allocated
                                                    __sk_mem_schedule
                                                    |
                                                    |--5.41%--tcp_try_rmem_schedule
                                                    |          tcp_data_queue
                                                    |          tcp_rcv_established
                                                    |          tcp_v4_do_rcv
                                                    |          tcp_v4_rcv
                                                    |          ip_protocol_deliver_rcu
                                                    |          ip_local_deliver_finish
                                                    |          ip_local_deliver
                                                    |          ip_rcv
                                                    |          __netif_receive_skb_one_core
                                                    |          __netif_receive_skb
                                                    |          process_backlog
                                                    |          __napi_poll
                                                    |          net_rx_action
                                                    |          __do_softirq
                                                    |          do_softirq.part.0
                                                    |          __local_bh_enable_ip
                                                    |          __dev_queue_xmit
                                                    |          ip_finish_output2
                                                    |          __ip_finish_output
                                                    |          ip_finish_output
                                                    |          ip_output
                                                    |          ip_local_out
                                                    |          __ip_queue_xmit
                                                    |          ip_queue_xmit
                                                    |          __tcp_transmit_skb
                                                    |          tcp_write_xmit
                                                    |          __tcp_push_pending_frames
                                                    |          tcp_push
                                                    |          tcp_sendmsg_locked
                                                    |          tcp_sendmsg
                                                    |          inet_sendmsg

     8.78%  mc-worker  [kernel.vmlinux]  [k] try_charge_memcg
            |
            --8.77%--try_charge_memcg
                      |
                      --8.76%--mem_cgroup_charge_skmem
                                |
                                --8.76%--__sk_mem_raise_allocated
                                          __sk_mem_schedule
                                          |
                                          |--5.21%--tcp_try_rmem_schedule
                                          |          tcp_data_queue
                                          |          tcp_rcv_established
                                          |          tcp_v4_do_rcv
                                          |          |
                                          |          --5.21%--tcp_v4_rcv
                                          |                    ip_protocol_deliver_rcu
                                          |                    ip_local_deliver_finish
                                          |                    ip_local_deliver
                                          |                    ip_rcv
                                          |                    __netif_receive_skb_one_core
                                          |                    __netif_receive_skb
                                          |                    process_backlog
                                          |                    __napi_poll
                                          |                    net_rx_action
                                          |                    __do_softirq
                                          |                    |
                                          |                    --5.21%--do_softirq.part.0
                                          |                              __local_bh_enable_ip
                                          |                              __dev_queue_xmit
                                          |                              ip_finish_output2
                                          |                              __ip_finish_output
                                          |                              ip_finish_output
                                          |                              ip_output
                                          |                              ip_local_out
                                          |                              __ip_queue_xmit
                                          |                              ip_queue_xmit
                                          |                              __tcp_transmit_skb
                                          |                              tcp_write_xmit
                                          |                              __tcp_push_pending_frames
                                          |                              tcp_push
                                          |                              tcp_sendmsg_locked
                                          |                              tcp_sendmsg
                                          |                              inet_sendmsg
                                          |                              sock_sendmsg
                                          |                              ____sys_sendmsg
                                          |                              ___sys_sendmsg
                                          |                              __sys_sendmsg

>
> What precise kernel are you using btw ?

The above data was collected on the 'net-next/main' git branch:
base-commit: ed23734c23d2 ("Merge tag 'net-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
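
In the call graph above, page_counter_cancel is reached through
drain_stock <- __refill_stock <- refill_stock <- try_charge_memcg. As a
rough illustration of why that path can be expensive, here is a toy
user-space model of a batch cache that holds pages for a single group at
a time. It is a sketch under stated assumptions, not the kernel code:
struct group, struct stock, charge(), drain() and count_shared_ops() are
hypothetical names, CHARGE_BATCH only stands in for MEMCG_CHARGE_BATCH,
and whether this pattern is what actually happens in the test above is
not confirmed by the profile alone.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define CHARGE_BATCH 64   /* stands in for MEMCG_CHARGE_BATCH (assumption) */
#define MAX_GROUPS   8

/* Hypothetical stand-in for a memcg: one shared atomic page counter. */
struct group {
    atomic_long usage;    /* pages charged, shared by every CPU */
    long shared_ops;      /* tally of atomic RMW operations on usage */
};

/* Hypothetical stand-in for a per-CPU stock: caches ONE group only. */
struct stock {
    struct group *cached;
    long nr_pages;
};

/* Return the cached surplus to its group: one atomic subtract. */
static void drain(struct stock *s)
{
    if (s->cached && s->nr_pages) {
        atomic_fetch_sub(&s->cached->usage, s->nr_pages);
        s->cached->shared_ops++;
    }
    s->cached = NULL;
    s->nr_pages = 0;
}

static void charge(struct stock *s, struct group *g, long nr_pages)
{
    assert(nr_pages > 0 && nr_pages <= CHARGE_BATCH);

    /* Fast path: serve the request from the local cache, no atomics. */
    if (s->cached == g && s->nr_pages >= nr_pages) {
        s->nr_pages -= nr_pages;
        return;
    }

    /* Slow path: charge a whole batch against the shared counter. */
    atomic_fetch_add(&g->usage, CHARGE_BATCH);
    g->shared_ops++;

    /* Keep the surplus locally; evict another group's surplus first. */
    if (s->cached != g)
        drain(s);
    s->cached = g;
    s->nr_pages += CHARGE_BATCH - nr_pages;
}

/* Charge one page at a time, round-robin over ngroups groups that all
 * share a single stock, and return the total number of atomic
 * operations performed on the shared counters. */
long count_shared_ops(int ngroups, int ncharges)
{
    struct group groups[MAX_GROUPS];
    struct stock s = { NULL, 0 };
    long total = 0;

    assert(ngroups >= 1 && ngroups <= MAX_GROUPS);
    for (int i = 0; i < ngroups; i++) {
        atomic_init(&groups[i].usage, 0);
        groups[i].shared_ops = 0;
    }
    for (int i = 0; i < ncharges; i++)
        charge(&s, &groups[i % ngroups], 1);
    for (int i = 0; i < ngroups; i++)
        total += groups[i].shared_ops;
    return total;
}
```

In this model, count_shared_ops(1, 128) returns 2, while
count_shared_ops(2, 128) returns 255: once two groups alternate on the
same stock, nearly every charge pays one atomic add plus one atomic
subtract, and the batch size stops mattering. That would be consistent
with larger MEMCG_CHARGE_BATCH values not helping, but it remains a
hypothesis to check against the real workload.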