On Thu, May 11, 2023 at 6:24 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> On Thu, May 11, 2023 at 2:27 AM Zhang, Cathy <cathy.zhang@xxxxxxxxx> wrote:
> >
> >
> > > [...]
> >
> > Here is the output with the command you paste, it's from system wide,
> > I only show pieces of memcached records, and it seems to be a
> > callee -> caller stack trace:
> >
> >   9.02%  mc-worker  [kernel.vmlinux]  [k] page_counter_try_charge
> >           |
> >           --9.00%--page_counter_try_charge
> >                     |
> >                     --9.00%--try_charge_memcg
> >                               mem_cgroup_charge_skmem
> >                               |
> >                               --9.00%--__sk_mem_raise_allocated
> >                                         __sk_mem_schedule
> >                                         |
> >                                         |--5.32%--tcp_try_rmem_schedule
> >                                         |          tcp_data_queue
> >                                         |          tcp_rcv_established
> >                                         |          tcp_v4_do_rcv
> >                                         |          tcp_v4_rcv
> >                                         |          ip_protocol_deliver_rcu
> >                                         |          ip_local_deliver_finish
> >                                         |          ip_local_deliver
> >                                         |          ip_rcv
> >                                         |          __netif_receive_skb_one_core
> >                                         |          __netif_receive_skb
> >                                         |          process_backlog
> >                                         |          __napi_poll
> >                                         |          net_rx_action
> >                                         |          __do_softirq
> >                                         |          |
> >                                         |          --5.32%--do_softirq.part.0
> >                                         |                    __local_bh_enable_ip
> >                                         |                    __dev_queue_xmit
> >                                         |                    ip_finish_output2
> >                                         |                    __ip_finish_output
> >                                         |                    ip_finish_output
> >                                         |                    ip_output
> >                                         |                    ip_local_out
> >                                         |                    __ip_queue_xmit
> >                                         |                    ip_queue_xmit
> >                                         |                    __tcp_transmit_skb
> >                                         |                    tcp_write_xmit
> >                                         |                    __tcp_push_pending_frames
> >                                         |                    tcp_push
> >                                         |                    tcp_sendmsg_locked
> >                                         |                    tcp_sendmsg
> >                                         |                    inet_sendmsg
> >                                         |                    sock_sendmsg
> >                                         |                    ____sys_sendmsg
> >
> >   8.98%  mc-worker  [kernel.vmlinux]  [k] page_counter_cancel
> >           |
> >           --8.97%--page_counter_cancel
> >                     |
> >                     --8.97%--page_counter_uncharge
> >                               drain_stock
> >                               __refill_stock
> >                               refill_stock
> >                               |
> >                               --8.91%--try_charge_memcg
> >                                         mem_cgroup_charge_skmem
> >                                         |
> >                                         --8.91%--__sk_mem_raise_allocated
> >                                                   __sk_mem_schedule
> >                                                   |
> >                                                   |--5.41%--tcp_try_rmem_schedule
> >                                                   |          tcp_data_queue
> >                                                   |          tcp_rcv_established
> >                                                   |          tcp_v4_do_rcv
> >                                                   |          tcp_v4_rcv
> >                                                   |          ip_protocol_deliver_rcu
> >                                                   |          ip_local_deliver_finish
> >                                                   |          ip_local_deliver
> >                                                   |          ip_rcv
> >                                                   |          __netif_receive_skb_one_core
> >                                                   |          __netif_receive_skb
> >                                                   |          process_backlog
> >                                                   |          __napi_poll
> >                                                   |          net_rx_action
> >                                                   |          __do_softirq
> >                                                   |          do_softirq.part.0
> >                                                   |          __local_bh_enable_ip
> >                                                   |          __dev_queue_xmit
> >                                                   |          ip_finish_output2
> >                                                   |          __ip_finish_output
> >                                                   |          ip_finish_output
> >                                                   |          ip_output
> >                                                   |          ip_local_out
> >                                                   |          __ip_queue_xmit
> >                                                   |          ip_queue_xmit
> >                                                   |          __tcp_transmit_skb
> >                                                   |          tcp_write_xmit
> >                                                   |          __tcp_push_pending_frames
> >                                                   |          tcp_push
> >                                                   |          tcp_sendmsg_locked
> >                                                   |          tcp_sendmsg
> >                                                   |          inet_sendmsg
> >
> >   8.78%  mc-worker  [kernel.vmlinux]  [k] try_charge_memcg
> >           |
> >           --8.77%--try_charge_memcg
> >                     |
> >                     --8.76%--mem_cgroup_charge_skmem
> >                               |
> >                               --8.76%--__sk_mem_raise_allocated
> >                                         __sk_mem_schedule
> >                                         |
> >                                         |--5.21%--tcp_try_rmem_schedule
> >                                         |          tcp_data_queue
> >                                         |          tcp_rcv_established
> >                                         |          tcp_v4_do_rcv
> >                                         |          |
> >                                         |          --5.21%--tcp_v4_rcv
> >                                         |                    ip_protocol_deliver_rcu
> >                                         |                    ip_local_deliver_finish
> >                                         |                    ip_local_deliver
> >                                         |                    ip_rcv
> >                                         |                    __netif_receive_skb_one_core
> >                                         |                    __netif_receive_skb
> >                                         |                    process_backlog
> >                                         |                    __napi_poll
> >                                         |                    net_rx_action
> >                                         |                    __do_softirq
> >                                         |                    |
> >                                         |                    --5.21%--do_softirq.part.0
> >                                         |                              __local_bh_enable_ip
> >                                         |                              __dev_queue_xmit
> >                                         |                              ip_finish_output2
> >                                         |                              __ip_finish_output
> >                                         |                              ip_finish_output
> >                                         |                              ip_output
> >                                         |                              ip_local_out
> >                                         |                              __ip_queue_xmit
> >                                         |                              ip_queue_xmit
> >                                         |                              __tcp_transmit_skb
> >                                         |                              tcp_write_xmit
> >                                         |                              __tcp_push_pending_frames
> >                                         |                              tcp_push
> >                                         |                              tcp_sendmsg_locked
> >                                         |                              tcp_sendmsg
> >                                         |                              inet_sendmsg
> >                                         |                              sock_sendmsg
> >                                         |                              ____sys_sendmsg
> >                                         |                              ___sys_sendmsg
> >                                         |                              __sys_sendmsg
> >
> >
>
> I am suspecting we are doing a lot of charging for a specific memcg on
> one CPU (or a set of CPUs) and a lot of uncharging on the different
> CPU (or a different set of CPUs) and thus both of these code paths are
> hitting the slow path a lot.
>
> Eric, I remember we have an optimization in the networking stack that
> tries to free the memory on the same CPU where the allocation
> happened. Is that optimization enabled for this code path? Or maybe we
> should do something similar in memcg code (with the assumption that my
> suspicion is correct).

The suspect part is really:

>   8.98%  mc-worker  [kernel.vmlinux]  [k] page_counter_cancel
>           |
>           --8.97%--page_counter_cancel
>                     |
>                     --8.97%--page_counter_uncharge
>                               drain_stock
>                               __refill_stock
>                               refill_stock
>                               |
>                               --8.91%--try_charge_memcg
>                                         mem_cgroup_charge_skmem
>                                         |
>                                         --8.91%--__sk_mem_raise_allocated
>                                                   __sk_mem_schedule

Shakeel, networking has a per-cpu cache, of +/- 1MB. Even with
asymmetric alloc/free, this would mean that a 100Gbit NIC would require
something like 25,000 operations on the shared cache line per second.
Hardly an issue I think.

memcg does not seem to have an equivalent strategy?
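For illustration, the batching scheme Eric is referring to can be sketched in
plain C: each CPU (modelled here as a thread) accumulates charges and
uncharges in a local counter and only folds the delta into the shared atomic
counter once it crosses roughly +/- 1MB, so even at 100Gbit/s the contended
cache line is touched only a few tens of thousands of times per second. This
is a standalone userspace sketch, not the kernel code under discussion; the
names (mem_account, shared_pages, PCPU_RESERVE) and the exact threshold
handling are invented for the example.

/*
 * Standalone sketch of a per-cpu (per-thread) +/- 1MB cache in front of
 * a shared memory-accounting counter.  Illustrative only: names and
 * threshold handling are invented and do not match the kernel sources.
 */
#include <stdatomic.h>
#include <stdio.h>

#define PAGE_SIZE     4096L
#define PCPU_RESERVE  ((1L << 20) / PAGE_SIZE)   /* ~1MB worth of pages */

static atomic_long shared_pages;           /* contended, shared cache line */
static _Thread_local long cached_pages;    /* local delta, uncontended     */

/* Charge (val > 0) or uncharge (val < 0) val pages against the shared pool. */
static void mem_account(long val)
{
        cached_pages += val;

        /* Touch the shared counter only once the local delta crosses
         * +/- 1MB, so the common case stays on this CPU/thread. */
        if (cached_pages > PCPU_RESERVE || cached_pages < -PCPU_RESERVE) {
                atomic_fetch_add(&shared_pages, cached_pages);
                cached_pages = 0;
        }
}

int main(void)
{
        /* Eric's arithmetic: ~12.5GB/s at 100Gbit/s, batched in ~1MB
         * steps on both the alloc and the free side, is on the order of
         * 25,000 shared-counter updates per second. */
        double bytes_per_sec = 100e9 / 8;
        double shared_ops = 2.0 * bytes_per_sec / (double)(1L << 20);

        printf("~%.0f shared cache line operations per second\n", shared_ops);

        /* Charge and then uncharge 64KB (16 pages) repeatedly; most of
         * these calls never leave the thread-local counter. */
        for (int i = 0; i < 1000; i++)
                mem_account(16);
        for (int i = 0; i < 1000; i++)
                mem_account(-16);

        printf("shared counter: %ld pages, local delta: %ld pages\n",
               atomic_load(&shared_pages), cached_pages);
        return 0;
}

The real kernel mechanism differs in detail (true per-cpu counters, draining
rules, and so on), but the trade-off is the same: a larger local batch means
fewer updates to the shared counter, at the cost of some per-cpu slack in the
accounting.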