Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as a proper size

Shakeel Butt <shakeelb@xxxxxxxxxx> · Thu, 11 May 2023 09:23:50 -0700

On Thu, May 11, 2023 at 2:27 AM Zhang, Cathy <cathy.zhang@xxxxxxxxx> wrote:
>
>
>
[...]
>
> Here is the output with the command you paste, it's from system wide,
> I only show pieces of memcached records, and it seems to be a
> callee -> caller stack trace:
>
>      9.02%  mc-worker        [kernel.vmlinux]          [k] page_counter_try_charge
>             |
>              --9.00%--page_counter_try_charge
>                        |
>                         --9.00%--try_charge_memcg
>                                   mem_cgroup_charge_skmem
>                                   |
>                                    --9.00%--__sk_mem_raise_allocated
>                                              __sk_mem_schedule
>                                              |
>                                              |--5.32%--tcp_try_rmem_schedule
>                                              |          tcp_data_queue
>                                              |          tcp_rcv_established
>                                              |          tcp_v4_do_rcv
>                                              |          tcp_v4_rcv
>                                              |          ip_protocol_deliver_rcu
>                                              |          ip_local_deliver_finish
>                                              |          ip_local_deliver
>                                              |          ip_rcv
>                                              |          __netif_receive_skb_one_core
>                                              |          __netif_receive_skb
>                                              |          process_backlog
>                                              |          __napi_poll
>                                              |          net_rx_action
>                                              |          __do_softirq
>                                              |          |
>                                              |           --5.32%--do_softirq.part.0
>                                              |                     __local_bh_enable_ip
>                                              |                     __dev_queue_xmit
>                                              |                     ip_finish_output2
>                                              |                     __ip_finish_output
>                                              |                     ip_finish_output
>                                              |                     ip_output
>                                              |                     ip_local_out
>                                              |                     __ip_queue_xmit
>                                              |                     ip_queue_xmit
>                                              |                     __tcp_transmit_skb
>                                              |                     tcp_write_xmit
>                                              |                     __tcp_push_pending_frames
>                                              |                     tcp_push
>                                              |                     tcp_sendmsg_locked
>                                              |                     tcp_sendmsg
>                                              |                     inet_sendmsg
>                                              |                     sock_sendmsg
>                                              |                     ____sys_sendmsg
>
>      8.98%  mc-worker        [kernel.vmlinux]          [k] page_counter_cancel
>             |
>              --8.97%--page_counter_cancel
>                        |
>                         --8.97%--page_counter_uncharge
>                                   drain_stock
>                                   __refill_stock
>                                   refill_stock
>                                   |
>                                    --8.91%--try_charge_memcg
>                                              mem_cgroup_charge_skmem
>                                              |
>                                               --8.91%--__sk_mem_raise_allocated
>                                                         __sk_mem_schedule
>                                                         |
>                                                         |--5.41%--tcp_try_rmem_schedule
>                                                         |          tcp_data_queue
>                                                         |          tcp_rcv_established
>                                                         |          tcp_v4_do_rcv
>                                                         |          tcp_v4_rcv
>                                                         |          ip_protocol_deliver_rcu
>                                                         |          ip_local_deliver_finish
>                                                         |          ip_local_deliver
>                                                         |          ip_rcv
>                                                         |          __netif_receive_skb_one_core
>                                                         |          __netif_receive_skb
>                                                         |          process_backlog
>                                                         |          __napi_poll
>                                                         |          net_rx_action
>                                                         |          __do_softirq
>                                                         |          do_softirq.part.0
>                                                         |          __local_bh_enable_ip
>                                                         |          __dev_queue_xmit
>                                                         |          ip_finish_output2
>                                                         |          __ip_finish_output
>                                                         |          ip_finish_output
>                                                         |          ip_output
>                                                         |          ip_local_out
>                                                         |          __ip_queue_xmit
>                                                         |          ip_queue_xmit
>                                                         |          __tcp_transmit_skb
>                                                         |          tcp_write_xmit
>                                                         |          __tcp_push_pending_frames
>                                                         |          tcp_push
>                                                         |          tcp_sendmsg_locked
>                                                         |          tcp_sendmsg
>                                                         |          inet_sendmsg
>
>      8.78%  mc-worker        [kernel.vmlinux]          [k] try_charge_memcg
>             |
>              --8.77%--try_charge_memcg
>                        |
>                         --8.76%--mem_cgroup_charge_skmem
>                                   |
>                                    --8.76%--__sk_mem_raise_allocated
>                                              __sk_mem_schedule
>                                              |
>                                              |--5.21%--tcp_try_rmem_schedule
>                                              |          tcp_data_queue
>                                              |          tcp_rcv_established
>                                              |          tcp_v4_do_rcv
>                                              |          |
>                                              |           --5.21%--tcp_v4_rcv
>                                              |                     ip_protocol_deliver_rcu
>                                              |                     ip_local_deliver_finish
>                                              |                     ip_local_deliver
>                                              |                     ip_rcv
>                                              |                     __netif_receive_skb_one_core
>                                              |                     __netif_receive_skb
>                                              |                     process_backlog
>                                              |                     __napi_poll
>                                              |                     net_rx_action
>                                              |                     __do_softirq
>                                              |                     |
>                                              |                      --5.21%--do_softirq.part.0
>                                              |                                __local_bh_enable_ip
>                                              |                                __dev_queue_xmit
>                                              |                                ip_finish_output2
>                                              |                                __ip_finish_output
>                                              |                                ip_finish_output
>                                              |                                ip_output
>                                              |                                ip_local_out
>                                              |                                __ip_queue_xmit
>                                              |                                ip_queue_xmit
>                                              |                                __tcp_transmit_skb
>                                              |                                tcp_write_xmit
>                                              |                                __tcp_push_pending_frames
>                                              |                                tcp_push
>                                              |                                tcp_sendmsg_locked
>                                              |                                tcp_sendmsg
>                                              |                                inet_sendmsg
>                                              |                                sock_sendmsg
>                                              |                                ____sys_sendmsg
>                                              |                                ___sys_sendmsg
>                                              |                                __sys_sendmsg
>
>
> >

I suspect we are doing a lot of charging for a specific memcg on one
CPU (or a set of CPUs) and a lot of uncharging on a different CPU (or
a different set of CPUs), so both of these code paths are hitting the
slow path a lot.
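
To make the suspicion concrete, here is a toy model of a per-CPU charge
cache. It is only a sketch under stated assumptions, not the kernel
code: the class, the function names and the batch size of 64 pages are
all made up for illustration. It compares how often the shared counter
is touched when charge and uncharge run on the same CPU versus on
different CPUs.

```python
# Toy model of a per-CPU charge cache ("stock"). NOT the kernel code.
# BATCH and every name below are illustrative assumptions.
BATCH = 64  # assumed batch size, in pages


class ToyMemcg:
    def __init__(self, nr_cpus):
        self.stock = [0] * nr_cpus  # pages pre-charged on each CPU
        self.counter = 0            # shared counter, contended in real life
        self.slow_hits = 0          # number of times the shared counter is touched

    def charge(self, cpu, pages):
        if self.stock[cpu] >= pages:
            # Fast path: served from the local stock only.
            self.stock[cpu] -= pages
            return
        # Slow path: charge a whole batch against the shared counter
        # and keep the leftover in the local stock.
        self.slow_hits += 1
        batch = max(BATCH, pages)
        self.counter += batch
        self.stock[cpu] += batch - pages

    def uncharge(self, cpu, pages):
        # Pages are returned to the stock of the CPU doing the uncharge.
        self.stock[cpu] += pages
        if self.stock[cpu] > BATCH:
            # Overflow: drain the local stock back to the shared counter.
            self.slow_hits += 1
            self.counter -= self.stock[cpu]
            self.stock[cpu] = 0


def run(charge_cpu, uncharge_cpu, rounds=10_000, pages=1):
    memcg = ToyMemcg(nr_cpus=2)
    for _ in range(rounds):
        memcg.charge(charge_cpu, pages)
        memcg.uncharge(uncharge_cpu, pages)
    return memcg.slow_hits


same_cpu = run(charge_cpu=0, uncharge_cpu=0)
cross_cpu = run(charge_cpu=0, uncharge_cpu=1)
print("same CPU slow-path hits: ", same_cpu)
print("cross CPU slow-path hits:", cross_cpu)
```

In this model the same-CPU case keeps recycling the local stock, while
the cross-CPU case keeps emptying the stock on one side and overflowing
it on the other, so both sides keep going back to the shared counter.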

Eric, I remember we have an optimization in the networking stack that
tries to free memory on the same CPU where the allocation happened.
Is that optimization enabled for this code path? If not, maybe we
should do something similar in the memcg code (assuming my suspicion
is correct).