On Thu, May 11, 2023 at 6:24 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> On Thu, May 11, 2023 at 2:27 AM Zhang, Cathy <cathy.zhang@xxxxxxxxx> wrote:
> >
> >
> > > [...]
> >
> > Here is the output with the command you paste, it's from system wide,
> > I only show pieces of memcached records, and it seems to be a
> > callee -> caller stack trace:
> >
> >   9.02%  mc-worker  [kernel.vmlinux]  [k] page_counter_try_charge
> >           |
> >           --9.00%--page_counter_try_charge
> >                     |
> >                     --9.00%--try_charge_memcg
> >                               mem_cgroup_charge_skmem
> >                               |
> >                               --9.00%--__sk_mem_raise_allocated
> >                                         __sk_mem_schedule
> >                                         |
> >                                         |--5.32%--tcp_try_rmem_schedule
> >                                         |          tcp_data_queue
> >                                         |          tcp_rcv_established
> >                                         |          tcp_v4_do_rcv
> >                                         |          tcp_v4_rcv
> >                                         |          ip_protocol_deliver_rcu
> >                                         |          ip_local_deliver_finish
> >                                         |          ip_local_deliver
> >                                         |          ip_rcv
> >                                         |          __netif_receive_skb_one_core
> >                                         |          __netif_receive_skb
> >                                         |          process_backlog
> >                                         |          __napi_poll
> >                                         |          net_rx_action
> >                                         |          __do_softirq
> >                                         |          |
> >                                         |          --5.32%--do_softirq.part.0
> >                                         |                    __local_bh_enable_ip
> >                                         |                    __dev_queue_xmit
> >                                         |                    ip_finish_output2
> >                                         |                    __ip_finish_output
> >                                         |                    ip_finish_output
> >                                         |                    ip_output
> >                                         |                    ip_local_out
> >                                         |                    __ip_queue_xmit
> >                                         |                    ip_queue_xmit
> >                                         |                    __tcp_transmit_skb
> >                                         |                    tcp_write_xmit
> >                                         |                    __tcp_push_pending_frames
> >                                         |                    tcp_push
> >                                         |                    tcp_sendmsg_locked
> >                                         |                    tcp_sendmsg
> >                                         |                    inet_sendmsg
> >                                         |                    sock_sendmsg
> >                                         |                    ____sys_sendmsg
> >
> >   8.98%  mc-worker  [kernel.vmlinux]  [k] page_counter_cancel
> >           |
> >           --8.97%--page_counter_cancel
> >                     |
> >                     --8.97%--page_counter_uncharge
> >                               drain_stock
> >                               __refill_stock
> >                               refill_stock
> >                               |
> >                               --8.91%--try_charge_memcg
> >                                         mem_cgroup_charge_skmem
> >                                         |
> >                                         --8.91%--__sk_mem_raise_allocated
> >                                                   __sk_mem_schedule
> >                                                   |
> >                                                   |--5.41%--tcp_try_rmem_schedule
> >                                                   |          tcp_data_queue
> >                                                   |          tcp_rcv_established
> >                                                   |          tcp_v4_do_rcv
> >                                                   |          tcp_v4_rcv
> >                                                   |          ip_protocol_deliver_rcu
> >                                                   |          ip_local_deliver_finish
> >                                                   |          ip_local_deliver
> >                                                   |          ip_rcv
> >                                                   |          __netif_receive_skb_one_core
> >                                                   |          __netif_receive_skb
> >                                                   |          process_backlog
> >                                                   |          __napi_poll
> >                                                   |          net_rx_action
> >                                                   |          __do_softirq
> >                                                   |          do_softirq.part.0
> >                                                   |          __local_bh_enable_ip
> >                                                   |          __dev_queue_xmit
> >                                                   |          ip_finish_output2
> >                                                   |          __ip_finish_output
> >                                                   |          ip_finish_output
> >                                                   |          ip_output
> >                                                   |          ip_local_out
> >                                                   |          __ip_queue_xmit
> >                                                   |          ip_queue_xmit
> >                                                   |          __tcp_transmit_skb
> >                                                   |          tcp_write_xmit
> >                                                   |          __tcp_push_pending_frames
> >                                                   |          tcp_push
> >                                                   |          tcp_sendmsg_locked
> >                                                   |          tcp_sendmsg
> >                                                   |          inet_sendmsg
> >
> >   8.78%  mc-worker  [kernel.vmlinux]  [k] try_charge_memcg
> >           |
> >           --8.77%--try_charge_memcg
> >                     |
> >                     --8.76%--mem_cgroup_charge_skmem
> >                               |
> >                               --8.76%--__sk_mem_raise_allocated
> >                                         __sk_mem_schedule
> >                                         |
> >                                         |--5.21%--tcp_try_rmem_schedule
> >                                         |          tcp_data_queue
> >                                         |          tcp_rcv_established
> >                                         |          tcp_v4_do_rcv
> >                                         |          |
> >                                         |          --5.21%--tcp_v4_rcv
> >                                         |                    ip_protocol_deliver_rcu
> >                                         |                    ip_local_deliver_finish
> >                                         |                    ip_local_deliver
> >                                         |                    ip_rcv
> >                                         |                    __netif_receive_skb_one_core
> >                                         |                    __netif_receive_skb
> >                                         |                    process_backlog
> >                                         |                    __napi_poll
> >                                         |                    net_rx_action
> >                                         |                    __do_softirq
> >                                         |                    |
> >                                         |                    --5.21%--do_softirq.part.0
> >                                         |                              __local_bh_enable_ip
> >                                         |                              __dev_queue_xmit
> >                                         |                              ip_finish_output2
> >                                         |                              __ip_finish_output
> >                                         |                              ip_finish_output
> >                                         |                              ip_output
> >                                         |                              ip_local_out
> >                                         |                              __ip_queue_xmit
> >                                         |                              ip_queue_xmit
> >                                         |                              __tcp_transmit_skb
> >                                         |                              tcp_write_xmit
> >                                         |                              __tcp_push_pending_frames
> >                                         |                              tcp_push
> >                                         |                              tcp_sendmsg_locked
> >                                         |                              tcp_sendmsg
> >                                         |                              inet_sendmsg
> >                                         |                              sock_sendmsg
> >                                         |                              ____sys_sendmsg
> >                                         |                              ___sys_sendmsg
> >                                         |                              __sys_sendmsg
> >
> >
>
> I am suspecting we are doing a lot of charging for a specific memcg on
> one CPU (or a set of CPUs) and a lot of uncharging on the different
> CPU (or a different set of CPUs) and thus both of these code paths are
> hitting the slow path a lot.
>
> Eric, I remember we have an optimization in the networking stack that
> tries to free the memory on the same CPU where the allocation
> happened. Is that optimization enabled for this code path? Or maybe we
> should do something similar in memcg code (with the assumption that my
> suspicion is correct).

The suspect part is really:

>   8.98%  mc-worker  [kernel.vmlinux]  [k] page_counter_cancel
>           |
>           --8.97%--page_counter_cancel
>                     |
>                     --8.97%--page_counter_uncharge
>                               drain_stock
>                               __refill_stock
>                               refill_stock
>                               |
>                               --8.91%--try_charge_memcg
>                                         mem_cgroup_charge_skmem
>                                         |
>                                         --8.91%--__sk_mem_raise_allocated
>                                                   __sk_mem_schedule

Shakeel, networking has a per-cpu cache, of +/- 1MB. Even with
asymmetric alloc/free, this would mean that a 100Gbit NIC would require
something like 25,000 operations on the shared cache line per second.
Hardly an issue I think.

memcg does not seem to have an equivalent strategy?
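For illustration, the batching scheme Eric is referring to can be sketched in
plain C: each CPU (modelled here as a thread) accumulates charges and
uncharges in a local counter and only folds the delta into the shared atomic
counter once it crosses roughly +/- 1MB, so even at 100Gbit/s the contended
cache line is touched only a few tens of thousands of times per second. This
is a standalone userspace sketch, not the kernel code under discussion; the
names (mem_account, shared_pages, PCPU_RESERVE) and the exact threshold
handling are invented for the example.

/*
 * Standalone sketch of a per-cpu (per-thread) +/- 1MB cache in front of
 * a shared memory-accounting counter.  Illustrative only: names and
 * threshold handling are invented and do not match the kernel sources.
 */
#include <stdatomic.h>
#include <stdio.h>

#define PAGE_SIZE     4096L
#define PCPU_RESERVE  ((1L << 20) / PAGE_SIZE)   /* ~1MB worth of pages */

static atomic_long shared_pages;           /* contended, shared cache line */
static _Thread_local long cached_pages;    /* local delta, uncontended     */

/* Charge (val > 0) or uncharge (val < 0) val pages against the shared pool. */
static void mem_account(long val)
{
        cached_pages += val;

        /* Touch the shared counter only once the local delta crosses
         * +/- 1MB, so the common case stays on this CPU/thread. */
        if (cached_pages > PCPU_RESERVE || cached_pages < -PCPU_RESERVE) {
                atomic_fetch_add(&shared_pages, cached_pages);
                cached_pages = 0;
        }
}

int main(void)
{
        /* Eric's arithmetic: ~12.5GB/s at 100Gbit/s, batched in ~1MB
         * steps on both the alloc and the free side, is on the order of
         * 25,000 shared-counter updates per second. */
        double bytes_per_sec = 100e9 / 8;
        double shared_ops = 2.0 * bytes_per_sec / (double)(1L << 20);

        printf("~%.0f shared cache line operations per second\n", shared_ops);

        /* Charge and then uncharge 64KB (16 pages) repeatedly; most of
         * these calls never leave the thread-local counter. */
        for (int i = 0; i < 1000; i++)
                mem_account(16);
        for (int i = 0; i < 1000; i++)
                mem_account(-16);

        printf("shared counter: %ld pages, local delta: %ld pages\n",
               atomic_load(&shared_pages), cached_pages);
        return 0;
}

The real kernel mechanism differs in detail (true per-cpu counters, draining
rules, and so on), but the trade-off is the same: a larger local batch means
fewer updates to the shared counter, at the cost of some per-cpu slack in the
accounting.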