On Sun, May 07, 2023 at 07:08:00PM -0700, Cathy Zhang wrote:
> Before commit 4890b686f408 ("net: keep sk->sk_forward_alloc as small
> as possible"), each TCP socket could forward allocate up to 2 MB of
> memory, so tcp_memory_allocated might hit the tcp memory limit quite
> soon.

Not just the system-level tcp memory limit: we have actually seen
unneeded and unexpected memcg OOMs in production, and commit
4890b686f408 fixes those OOMs as well.

> To reduce the memory pressure, that commit keeps sk->sk_forward_alloc
> as small as possible, which will be less than one page size if
> SO_RESERVE_MEM is not specified.
>
> However, with commit 4890b686f408 ("net: keep sk->sk_forward_alloc as
> small as possible"), memcg charge hot paths are observed while the
> system is stressed with a large number of connections. That is
> because sk->sk_forward_alloc is too small, always less than
> skb->truesize, so network handlers like tcp_rcv_established() must
> jump to the slow path more frequently to increase
> sk->sk_forward_alloc. Each memory allocation then triggers a memcg
> charge, and perf top shows the following contention paths on the busy
> system:
>
>   16.77%  [kernel]  [k] page_counter_try_charge
>   16.56%  [kernel]  [k] page_counter_cancel
>   15.65%  [kernel]  [k] try_charge_memcg
>
> In order to avoid the memcg overhead and performance penalty,

IMO this is not the right place to fix the memcg performance overhead,
specifically because it will re-introduce the memcg OOM issue. Please
fix the memcg overhead in the memcg code. Please also share a detailed
profile of the memcg code; I can help in brainstorming and reviewing
the fix.

> sk->sk_forward_alloc should be kept at a proper size instead of as
> small as possible. Keep up to 64KB of memory from reclaim when
> uncharging sk_buff memory, which is close to the maximum size of an
> sk_buff. This helps reduce the frequency of memory allocations during
> a TCP connection. The original reclaim threshold for per-socket
> reserved memory was 2MB, so the extra memory reserved now is about 32
> times less than before commit 4890b686f408 ("net: keep
> sk->sk_forward_alloc as small as possible").
>
> Run memcached with memtier_benchmark to verify the optimization fix.
> 8 server-client pairs are created with a bridge network on localhost;
> the server and client of each pair share 28 logical CPUs.
>
> Results (average of 5 runs)
> RPS (with/without patch)    +2.07x
>

Do you have regression data from any production workload? Please keep
in mind that we (the MM subsystem) often accept microbenchmark
regressions rather than take on complicated optimizations. So, if
there is a real production regression, please be very explicit about
it.
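
For clarity about which code path we are discussing, here is a
simplified sketch of the uncharge/reclaim path after commit
4890b686f408, written from my reading of include/net/sock.h rather
than copied from the tree; SK_RECLAIM_THRESHOLD is only a placeholder
name for the 64KB limit this patch proposes, not necessarily what the
patch calls it:

	/* Placeholder name for the proposed 64KB per-socket cache. */
	#define SK_RECLAIM_THRESHOLD	(1 << 16)

	static inline void sk_mem_reclaim(struct sock *sk)
	{
		int reclaimable;

		if (!sk_has_account(sk))
			return;

		reclaimable = sk->sk_forward_alloc -
			      sk_unused_reserved_mem(sk);

		/* Today: return (and uncharge from the memcg) anything
		 * above one page as soon as possible, keeping
		 * sk_forward_alloc tiny.
		 */
		if (reclaimable >= (int)PAGE_SIZE)
			__sk_mem_reclaim(sk, reclaimable);

		/* The proposal, as I understand it, is roughly:
		 *
		 *	if (reclaimable > SK_RECLAIM_THRESHOLD)
		 *		__sk_mem_reclaim(sk,
		 *			reclaimable - SK_RECLAIM_THRESHOLD);
		 *
		 * i.e. let each socket cache up to 64KB of forward
		 * allocation so skb-sized charges stay on the fast path
		 * instead of hitting try_charge_memcg() again.
		 */
	}

	static inline void sk_mem_uncharge(struct sock *sk, int size)
	{
		if (!sk_has_account(sk))
			return;
		sk->sk_forward_alloc += size;
		sk_mem_reclaim(sk);
	}

And caching up to 64KB per socket is exactly what brings back the OOM
concern above: with a large number of sockets in a memcg, that
cached-but-still-charged forward allocation adds up, which is why I
would rather see the memcg charging fast path made cheaper instead.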