On Tue, Jul 09, 2024 at 01:20:48PM GMT, Jesper Dangaard Brouer wrote: > Avoid lock contention on the global cgroup rstat lock caused by kswapd > starting on all NUMA nodes simultaneously. At Cloudflare, we observed > massive issues due to kswapd and the specific mem_cgroup_flush_stats() > call inlined in shrink_node, which takes the rstat lock. > > On our 12 NUMA node machines, each with a kswapd kthread per NUMA node, > we noted severe lock contention on the rstat lock. This contention > causes 12 CPUs to waste cycles spinning every time kswapd runs. > Fleet-wide stats (/proc/N/schedstat) for kthreads revealed that we are > burning an average of 20,000 CPU cores fleet-wide on kswapd, primarily > due to spinning on the rstat lock. > > Help reviewers follow code: __alloc_pages_slowpath calls wake_all_kswapds > causing all kswapdN threads to wake up simultaneously. The kswapd thread > invokes shrink_node (via balance_pgdat) triggering the cgroup rstat flush > operation as part of its work. This results in kernel self-induced rstat > lock contention by waking up all kswapd threads simultaneously. Leveraging > this detail: balance_pgdat() have NULL value in target_mem_cgroup, this > cause mem_cgroup_flush_stats() to do flush with root_mem_cgroup. > > To avoid this kind of thundering herd problem, kernel previously had a > "stats_flush_ongoing" concept, but this was removed as part of commit > 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). This patch > reintroduce and generalized the concept to apply to all users of cgroup > rstat, not just memcg. > > If there is an ongoing rstat flush, and current cgroup is a descendant, > then it is unnecessary to do the flush. For callers to still see updated > stats, wait for ongoing flusher to complete before returning, but add > timeout as stats are already inaccurate given updaters keeps running. > > Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing"). > Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx> > --- > V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.stgit@firesoul/ Does this version fixes the contention you are observing in production for v5?