On 3/19/25 17:19, Shakeel Butt wrote:
> A bit late, but let me still propose a session on topics related to memory
> cgroups. Last year at LSFMM 2024, we discussed [1] the potential
> deprecation of memcg v1. Since then we have made very good progress in that
> regard. We have moved the v1-only code into a separate file and made it not
> compile by default, have added warnings in many v1-only interfaces, and have
> removed a lot of v1-only code. This year, I want to focus on the performance
> of memory cgroups, particularly improving the cost of charging and stats.

I'd be very interested in the discussion; I am not there in person, FYI.

> At a high level we can partition memory charging into three cases. First
> is user memory (anon & file), second is kernel memory (mostly slub), and
> third is network memory. For network memory, [1] has described some of the
> challenges. Similarly for kernel memory, we had to revert patches where memcg
> charging was too expensive [3,4].
>
> I want to discuss and brainstorm different ways to further optimize
> memcg charging for all these types of memory. I am at the moment prototyping
> multi-memcg support for per-cpu memcg stocks and would like to see what else
> we can do.

What do you mean by multi-memcg support? Does it mean creating those buckets per CPU?

> One additional interesting observation from our fleet is that the cost of
> memory charging increases for the users of memory.low and memory.min.
> Basically, propagate_protected_usage() becomes very prominently visible in
> the perf traces.
>
> Other than charging, the memcg stats infra is also very expensive, and a lot
> of CPU time in our fleet is spent maintaining these stats. Memcg stats use
> the rstat infrastructure, which is designed for fast updates and slow readers.
> The updaters put the cgroup in a per-cpu update tree while the stats readers
> flush the update trees of all the CPUs.
> For memcg, the flushes have become very expensive, and over the years we
> have added ratelimiting to limit the cost. I want to discuss what else we
> can do to further improve the memcg stats.

Generally anything per-cpu scales well for writes, but summing up the stats is
very expensive. I personally think we might need to consider cases where the
limits we enforce allow a certain amount of delta; the watermarks in v2 are a
good step in that direction.

The one API I've struggled with in v2 is mem_cgroup_handle_over_high().
Ideally, I expected it to act as a soft limit that, when overrun and hitting
max, would cause OOM if needed.

Balbir
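To make the "multi-memcg per-cpu stocks" question concrete, here is a hedged sketch (a userspace C simulation with hypothetical names and constants — `STOCK_SLOTS`, `STOCK_BATCH`, `stock_charge()` are assumptions, not the kernel's actual code) of how a per-CPU stock could cache pre-charged pages for several memcgs at once, so tasks from different cgroups sharing a CPU can both hit the fast path instead of draining and refilling the stock on every switch:

```c
/*
 * Hedged sketch, NOT the kernel implementation: a per-CPU charge
 * "stock" with multiple slots, each caching pre-charged pages for a
 * different memcg. Slot count and batch size are made-up values.
 */
#include <assert.h>
#include <stddef.h>

#define STOCK_SLOTS 4            /* assumed number of cached memcgs per CPU */
#define STOCK_BATCH 64           /* assumed pages pre-charged per refill */

struct memcg { long usage; };    /* stand-in for struct mem_cgroup */

struct memcg_stock_slot {
    struct memcg *cached;        /* which memcg this slot belongs to */
    unsigned int nr_pages;       /* pre-charged pages still available */
};

struct memcg_stock {             /* one instance per CPU in the kernel */
    struct memcg_stock_slot slot[STOCK_SLOTS];
};

/* Slow path: charge straight against the counter (atomic in the kernel). */
static void charge_slow(struct memcg *memcg, unsigned int nr_pages)
{
    memcg->usage += nr_pages;
}

/*
 * Fast path: consume from a matching slot. On a miss, pick an empty
 * slot (or evict slot 0 as a degenerate policy), return any unused
 * pre-charge, and refill with one batched slow-path charge.
 */
static void stock_charge(struct memcg_stock *stock, struct memcg *memcg,
                         unsigned int nr_pages)
{
    int victim = 0;

    for (int i = 0; i < STOCK_SLOTS; i++) {
        struct memcg_stock_slot *s = &stock->slot[i];
        if (s->cached == memcg && s->nr_pages >= nr_pages) {
            s->nr_pages -= nr_pages;   /* hit: no atomic op on counter */
            return;
        }
    }
    for (int i = 0; i < STOCK_SLOTS; i++) {
        if (!stock->slot[i].cached) {  /* prefer an empty slot */
            victim = i;
            break;
        }
    }
    if (stock->slot[victim].cached && stock->slot[victim].nr_pages)
        stock->slot[victim].cached->usage -= stock->slot[victim].nr_pages;
    charge_slow(memcg, nr_pages + STOCK_BATCH);
    stock->slot[victim].cached = memcg;
    stock->slot[victim].nr_pages = STOCK_BATCH;
}
```

With a single-slot stock, two memcgs alternating on one CPU would thrash the cache; with multiple slots both keep their pre-charge, which is presumably the point of the multi-memcg prototype.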
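The "fast updates, slow readers" trade-off of rstat can also be sketched in a few lines. This is a hedged userspace simulation, not the kernel's rstat code: `stat_update()` stands in for bumping a per-cpu counter and marking the cgroup in the per-cpu updated tree, and `stat_flush()` stands in for the reader walking every dirty CPU to fold deltas into the global value — the walk is exactly the cost the email says has needed ratelimiting.

```c
/*
 * Hedged sketch, NOT the kernel's rstat: per-cpu pending deltas with a
 * dirty flag per CPU standing in for the per-cpu updated tree.
 */
#include <assert.h>

#define NR_CPUS 4               /* assumed CPU count for the simulation */

struct pcp_stat {
    long delta[NR_CPUS];        /* per-cpu pending deltas: cheap writes */
    int dirty[NR_CPUS];         /* stand-in for the per-cpu update tree */
    long value;                 /* flushed, globally visible total */
};

/* Fast path: purely CPU-local, no shared cachelines, no locks. */
static void stat_update(struct pcp_stat *s, int cpu, long d)
{
    s->delta[cpu] += d;
    s->dirty[cpu] = 1;
}

/* Slow path: the reader pays for walking every dirty CPU. */
static long stat_flush(struct pcp_stat *s)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!s->dirty[cpu])
            continue;
        s->value += s->delta[cpu];
        s->delta[cpu] = 0;
        s->dirty[cpu] = 0;
    }
    return s->value;
}
```

The watermark-style approach mentioned above amounts to tolerating some unflushed delta: if the enforced limit allows an error margin, readers can often return `value` without walking the CPUs at all.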