On 3/19/25 17:19, Shakeel Butt wrote:
> A bit late, but let me still propose a session on topics related to memory
> cgroups. Last year at LSFMM 2024, we discussed [1] the potential
> deprecation of memcg v1. Since then we have made very good progress in that
> regard. We have moved the v1-only code into a separate file and made it not
> compile by default, have added warnings in many v1-only interfaces, and have
> removed a lot of v1-only code. This year, I want to focus on the performance
> of memory cgroups, particularly improving the cost of charging and stats.

I'd be very interested in the discussion; I am not there in person, FYI.

> At a high level we can partition memory charging into three cases. First
> is user memory (anon & file), second is kernel memory (mostly slub), and
> third is network memory. For network memory, [1] has described some of the
> challenges. Similarly for kernel memory, we had to revert patches where memcg
> charging was too expensive [3,4].
>
> I want to discuss and brainstorm different ways to further optimize
> memcg charging for all these types of memory. I am at the moment prototyping
> multi-memcg support for per-cpu memcg stocks and would like to see what else
> we can do.

What do you mean by multi-memcg support? Does it mean creating those buckets per CPU?

> One additional interesting observation from our fleet is that the cost of
> memory charging increases for the users of memory.low and memory.min.
> Basically, propagate_protected_usage() becomes very prominently visible in
> the perf traces.
>
> Other than charging, the memcg stats infra is also very expensive, and a lot
> of CPU time in our fleet is spent maintaining these stats. Memcg stats use
> the rstat infrastructure, which is designed for fast updates and slow readers.
> The updaters put the cgroup in a per-cpu update tree while the stats readers
> flush the update trees of all the CPUs.
> For memcg, the flushes have become very expensive, and over the years we
> have added ratelimiting to limit the cost. I want to discuss what else we
> can do to further improve the memcg stats.

Generally anything per-cpu scales well for writes, but summing up the stats is
very expensive. I personally think we might need to consider cases where the
limits we enforce allow a certain amount of delta; the watermarks in v2 are a
good step in that direction.

The one API I've struggled with in v2 is mem_cgroup_handle_over_high().
Ideally, I expected it to act as a soft limit that, when overrun and hitting
max, would cause OOM if needed.

Balbir
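To make the "multi-memcg per-cpu stocks" question concrete, here is a hedged sketch (a userspace C simulation with hypothetical names and constants — `STOCK_SLOTS`, `STOCK_BATCH`, `stock_charge()` are assumptions, not the kernel's actual code) of how a per-CPU stock could cache pre-charged pages for several memcgs at once, so tasks from different cgroups sharing a CPU can both hit the fast path instead of draining and refilling the stock on every switch:

```c
/*
 * Hedged sketch, NOT the kernel implementation: a per-CPU charge
 * "stock" with multiple slots, each caching pre-charged pages for a
 * different memcg. Slot count and batch size are made-up values.
 */
#include <assert.h>
#include <stddef.h>

#define STOCK_SLOTS 4            /* assumed number of cached memcgs per CPU */
#define STOCK_BATCH 64           /* assumed pages pre-charged per refill */

struct memcg { long usage; };    /* stand-in for struct mem_cgroup */

struct memcg_stock_slot {
    struct memcg *cached;        /* which memcg this slot belongs to */
    unsigned int nr_pages;       /* pre-charged pages still available */
};

struct memcg_stock {             /* one instance per CPU in the kernel */
    struct memcg_stock_slot slot[STOCK_SLOTS];
};

/* Slow path: charge straight against the counter (atomic in the kernel). */
static void charge_slow(struct memcg *memcg, unsigned int nr_pages)
{
    memcg->usage += nr_pages;
}

/*
 * Fast path: consume from a matching slot. On a miss, pick an empty
 * slot (or evict slot 0 as a degenerate policy), return any unused
 * pre-charge, and refill with one batched slow-path charge.
 */
static void stock_charge(struct memcg_stock *stock, struct memcg *memcg,
                         unsigned int nr_pages)
{
    int victim = 0;

    for (int i = 0; i < STOCK_SLOTS; i++) {
        struct memcg_stock_slot *s = &stock->slot[i];
        if (s->cached == memcg && s->nr_pages >= nr_pages) {
            s->nr_pages -= nr_pages;   /* hit: no atomic op on counter */
            return;
        }
    }
    for (int i = 0; i < STOCK_SLOTS; i++) {
        if (!stock->slot[i].cached) {  /* prefer an empty slot */
            victim = i;
            break;
        }
    }
    if (stock->slot[victim].cached && stock->slot[victim].nr_pages)
        stock->slot[victim].cached->usage -= stock->slot[victim].nr_pages;
    charge_slow(memcg, nr_pages + STOCK_BATCH);
    stock->slot[victim].cached = memcg;
    stock->slot[victim].nr_pages = STOCK_BATCH;
}
```

With a single-slot stock, two memcgs alternating on one CPU would thrash the cache; with multiple slots both keep their pre-charge, which is presumably the point of the multi-memcg prototype.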
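The "fast updates, slow readers" trade-off of rstat can also be sketched in a few lines. This is a hedged userspace simulation, not the kernel's rstat code: `stat_update()` stands in for bumping a per-cpu counter and marking the cgroup in the per-cpu updated tree, and `stat_flush()` stands in for the reader walking every dirty CPU to fold deltas into the global value — the walk is exactly the cost the email says has needed ratelimiting.

```c
/*
 * Hedged sketch, NOT the kernel's rstat: per-cpu pending deltas with a
 * dirty flag per CPU standing in for the per-cpu updated tree.
 */
#include <assert.h>

#define NR_CPUS 4               /* assumed CPU count for the simulation */

struct pcp_stat {
    long delta[NR_CPUS];        /* per-cpu pending deltas: cheap writes */
    int dirty[NR_CPUS];         /* stand-in for the per-cpu update tree */
    long value;                 /* flushed, globally visible total */
};

/* Fast path: purely CPU-local, no shared cachelines, no locks. */
static void stat_update(struct pcp_stat *s, int cpu, long d)
{
    s->delta[cpu] += d;
    s->dirty[cpu] = 1;
}

/* Slow path: the reader pays for walking every dirty CPU. */
static long stat_flush(struct pcp_stat *s)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!s->dirty[cpu])
            continue;
        s->value += s->delta[cpu];
        s->delta[cpu] = 0;
        s->dirty[cpu] = 0;
    }
    return s->value;
}
```

The watermark-style approach mentioned above amounts to tolerating some unflushed delta: if the enforced limit allows an error margin, readers can often return `value` without walking the CPUs at all.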