On Fri, 9 Aug 2024, kaiyang2@xxxxxxxxxx wrote:

> From: Kaiyang Zhao <kaiyang2@xxxxxxxxxx>
>
> The ability to observe the demotion and promotion decisions made by the
> kernel on a per-cgroup basis is important for monitoring and tuning
> containerized workloads, whether on NUMA machines or on machines
> equipped with tiered memory.
>
> Different containers in the system may experience drastically different
> memory tiering actions that cannot be distinguished from the global
> counters alone.
>
> For example, a container running a workload with much hotter memory
> accesses will likely see more promotions and fewer demotions,
> potentially depriving a colocated container of top-tier memory to such
> an extent that its performance degrades unacceptably.
>
> As another example, some containers may exhibit longer periods between
> data reuse, causing many more numa_hint_faults than numa_pages_migrated.
> In this case, tuning hot_threshold_ms may be appropriate, but the signal
> can easily be lost if only global counters are available.
>
> This patch set adds five counters to memory.stat in a cgroup:
> numa_pages_migrated, numa_pte_updates, numa_hint_faults, pgdemote_kswapd
> and pgdemote_direct.
>
> count_memcg_events_mm() is added to count multiple event occurrences at
> once, and get_mem_cgroup_from_folio() is added because we need to get a
> reference to the memcg of a folio before it is migrated in order to
> track numa_pages_migrated. The accounting of PGDEMOTE_* is moved to
> shrink_inactive_list() before being made per-cgroup.
>
> Signed-off-by: Kaiyang Zhao <kaiyang2@xxxxxxxxxx>

Hi Kaiyang, have you considered per-memcg control over NUMA balancing
operations as well? Wondering if that's the direction you're heading in,
because it would be very useful to be able to control NUMA balancing at
memcg granularity on multi-tenant systems. I mentioned this at LSF/MM/BPF
this year.

If people believe this is out of scope for memcg, that would be good
feedback as well.
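
For what it's worth, here is a minimal userspace sketch of how a monitor
might consume the new per-cgroup counters and spot the "many hint faults,
few migrations" pattern the cover letter mentions. This assumes cgroup v2
mounted at /sys/fs/cgroup and takes the counter names from the cover
letter; it is only an illustration, not part of the patch set:

#!/usr/bin/env python3
# Sketch: sample the per-cgroup NUMA tiering counters from memory.stat
# and report deltas over an interval. Assumes cgroup v2 at /sys/fs/cgroup
# and a kernel carrying this patch set (counter names per the cover letter).
import sys
import time

COUNTERS = (
    "numa_pages_migrated",   # pages promoted by NUMA balancing
    "numa_pte_updates",      # PTEs updated to generate hint faults
    "numa_hint_faults",      # NUMA hint faults taken
    "pgdemote_kswapd",       # pages demoted by kswapd
    "pgdemote_direct",       # pages demoted by direct reclaim
)

def read_counters(cgroup_path):
    """Parse memory.stat and return only the tiering counters of interest."""
    stats = {}
    with open(f"{cgroup_path}/memory.stat") as f:
        for line in f:
            key, _, value = line.partition(" ")
            if key in COUNTERS:
                stats[key] = int(value)
    return stats

def main():
    cgroup = sys.argv[1] if len(sys.argv) > 1 else "/sys/fs/cgroup"
    interval = float(sys.argv[2]) if len(sys.argv) > 2 else 10.0

    before = read_counters(cgroup)
    time.sleep(interval)
    after = read_counters(cgroup)

    print(f"{cgroup} (delta over {interval:.0f}s):")
    for name in COUNTERS:
        delta = after.get(name, 0) - before.get(name, 0)
        print(f"  {name:>20}: {delta}")

    # Many hint faults with few migrations can suggest reuse happens on a
    # longer timescale than the current hot threshold captures.
    faults = after.get("numa_hint_faults", 0) - before.get("numa_hint_faults", 0)
    migrated = after.get("numa_pages_migrated", 0) - before.get("numa_pages_migrated", 0)
    if faults:
        print(f"  migrated/faults ratio: {migrated / faults:.2f}")

if __name__ == "__main__":
    main()

Run per container cgroup (e.g. against each leaf under the container
runtime's slice) to compare tiering behavior across tenants, which is
exactly the distinction the global counters cannot make.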