Re: [PATCH] mm,memcg: provide per-cgroup counters for NUMA balancing operations

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Fri, 9 Aug 2024 17:28:30 -0700

On Fri,  9 Aug 2024 21:21:15 +0000 kaiyang2@xxxxxxxxxx wrote:

> From: Kaiyang Zhao <kaiyang2@xxxxxxxxxx>
> 
> The ability to observe the demotion and promotion decisions made by the
> kernel on a per-cgroup basis is important for monitoring and tuning
> containerized workloads on either NUMA machines or machines
> equipped with tiered memory.
> 
> Different containers in the system may experience drastically different
> memory tiering actions that cannot be distinguished from the global
> counters alone.
> 
> For example, a container running a workload with much hotter memory
> accesses will likely see more promotions and fewer demotions,
> potentially depriving a colocated container of top tier memory to such
> an extent that its performance degrades unacceptably.
> 
> As another example, some containers may exhibit longer periods between
> data reuse, causing many more numa_hint_faults than numa_pages_migrated.
> In this case, tuning hot_threshold_ms may be appropriate, but the signal
> can easily be lost if only global counters are available.
> 
> This patch adds five counters to a cgroup's memory.stat:
> numa_pages_migrated, numa_pte_updates, numa_hint_faults, pgdemote_kswapd
> and pgdemote_direct.
> 
> count_memcg_events_mm() is added to count multiple event occurrences at
> once, and get_mem_cgroup_from_folio() is added because we need to get a
> reference to the memcg of a folio before it's migrated to track
> numa_pages_migrated. The accounting of PGDEMOTE_* is moved to
> shrink_inactive_list() before being changed to per-cgroup.
> 

Thanks.  I lack the operational experience to be able to judge the
usefulness of this - hopefully others can weigh in.

Meanwhile, the patch is simple enough - I'll queue it up for testing.
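
For anyone who wants to try the counters from userspace, here is a minimal
sketch of how the five fields named in the patch description might be pulled
out of a cgroup's memory.stat and how the numa_hint_faults to
numa_pages_migrated comparison mentioned above could be computed. The helper
names (parse_numa_counters, hint_fault_to_migration_ratio) and the sample
values are hypothetical and exist only for illustration. The sketch assumes
the usual "name value" per-line layout of memory.stat.

```python
# Counter names come from the patch description; everything else here
# (helper names, sample values) is a hypothetical illustration.
NUMA_COUNTERS = (
    "numa_pages_migrated",
    "numa_pte_updates",
    "numa_hint_faults",
    "pgdemote_kswapd",
    "pgdemote_direct",
)


def parse_numa_counters(memory_stat_text):
    """Return {name: value} for the NUMA balancing counters found in
    memory.stat-style text ("name value" on each line)."""
    found = {}
    for line in memory_stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in NUMA_COUNTERS:
            found[fields[0]] = int(fields[1])
    return found


def hint_fault_to_migration_ratio(counters):
    """Hint faults per migrated page, or None if nothing was migrated.
    A high value is the kind of per-cgroup signal the description says
    may suggest looking at hot_threshold_ms."""
    faults = counters.get("numa_hint_faults", 0)
    migrated = counters.get("numa_pages_migrated", 0)
    if migrated == 0:
        return None
    return faults / migrated


# Made-up sample in memory.stat layout; on a real system the text would
# be read from the cgroup's memory.stat file instead.
SAMPLE = """anon 1048576
numa_pages_migrated 250
numa_pte_updates 4000
numa_hint_faults 1000
pgdemote_kswapd 12
pgdemote_direct 3
"""

counters = parse_numa_counters(SAMPLE)
ratio = hint_fault_to_migration_ratio(counters)
```

On a cgroup v2 system the input would typically come from a file such as
/sys/fs/cgroup/<group>/memory.stat, where <group> is a placeholder for the
cgroup of interest. Comparing the ratio across containers, rather than
reading it in isolation, is what the per-cgroup counters make possible.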