Re: [PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg

Shakeel Butt <shakeel.butt@xxxxxxxxx> · Wed, 23 Oct 2024 15:17:34 -0700

On Wed, Oct 23, 2024 at 01:34:33PM GMT, Joshua Hahn wrote:
> Changelog
> v2:
>   * Enables the feature only if memcg accounts for hugeTLB usage
>   * Moves the counter from memcg_stat_item to node_stat_item
>   * Expands on motivation & justification in commitlog
>   * Added Suggested-by: Nhat Pham
> 
> This patch introduces a new counter to memory.stat that tracks hugeTLB
> usage, only if hugeTLB accounting is done to memory.current. This
> feature is enabled the same way hugeTLB accounting is enabled, via
> the memory_hugetlb_accounting mount flag for cgroupsv2.
> 
> 1. Why is this patch necessary?
> Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
> hugeTLB usage to memory.current. However, the metric is not reported in
> memory.stat. Given that users often interpret memory.stat as a breakdown
> of the value reported in memory.current, the disparity between the two
> reports can be confusing. This patch solves this problem by including
> the metric in memory.stat as well, but only if it is also reported in
> memory.current (it would also be confusing if the value was reported in
> memory.stat, but not in memory.current)
> 
> Aside from the consistency between the two files, we also see benefits
> in observability. Userspace might be interested in the hugeTLB footprint
> of cgroups for many reasons. For instance, system admins might want to
> verify that hugeTLB usage is distributed as expected across tasks: i.e.
> memory-intensive tasks are using more hugeTLB pages than tasks that
> don't consume a lot of memory, or are seen to fault frequently. Note that
> this is separate from wanting to inspect the distribution for limiting
> purposes (in which case, hugeTLB controller makes more sense).
> 
> 2. We already have a hugeTLB controller. Why not use that?
> It is true that hugeTLB tracks the exact value that we want. In fact, by
> enabling the hugeTLB controller, we get all of the observability
> benefits that I mentioned above, and users can check the total hugeTLB
> usage, verify if it is distributed as expected, etc.
> 
> With this said, there are 2 problems:
>   (a) They are still not reported in memory.stat, which means the
>       disparity between the memcg reports is still there.
>   (b) We cannot reasonably expect users to enable the hugeTLB controller
>       just for the sake of hugeTLB usage reporting, especially since
>       they don't have any use for hugeTLB usage enforcing [2].
> 
> [1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@xxxxxxxxx/
> [2] Of course, we can't make a new patch for every feature that can be
>     duplicated. However, since the existing solution of enabling the
>     hugeTLB controller is an imperfect solution that still leaves a
>     discrepancy between memory.stat and memory.current, I think that it
>     is reasonable to isolate the feature in this case.
> 
> Suggested-by: Nhat Pham <nphamcs@xxxxxxxxx>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@xxxxxxxxx>
> 
> ---
>  include/linux/mmzone.h |  3 +++
>  mm/hugetlb.c           |  4 ++++
>  mm/memcontrol.c        | 11 +++++++++++
>  mm/vmstat.c            |  3 +++
>  4 files changed, 21 insertions(+)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 17506e4a2835..d3ba49a974b2 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -215,6 +215,9 @@ enum node_stat_item {
>  #ifdef CONFIG_NUMA_BALANCING
>  	PGPROMOTE_SUCCESS,	/* promote successfully */
>  	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
> +#endif
> +#ifdef CONFIG_HUGETLB_PAGE
> +	HUGETLB_B,

As Yosry pointed out, this is in pages, not bytes. There is already
functionality to display this in bytes for the readers of the memory
stats.

Also you will need to update Documentation/admin-guide/cgroup-v2.rst to
include the hugetlb stats.
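
To illustrate the pages-versus-bytes point, here is a minimal userspace
sketch, not kernel code. The names toy_stat_output_bytes and TOY_PAGE_SIZE
are hypothetical, and the 4 KiB base page size is an assumption. The idea
is only that the counter is kept in pages and the reader side multiplies
by the page size when printing.

```c
#include <stdint.h>

/* Assumption for illustration: 4 KiB base pages. */
#define TOY_PAGE_SIZE 4096ULL

/*
 * Hypothetical reader-side helper: the counter is stored in pages and
 * converted to bytes only when it is reported.
 */
static uint64_t toy_stat_output_bytes(uint64_t nr_pages)
{
	return nr_pages * TOY_PAGE_SIZE;
}
```

With these assumptions, one 2 MiB huge page is accounted as 512 base
pages, so the counter itself stays in pages and the conversion happens
only at read time.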

>  #endif
>  	/* PGDEMOTE_*: pages demoted */
>  	PGDEMOTE_KSWAPD,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 190fa05635f4..055bc91858e4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1925,6 +1925,8 @@ void free_huge_folio(struct folio *folio)
>  				     pages_per_huge_page(h), folio);
>  	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>  					  pages_per_huge_page(h), folio);
> +	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> +		lruvec_stat_mod_folio(folio, HUGETLB_B, -pages_per_huge_page(h));

Please note that you are adding this stat not only to memcg but also
to the global and per-node vmstat. This check will break those interfaces
when this mount option is not used. You only need the check at
charging time. The uncharging and stats update functions will do the
right thing, as they check the memcg_data attached to the folio.
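
A simplified userspace model of what I mean, as a sketch only. All toy_*
names are hypothetical and none of this is the real kernel code. The mount
option is consulted once, when deciding whether to charge the folio; the
stat helper always updates the node counter and touches the memcg counter
only if the folio carries a memcg.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; these are not the kernel's real types. */
struct toy_memcg { long hugetlb_pages; };
struct toy_node  { long hugetlb_pages; };
struct toy_folio { struct toy_memcg *memcg; long nr_pages; };

/* Stands in for the hugetlb accounting mount option. */
static bool toy_accounting_enabled;

/*
 * Stat helper: the node-level counter is always updated, the memcg
 * counter only when the folio was charged to a memcg.
 */
static void toy_stat_mod_folio(struct toy_node *node,
			       struct toy_folio *folio, long delta)
{
	node->hugetlb_pages += delta;
	if (folio->memcg)
		folio->memcg->hugetlb_pages += delta;
}

/* Allocation path: the only place the option is checked. */
static void toy_alloc(struct toy_node *node, struct toy_memcg *memcg,
		      struct toy_folio *folio, long nr_pages)
{
	folio->nr_pages = nr_pages;
	folio->memcg = toy_accounting_enabled ? memcg : NULL;
	toy_stat_mod_folio(node, folio, nr_pages);
}

/* Free path: no option check, the folio state decides. */
static void toy_free(struct toy_node *node, struct toy_folio *folio)
{
	toy_stat_mod_folio(node, folio, -folio->nr_pages);
	folio->memcg = NULL;
	assert(node->hugetlb_pages >= 0);
}
```

In this model the node counter stays correct whether or not the option
is set, and the memcg counter moves only for folios that were charged.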

>  	mem_cgroup_uncharge(folio);
>  	if (restore_reserve)
>  		h->resv_huge_pages++;
> @@ -3094,6 +3096,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	if (!memcg_charge_ret)
>  		mem_cgroup_commit_charge(folio, memcg);
>  	mem_cgroup_put(memcg);
> +	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)

Same here.

> +		lruvec_stat_mod_folio(folio, HUGETLB_B, pages_per_huge_page(h));
>  
>  	return folio;