[PATCH v2 1/1] memcg/hugetlb: Adding hugeTLB counters to memcg

Changelog
v2:
  * Enables the feature only if memcg accounts for hugeTLB usage
  * Moves the counter from memcg_stat_item to node_stat_item
  * Expands on motivation & justification in commitlog
  * Adds Suggested-by: Nhat Pham

This patch introduces a new counter in memory.stat that tracks hugeTLB
usage, but only when hugeTLB usage is also accounted in memory.current.
The counter is enabled the same way hugeTLB accounting is enabled: via
the memory_hugetlb_accounting mount option for cgroup v2.

1. Why is this patch necessary?
Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
hugeTLB usage to memory.current. However, the metric is not reported in
memory.stat. Given that users often interpret memory.stat as a breakdown
of the value reported in memory.current, the disparity between the two
reports can be confusing. This patch solves this problem by including
the metric in memory.stat as well, but only if it is also reported in
memory.current (it would be just as confusing if the value were reported
in memory.stat but not in memory.current).
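
For illustration only, here is a minimal userspace sketch of how a
monitoring tool might consume the new entry. It assumes the key is
spelled "hugeTLB" as in this patch (the name could change in later
revisions) and that the value follows the "name value" line format of
the other memory.stat entries. The helper names stat_lookup() and
read_stat_file() are hypothetical and not part of the patch. A missing
key is treated as "hugeTLB accounting is not enabled", which matches
the gating described above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one "name value" line in memory.stat-style text.
 * Returns 1 and stores the value in *out if the key is present,
 * 0 if it is absent (e.g. hugeTLB accounting is not enabled).
 */
int stat_lookup(const char *text, const char *key, unsigned long long *out)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		/* Match the whole key, followed by the separating space. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ') {
			*out = strtoull(line + klen + 1, NULL, 10);
			return 1;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return 0;
}

/* Read a stat file into buf (NUL-terminated); 0 on success, -1 on error. */
int read_stat_file(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");
	size_t n;

	if (!f)
		return -1;
	n = fread(buf, 1, len - 1, f);
	buf[n] = '\0';
	fclose(f);
	return 0;
}
```

A caller would read a cgroup's memory.stat into a buffer with
read_stat_file() (the exact path depends on where cgroup v2 is mounted)
and then call stat_lookup(buf, "hugeTLB", &val).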

Aside from the consistency between the two files, we also see benefits
in observability. Userspace might be interested in the hugeTLB footprint
of cgroups for many reasons. For instance, system admins might want to
verify that hugeTLB usage is distributed as expected across tasks, i.e.
that tasks which are memory-intensive or are seen to fault frequently
use more hugeTLB pages than tasks that don't consume a lot of memory.
Note that this is separate from wanting to inspect the distribution for
limiting purposes (in which case the hugeTLB controller makes more
sense).

2. We already have a hugeTLB controller. Why not use that?
It is true that the hugeTLB controller tracks the exact value that we
want. In fact, by enabling the hugeTLB controller, we get all of the
observability benefits that I mentioned above: users can check the total
hugeTLB usage, verify whether it is distributed as expected, etc.

That said, there are two problems:
  (a) The value is still not reported in memory.stat, which means the
      disparity between the memcg reports is still there.
  (b) We cannot reasonably expect users to enable the hugeTLB controller
      just for the sake of hugeTLB usage reporting, especially when
      they have no use for hugeTLB usage enforcement [2].

[1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@xxxxxxxxx/
[2] Of course, we can't make a new patch for every feature that can be
    duplicated. However, since the existing solution of enabling the
    hugeTLB controller is an imperfect one that still leaves a
    discrepancy between memory.stat and memory.current, I think that it
    is reasonable to isolate the feature in this case.

Suggested-by: Nhat Pham <nphamcs@xxxxxxxxx>
Signed-off-by: Joshua Hahn <joshua.hahnjy@xxxxxxxxx>

---
 include/linux/mmzone.h |  3 +++
 mm/hugetlb.c           |  4 ++++
 mm/memcontrol.c        | 11 +++++++++++
 mm/vmstat.c            |  3 +++
 4 files changed, 21 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 17506e4a2835..d3ba49a974b2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -215,6 +215,9 @@ enum node_stat_item {
 #ifdef CONFIG_NUMA_BALANCING
 	PGPROMOTE_SUCCESS,	/* promote successfully */
 	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
+#endif
+#ifdef CONFIG_HUGETLB_PAGE
+	HUGETLB_B,
 #endif
 	/* PGDEMOTE_*: pages demoted */
 	PGDEMOTE_KSWAPD,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 190fa05635f4..055bc91858e4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1925,6 +1925,8 @@ void free_huge_folio(struct folio *folio)
 				     pages_per_huge_page(h), folio);
 	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
 					  pages_per_huge_page(h), folio);
+	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
+		lruvec_stat_mod_folio(folio, HUGETLB_B, -pages_per_huge_page(h));
 	mem_cgroup_uncharge(folio);
 	if (restore_reserve)
 		h->resv_huge_pages++;
@@ -3094,6 +3096,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	if (!memcg_charge_ret)
 		mem_cgroup_commit_charge(folio, memcg);
 	mem_cgroup_put(memcg);
+	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
+		lruvec_stat_mod_folio(folio, HUGETLB_B, pages_per_huge_page(h));
 
 	return folio;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7845c64a2c57..de5899eb8203 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -306,6 +306,9 @@ static const unsigned int memcg_node_stat_items[] = {
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	PGPROMOTE_SUCCESS,
+#endif
+#ifdef CONFIG_HUGETLB_PAGE
+	HUGETLB_B,
 #endif
 	PGDEMOTE_KSWAPD,
 	PGDEMOTE_DIRECT,
@@ -1327,6 +1330,9 @@ static const struct memory_stat memory_stats[] = {
 #ifdef CONFIG_ZSWAP
 	{ "zswap",			MEMCG_ZSWAP_B			},
 	{ "zswapped",			MEMCG_ZSWAPPED			},
+#endif
+#ifdef CONFIG_HUGETLB_PAGE
+	{ "hugeTLB",			HUGETLB_B			},
 #endif
 	{ "file_mapped",		NR_FILE_MAPPED			},
 	{ "file_dirty",			NR_FILE_DIRTY			},
@@ -1441,6 +1447,11 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
 
+#ifdef CONFIG_HUGETLB_PAGE
+		if (unlikely(memory_stats[i].idx == HUGETLB_B) &&
+				!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
+			continue;
+#endif
 		size = memcg_page_state_output(memcg, memory_stats[i].idx);
 		seq_buf_printf(s, "%s %llu\n", memory_stats[i].name, size);
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b5a4cea423e1..466c40cffeb0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1269,6 +1269,9 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_NUMA_BALANCING
 	"pgpromote_success",
 	"pgpromote_candidate",
+#endif
+#ifdef CONFIG_HUGETLB_PAGE
+	"hugeTLB",
 #endif
 	"pgdemote_kswapd",
 	"pgdemote_direct",
-- 
2.43.5




