[PATCH v1] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems

Mina Almasry <almasrymina@xxxxxxxxxx> · Fri, 2 Dec 2022 17:11:19 -0800

commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim"") enabled demotion in memcg reclaim, which is the right thing
to do, however, I suspect it introduced a regression in the behavior of
try_to_free_mem_cgroup_pages().

The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage
of the cgroup should reduce by nr_pages. The callers expect
try_to_free_mem_cgroup_pages() to also return the number of pages
reclaimed, not demoted.

However, what try_to_free_mem_cgroup_pages() actually does is it
unconditionally counts demoted pages as reclaimed pages. So in practice
when it is called it will often demote nr_pages and return the number of
demoted pages to the caller. Demoted pages don't lower the memcg usage,
and so I think try_to_free_mem_cgroup_pages() is not actually doing what
the callers want it to do.

I suspect various things work suboptimally on memory systems or don't
work at all due to this:

- memory.high enforcement likely doesn't work (it just demotes nr_pages
  instead of lowering the memcg usage by nr_pages).
- try_charge_memcg() will keep retrying the charge while
  try_to_free_mem_cgroup_pages() is just demoting pages and not actually
  making any room for the charge.
- memory.reclaim has a wonky interface. It advertises to the user it
  reclaims the provided amount but it will actually demote that amount.

There may be more effects to this issue.

To fix these issues I propose shrink_folio_list() to only count pages
demoted from inside of sc->nodemask to outside of sc->nodemask as
'reclaimed'.

For callers such as reclaim_high() or try_charge_memcg() that set
sc->nodemask to NULL, try_to_free_mem_cgroup_pages() will try to
actually reclaim nr_pages and return the number of pages reclaimed. No
demoted pages would count towards the nr_pages requirement.

For callers such as memory_reclaim() that set sc->nodemask,
try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask
with either reclaim or demotion.

Tested this change using memory.reclaim interface. With this change,

	echo "1m" > memory.reclaim

Will cause freeing of 1m of memory from the cgroup regardless of the
demotions happening inside.

	echo "1m nodes=0" > memory.reclaim

Will cause freeing of 1m of node 0 by demotion if a demotion target is
available, and by reclaim if no demotion target is available.

Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx>

---

This is developed on top of mm-unstable largely because I need the
memory.reclaim nodes= arg to test it properly.
---
 mm/vmscan.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b42ac9ad755..8f6e993b870d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1653,6 +1653,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	LIST_HEAD(free_folios);
 	LIST_HEAD(demote_folios);
 	unsigned int nr_reclaimed = 0;
+	unsigned int nr_demoted = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
 	struct swap_iocb *plug = NULL;
@@ -2085,7 +2086,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */

 	/* Migrate folios selected for demotion */
-	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
+	nr_demoted = demote_folio_list(&demote_folios, pgdat);
+
+	/*
+	 * Only count demoted folios as reclaimed if we demoted them from
+	 * inside of the nodemask to outside of the nodemask, hence reclaiming
+	 * pages in the nodemask.
+	 */
+	if (sc->nodemask && node_isset(pgdat->node_id, *sc->nodemask) &&
+	    !node_isset(next_demotion_node(pgdat->node_id), *sc->nodemask))
+		nr_reclaimed += nr_demoted;
+
 	/* Folios that could not be demoted are still in @demote_folios */
 	if (!list_empty(&demote_folios)) {
 		/* Folios which weren't demoted go back on @folio_list */
--
2.39.0.rc0.267.gcb52ba06e7-goog