This is a note to let you know that I've just added the patch titled

    mm/mglru: try to stop at high watermarks

to the 6.6-stable tree which can be found at:

	http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     mm-mglru-try-to-stop-at-high-watermarks.patch
and it can be found in the queue-6.6 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.


>From 5095a2b23987d3c3c47dd16b3d4080e2733b8bb9 Mon Sep 17 00:00:00 2001
From: Yu Zhao <yuzhao@xxxxxxxxxx>
Date: Thu, 7 Dec 2023 23:14:05 -0700
Subject: mm/mglru: try to stop at high watermarks

From: Yu Zhao <yuzhao@xxxxxxxxxx>

commit 5095a2b23987d3c3c47dd16b3d4080e2733b8bb9 upstream.

The initial MGLRU patchset didn't include the memcg LRU support, and it
relied on should_abort_scan(), added by commit f76c83378851 ("mm:
multi-gen LRU: optimize multiple memcgs"), to "backoff to avoid
overshooting their aggregate reclaim target by too much".

Later on when the memcg LRU was added, should_abort_scan() was deemed
unnecessary, and the test results [1] showed no side effects after it
was removed by commit a579086c99ed ("mm: multi-gen LRU: remove eviction
fairness safeguard").

However, that test used memory.reclaim, which sets nr_to_reclaim to
SWAP_CLUSTER_MAX.  So it can overshoot only by SWAP_CLUSTER_MAX-1 pages,
i.e., from nr_reclaimed=nr_to_reclaim-1 to
nr_reclaimed=nr_to_reclaim+SWAP_CLUSTER_MAX-1.  Compared with the batch
size kswapd sets to nr_to_reclaim, SWAP_CLUSTER_MAX is tiny.  Therefore
that test isn't able to reproduce the worst case scenario, i.e., kswapd
overshooting GBs on large systems and "consuming 100% CPU" (see the
Closes tag).

Bring back a simplified version of should_abort_scan() on top of the
memcg LRU, so that kswapd stops when all eligible zones are above their
respective high watermarks plus a small delta to lower the chance of
KSWAPD_HIGH_WMARK_HIT_QUICKLY.  Note that this only applies to order-0
reclaim, meaning compaction-induced reclaim can still run wild (which
is a different problem).

On Android, launching 55 apps sequentially:

             Before     After      Change
  pgpgin     838377172  802955040  -4%
  pgpgout    38037080   34336300   -10%

[1] https://lore.kernel.org/20221222041905.2431096-1-yuzhao@xxxxxxxxxx/

Link: https://lkml.kernel.org/r/20231208061407.2125867-2-yuzhao@xxxxxxxxxx
Fixes: a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard")
Signed-off-by: Yu Zhao <yuzhao@xxxxxxxxxx>
Reported-by: Charan Teja Kalla <quic_charante@xxxxxxxxxxx>
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@xxxxxxxxxxxx>
Closes: https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@xxxxxxxxxxxxxx/
Tested-by: Jaroslav Pulchart <jaroslav.pulchart@xxxxxxxxxxxx>
Tested-by: Kalesh Singh <kaleshsingh@xxxxxxxxxx>
Cc: Hillf Danton <hdanton@xxxxxxxx>
Cc: Kairui Song <ryncsn@xxxxxxxxx>
Cc: T.J. Mercier <tjmercier@xxxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
---
 mm/vmscan.c |   36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5341,20 +5341,41 @@ static long get_nr_to_scan(struct lruvec
 	return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0;
 }
 
-static unsigned long get_nr_to_reclaim(struct scan_control *sc)
+static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
 {
+	int i;
+	enum zone_watermarks mark;
+
 	/* don't abort memcg reclaim to ensure fairness */
 	if (!root_reclaim(sc))
-		return -1;
+		return false;
+
+	if (sc->nr_reclaimed >= max(sc->nr_to_reclaim, compact_gap(sc->order)))
+		return true;
+
+	/* check the order to exclude compaction-induced reclaim */
+	if (!current_is_kswapd() || sc->order)
+		return false;
+
+	mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
+	       WMARK_PROMO : WMARK_HIGH;
+
+	for (i = 0; i <= sc->reclaim_idx; i++) {
+		struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
+		unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH;
+
+		if (managed_zone(zone) && !zone_watermark_ok(zone, 0, size, sc->reclaim_idx, 0))
+			return false;
+	}
 
-	return max(sc->nr_to_reclaim, compact_gap(sc->order));
+	/* kswapd should abort if all eligible zones are safe */
+	return true;
 }
 
 static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	long nr_to_scan;
 	unsigned long scanned = 0;
-	unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
 	int swappiness = get_swappiness(lruvec, sc);
 
 	/* clean file folios are more likely to exist */
@@ -5376,7 +5397,7 @@ static bool try_to_shrink_lruvec(struct
 		if (scanned >= nr_to_scan)
 			break;
 
-		if (sc->nr_reclaimed >= nr_to_reclaim)
+		if (should_abort_scan(lruvec, sc))
 			break;
 
 		cond_resched();
@@ -5437,7 +5458,6 @@ static void shrink_many(struct pglist_da
 	struct lru_gen_folio *lrugen;
 	struct mem_cgroup *memcg;
 	const struct hlist_nulls_node *pos;
-	unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
 
 	bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
 restart:
@@ -5470,7 +5490,7 @@ restart:
 
 		rcu_read_lock();
 
-		if (sc->nr_reclaimed >= nr_to_reclaim)
+		if (should_abort_scan(lruvec, sc))
 			break;
 	}
 
@@ -5481,7 +5501,7 @@ restart:
 
 	mem_cgroup_put(memcg);
 
-	if (sc->nr_reclaimed >= nr_to_reclaim)
+	if (!is_a_nulls(pos))
 		return;
 
 	/* restart if raced with lru_gen_rotate_memcg() */


Patches currently in stable-queue which might be from yuzhao@xxxxxxxxxx are

queue-6.6/mm-mglru-fix-underprotected-page-cache.patch
queue-6.6/mm-mglru-respect-min_ttl_ms-with-memcgs.patch
queue-6.6/mm-mglru-try-to-stop-at-high-watermarks.patch
queue-6.6/mm-mglru-reclaim-offlined-memcgs-harder.patch
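
For quick review, the heart of the change is the per-zone loop in
should_abort_scan(): order-0 kswapd reclaim stops only once every
eligible zone sits above its high watermark plus a small delta
(MIN_LRU_BATCH).  The sketch below models just that abort condition in
user space; struct zone_model, watermark_ok() and
all_eligible_zones_safe() are simplified stand-ins for the kernel's
struct zone, zone_watermark_ok() and should_abort_scan(), and the
MIN_LRU_BATCH value here is a placeholder rather than the kernel's
definition:

#include <stdbool.h>
#include <stdio.h>

#define MIN_LRU_BATCH 64	/* placeholder delta, not the kernel value */

struct zone_model {
	bool managed;			/* zone has managed pages */
	unsigned long free_pages;	/* current free pages in the zone */
	unsigned long high_wmark;	/* high watermark, in pages */
};

/* simplified stand-in for zone_watermark_ok(): free pages above mark? */
static bool watermark_ok(const struct zone_model *zone, unsigned long mark)
{
	return zone->free_pages > mark;
}

/*
 * Mirrors the loop in should_abort_scan(): kswapd may stop once every
 * eligible zone (index 0..reclaim_idx) clears its high watermark plus a
 * small delta, lowering the chance of hitting the watermark again right
 * away (KSWAPD_HIGH_WMARK_HIT_QUICKLY).
 */
static bool all_eligible_zones_safe(const struct zone_model *zones,
				    int reclaim_idx)
{
	for (int i = 0; i <= reclaim_idx; i++) {
		unsigned long size = zones[i].high_wmark + MIN_LRU_BATCH;

		if (zones[i].managed && !watermark_ok(&zones[i], size))
			return false;	/* a zone still needs reclaim */
	}
	return true;	/* all eligible zones are safe; abort the scan */
}

int main(void)
{
	struct zone_model zones[] = {
		{ .managed = true, .free_pages = 5000, .high_wmark = 4096 },
		{ .managed = true, .free_pages = 3000, .high_wmark = 4096 },
	};

	/* zone 1 is below high wmark + delta, so reclaim continues */
	printf("abort scan: %d\n", all_eligible_zones_safe(zones, 1));
	return 0;
}

Raising zone 1's free_pages above 4160 (high watermark plus delta) flips
the result to 1, i.e. kswapd would stop instead of overshooting.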