On Fri, Jun 23, 2023 at 5:00 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
>
> The quilt patch titled
>      Subject: mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
> has been removed from the -mm tree.  Its filename was
>      mm-vmscan-fix-root-proactive-reclaim-unthrottling-unbalanced-node.patch
>
> This patch was dropped because it was merged into the mm-stable branch
> of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

I was really hoping someone would take a look at this, ideally Johannes
because he reported the problem initially.

>
> ------------------------------------------------------
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Subject: mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
> Date: Wed, 21 Jun 2023 02:31:01 +0000
>
> When memory.reclaim was introduced, it became the first case where
> cgroup_reclaim() is true for the root cgroup.  Johannes concluded [1] that
> for most cases this is okay, except for one case.  Historically, kswapd
> would throttle reclaim on a node if a lot of pages marked for reclaim are
> under writeback (aka the node is congested).  This occurred by setting the
> LRUVEC_CONGESTED bit in lruvec->flags.  The bit would be cleared when the
> node is balanced.
>
> Similarly, cgroup reclaim would set the same bit when an lruvec is
> congested, and clear it on the way out of reclaim (to throttle local
> reclaimers).
>
> Before the introduction of memory.reclaim, the root memcg was the only
> target of kswapd reclaim, and non-root memcgs were the only targets of
> cgroup reclaim, so they would never interfere.  Using the same bit for
> both was fine.  After memory.reclaim, it is possible for cgroup reclaim on
> the root cgroup to clear the bit set by kswapd.  This would result in
> reclaim on the node being unthrottled before the node is balanced.
>
> Fix this by introducing separate bits for cgroup-level and node-level
> congestion.  kswapd can unthrottle an lruvec that is marked as congested
> by cgroup reclaim (as the entire node should no longer be congested), but
> not vice versa (to prevent premature unthrottling before the entire node
> is balanced).
>
> [1] https://lore.kernel.org/lkml/20230405200150.GA35884@xxxxxxxxxxx/
>
> Link: https://lkml.kernel.org/r/20230621023101.432780-1-yosryahmed@xxxxxxxxxx
> Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Reported-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Closes: https://lore.kernel.org/lkml/20230405200150.GA35884@xxxxxxxxxxx/
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: Roman Gushchin <roman.gushchin@xxxxxxxxx>
> Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
> Cc: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> Cc: Yu Zhao <yuzhao@xxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> ---
>
>  include/linux/mmzone.h |   18 +++++++++++++++---
>  mm/vmscan.c            |   19 ++++++++++++-------
>  2 files changed, 27 insertions(+), 10 deletions(-)
>
> --- a/include/linux/mmzone.h~mm-vmscan-fix-root-proactive-reclaim-unthrottling-unbalanced-node
> +++ a/include/linux/mmzone.h
> @@ -293,9 +293,21 @@ static inline bool is_active_lru(enum lr
>  #define ANON_AND_FILE 2
>
>  enum lruvec_flags {
> -        LRUVEC_CONGESTED,               /* lruvec has many dirty pages
> -                                         * backed by a congested BDI
> -                                         */
> +        /*
> +         * An lruvec has many dirty pages backed by a congested BDI:
> +         * 1. LRUVEC_CGROUP_CONGESTED is set by cgroup-level reclaim.
> +         *    It can be cleared by cgroup reclaim or kswapd.
> +         * 2. LRUVEC_NODE_CONGESTED is set by kswapd node-level reclaim.
> +         *    It can only be cleared by kswapd.
> +         *
> +         * Essentially, kswapd can unthrottle an lruvec throttled by cgroup
> +         * reclaim, but not vice versa.  This only applies to the root cgroup.
> +         * The goal is to prevent cgroup reclaim on the root cgroup (e.g.
> +         * memory.reclaim) to unthrottle an unbalanced node (that was throttled
> +         * by kswapd).
> +         */
> +        LRUVEC_CGROUP_CONGESTED,
> +        LRUVEC_NODE_CONGESTED,
>  };
>
>  #endif /* !__GENERATING_BOUNDS_H */
> --- a/mm/vmscan.c~mm-vmscan-fix-root-proactive-reclaim-unthrottling-unbalanced-node
> +++ a/mm/vmscan.c
> @@ -6578,10 +6578,13 @@ again:
>           * Legacy memcg will stall in page writeback so avoid forcibly
>           * stalling in reclaim_throttle().
>           */
> -        if ((current_is_kswapd() ||
> -             (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
> -            sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> -                set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
> +        if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested) {
> +                if (cgroup_reclaim(sc) && writeback_throttling_sane(sc))
> +                        set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags);
> +
> +                if (current_is_kswapd())
> +                        set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags);
> +        }
>
>          /*
>           * Stall direct reclaim for IO completions if the lruvec is
> @@ -6591,7 +6594,8 @@ again:
>           */
>          if (!current_is_kswapd() && current_may_throttle() &&
>              !sc->hibernation_mode &&
> -            test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
> +            (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) ||
> +             test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags)))
>                  reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
>
>          if (should_continue_reclaim(pgdat, nr_node_reclaimed, sc))
> @@ -6848,7 +6852,7 @@ retry:
>
>                          lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
>                                                     zone->zone_pgdat);
> -                        clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
> +                        clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
>                  }
>          }
>
> @@ -7237,7 +7241,8 @@ static void clear_pgdat_congested(pg_dat
>  {
>          struct lruvec *lruvec = mem_cgroup_lruvec(NULL, pgdat);
>
> -        clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
> +        clear_bit(LRUVEC_NODE_CONGESTED, &lruvec->flags);
> +        clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
>          clear_bit(PGDAT_DIRTY, &pgdat->flags);
>          clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
>  }
> _
>
> Patches currently in -mm which might be from yosryahmed@xxxxxxxxxx are
>
>
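
For anyone skimming the thread, here is a minimal user-space sketch of the
asymmetry the patch introduces: root-cgroup reclaim (e.g. memory.reclaim) can
only clear LRUVEC_CGROUP_CONGESTED, so a node that kswapd marked congested
stays throttled until kswapd itself balances it.  The flag names mirror the
patch, but the helpers (set_flag(), cgroup_reclaim_done(),
kswapd_node_balanced(), should_throttle()) are hypothetical stand-ins, not the
real vmscan paths.

/* Stand-alone model of the two-bit congestion scheme; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

enum lruvec_flags {
        LRUVEC_CGROUP_CONGESTED,        /* set by cgroup-level reclaim */
        LRUVEC_NODE_CONGESTED,          /* set by kswapd node-level reclaim */
};

static unsigned long lruvec_flags;

static void set_flag(enum lruvec_flags f)   { lruvec_flags |= 1UL << f; }
static void clear_flag(enum lruvec_flags f) { lruvec_flags &= ~(1UL << f); }
static bool test_flag(enum lruvec_flags f)  { return lruvec_flags & (1UL << f); }

/* Direct reclaimers stall if either form of congestion is flagged. */
static bool should_throttle(void)
{
        return test_flag(LRUVEC_CGROUP_CONGESTED) ||
               test_flag(LRUVEC_NODE_CONGESTED);
}

/* Root-cgroup reclaim may only clear the bit it owns on its way out. */
static void cgroup_reclaim_done(void)
{
        clear_flag(LRUVEC_CGROUP_CONGESTED);
}

/* Only kswapd, once the node is balanced, clears both bits. */
static void kswapd_node_balanced(void)
{
        clear_flag(LRUVEC_NODE_CONGESTED);
        clear_flag(LRUVEC_CGROUP_CONGESTED);
}

int main(void)
{
        set_flag(LRUVEC_NODE_CONGESTED);        /* kswapd saw a congested node */
        cgroup_reclaim_done();                  /* root memory.reclaim finishes */
        printf("throttled after root reclaim: %d\n", should_throttle()); /* 1 */

        kswapd_node_balanced();                 /* node becomes balanced */
        printf("throttled after balancing:    %d\n", should_throttle()); /* 0 */
        return 0;
}

With the old single LRUVEC_CONGESTED bit, the first printf would already show
0: root reclaim would have unthrottled the still-unbalanced node, which is
exactly the case the patch closes.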