On Fri, Jun 23, 2023 at 5:00 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
>
> The quilt patch titled
>      Subject: mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
> has been removed from the -mm tree.  Its filename was
>      mm-vmscan-fix-root-proactive-reclaim-unthrottling-unbalanced-node.patch
>
> This patch was dropped because it was merged into the mm-stable branch
> of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

I was really hoping someone would take a look at this, ideally Johannes
because he reported the problem initially.

>
> ------------------------------------------------------
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Subject: mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
> Date: Wed, 21 Jun 2023 02:31:01 +0000
>
> When memory.reclaim was introduced, it became the first case where
> cgroup_reclaim() is true for the root cgroup.  Johannes concluded [1] that
> for most cases this is okay, except for one case.  Historically, kswapd
> would throttle reclaim on a node if a lot of pages marked for reclaim are
> under writeback (aka the node is congested).  This occurred by setting the
> LRUVEC_CONGESTED bit in lruvec->flags.  The bit would be cleared when the
> node is balanced.
>
> Similarly, cgroup reclaim would set the same bit when an lruvec is
> congested, and clear it on the way out of reclaim (to throttle local
> reclaimers).
>
> Before the introduction of memory.reclaim, the root memcg was the only
> target of kswapd reclaim, and non-root memcgs were the only targets of
> cgroup reclaim, so they would never interfere.  Using the same bit for
> both was fine.  After memory.reclaim, it is possible for cgroup reclaim on
> the root cgroup to clear the bit set by kswapd.  This would result in
> reclaim on the node being unthrottled before the node is balanced.
>
> Fix this by introducing separate bits for cgroup-level and node-level
> congestion.  kswapd can unthrottle an lruvec that is marked as congested
> by cgroup reclaim (as the entire node should no longer be congested), but
> not vice versa (to prevent premature unthrottling before the entire node
> is balanced).
>
> [1] https://lore.kernel.org/lkml/20230405200150.GA35884@xxxxxxxxxxx/
>
> Link: https://lkml.kernel.org/r/20230621023101.432780-1-yosryahmed@xxxxxxxxxx
> Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Reported-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Closes: https://lore.kernel.org/lkml/20230405200150.GA35884@xxxxxxxxxxx/
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: Roman Gushchin <roman.gushchin@xxxxxxxxx>
> Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
> Cc: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> Cc: Yu Zhao <yuzhao@xxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> ---
>
>  include/linux/mmzone.h |   18 +++++++++++++++---
>  mm/vmscan.c            |   19 ++++++++++++-------
>  2 files changed, 27 insertions(+), 10 deletions(-)
>
> --- a/include/linux/mmzone.h~mm-vmscan-fix-root-proactive-reclaim-unthrottling-unbalanced-node
> +++ a/include/linux/mmzone.h
> @@ -293,9 +293,21 @@ static inline bool is_active_lru(enum lr
>  #define ANON_AND_FILE 2
>
>  enum lruvec_flags {
> -        LRUVEC_CONGESTED,               /* lruvec has many dirty pages
> -                                         * backed by a congested BDI
> -                                         */
> +        /*
> +         * An lruvec has many dirty pages backed by a congested BDI:
> +         * 1. LRUVEC_CGROUP_CONGESTED is set by cgroup-level reclaim.
> +         *    It can be cleared by cgroup reclaim or kswapd.
> +         * 2. LRUVEC_NODE_CONGESTED is set by kswapd node-level reclaim.
> +         *    It can only be cleared by kswapd.
> +         *
> +         * Essentially, kswapd can unthrottle an lruvec throttled by cgroup
> +         * reclaim, but not vice versa.  This only applies to the root cgroup.
> +         * The goal is to prevent cgroup reclaim on the root cgroup (e.g.
> +         * memory.reclaim) to unthrottle an unbalanced node (that was throttled
> +         * by kswapd).
> +         */
> +        LRUVEC_CGROUP_CONGESTED,
> +        LRUVEC_NODE_CONGESTED,
>  };
>
>  #endif /* !__GENERATING_BOUNDS_H */
> --- a/mm/vmscan.c~mm-vmscan-fix-root-proactive-reclaim-unthrottling-unbalanced-node
> +++ a/mm/vmscan.c
> @@ -6578,10 +6578,13 @@ again:
>           * Legacy memcg will stall in page writeback so avoid forcibly
>           * stalling in reclaim_throttle().
>           */
> -        if ((current_is_kswapd() ||
> -             (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
> -            sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> -                set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
> +        if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested) {
> +                if (cgroup_reclaim(sc) && writeback_throttling_sane(sc))
> +                        set_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags);
> +
> +                if (current_is_kswapd())
> +                        set_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags);
> +        }
>
>          /*
>           * Stall direct reclaim for IO completions if the lruvec is
> @@ -6591,7 +6594,8 @@ again:
>           */
>          if (!current_is_kswapd() && current_may_throttle() &&
>              !sc->hibernation_mode &&
> -            test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
> +            (test_bit(LRUVEC_CGROUP_CONGESTED, &target_lruvec->flags) ||
> +             test_bit(LRUVEC_NODE_CONGESTED, &target_lruvec->flags)))
>                  reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
>
>          if (should_continue_reclaim(pgdat, nr_node_reclaimed, sc))
> @@ -6848,7 +6852,7 @@ retry:
>
>                          lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
>                                                     zone->zone_pgdat);
> -                        clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
> +                        clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
>                  }
>          }
>
> @@ -7237,7 +7241,8 @@ static void clear_pgdat_congested(pg_dat
>  {
>          struct lruvec *lruvec = mem_cgroup_lruvec(NULL, pgdat);
>
> -        clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
> +        clear_bit(LRUVEC_NODE_CONGESTED, &lruvec->flags);
> +        clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);
>          clear_bit(PGDAT_DIRTY, &pgdat->flags);
>          clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
>  }
> _
>
> Patches currently in -mm which might be from yosryahmed@xxxxxxxxxx are
>
>
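
For anyone skimming the thread, here is a minimal user-space sketch of the
asymmetry the patch introduces: root-cgroup reclaim (e.g. memory.reclaim) can
only clear LRUVEC_CGROUP_CONGESTED, so a node that kswapd marked congested
stays throttled until kswapd itself balances it.  The flag names mirror the
patch, but the helpers (set_flag(), cgroup_reclaim_done(),
kswapd_node_balanced(), should_throttle()) are hypothetical stand-ins, not the
real vmscan paths.

/* Stand-alone model of the two-bit congestion scheme; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

enum lruvec_flags {
        LRUVEC_CGROUP_CONGESTED,        /* set by cgroup-level reclaim */
        LRUVEC_NODE_CONGESTED,          /* set by kswapd node-level reclaim */
};

static unsigned long lruvec_flags;

static void set_flag(enum lruvec_flags f)   { lruvec_flags |= 1UL << f; }
static void clear_flag(enum lruvec_flags f) { lruvec_flags &= ~(1UL << f); }
static bool test_flag(enum lruvec_flags f)  { return lruvec_flags & (1UL << f); }

/* Direct reclaimers stall if either form of congestion is flagged. */
static bool should_throttle(void)
{
        return test_flag(LRUVEC_CGROUP_CONGESTED) ||
               test_flag(LRUVEC_NODE_CONGESTED);
}

/* Root-cgroup reclaim may only clear the bit it owns on its way out. */
static void cgroup_reclaim_done(void)
{
        clear_flag(LRUVEC_CGROUP_CONGESTED);
}

/* Only kswapd, once the node is balanced, clears both bits. */
static void kswapd_node_balanced(void)
{
        clear_flag(LRUVEC_NODE_CONGESTED);
        clear_flag(LRUVEC_CGROUP_CONGESTED);
}

int main(void)
{
        set_flag(LRUVEC_NODE_CONGESTED);        /* kswapd saw a congested node */
        cgroup_reclaim_done();                  /* root memory.reclaim finishes */
        printf("throttled after root reclaim: %d\n", should_throttle()); /* 1 */

        kswapd_node_balanced();                 /* node becomes balanced */
        printf("throttled after balancing:    %d\n", should_throttle()); /* 0 */
        return 0;
}

With the old single LRUVEC_CONGESTED bit, the first printf would already show
0: root reclaim would have unthrottled the still-unbalanced node, which is
exactly the case the patch closes.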