Re: [PATCH -next] mm: introduce per-node proactive reclaim interface

On Wed, Sep 4, 2024 at 9:28 AM Davidlohr Bueso <dave@xxxxxxxxxxxx> wrote:
>
> This adds support for proactive reclaim in general on a NUMA system.
> A per-node interface extends support beyond the memcg-specific one,
> keeping the current semantics of memory.reclaim: respecting the
> aging LRU and not supporting artificially triggered eviction on
> nodes belonging to non-bottom tiers.
>
> This patch allows userspace to do:
>
>      echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim
>
> One of the premises for this is to align as closely as possible with
> the semantics of memory.reclaim. For a brief time memcg did support a
> nodemask, until 55ab834a86a9 (Revert "mm: add nodes= arg to
> memory.reclaim"), because the semantics around reclaim (eviction) vs
> demotion were not clear, breaking charging expectations.
>
> With this approach:
>
> 1. Users who do not use memcg can benefit from proactive reclaim.
>
> 2. Proactive reclaim on top tiers will trigger demotion, for which
> memory is still byte-addressable. Reclaiming on the bottom nodes
> will trigger eviction to swap (the traditional sense of reclaim).
> This follows the semantics of what is today part of the aging process
> on tiered memory, mirroring what every other form of reclaim does
> (reactive and memcg proactive reclaim). Furthermore, per-node
> proactive reclaim is not as susceptible to the memcg charging
> problem mentioned above.
>
> 3. Unlike memcg, there should be no surprises for callers expecting
> reclaim but instead getting a demotion. This essentially relies on
> the behavior of shrink_folio_list() after 6b426d071419 (mm: disable
> top-tier fallback to reclaim on proactive reclaim), without the
> expectations of try_to_free_mem_cgroup_pages().
>
> 4. Unlike the nodes= arg, this interface avoids confusing semantics,
> such as what exactly the user wants when mixing top-tier and
> low-tier nodes in the nodemask. Furthermore, a per-node interface is
> less exposed to "free up memory in my container" usecases, where
> eviction is intended.
>
> 5. Users that *really* want to free up memory can use proactive
> reclaim on nodes known to be on the bottom tiers to force eviction
> in a natural way - higher access latencies are still better than
> swap. If compelled, though with no guarantees and perhaps not worth
> the effort, users could also potentially follow a ladder-like
> approach to eventually free up the memory. Alternatively, perhaps an
> 'evict' option could be added to the parameters of both memory.reclaim
> and the per-node interface to force this action unconditionally.
>
> Signed-off-by: Davidlohr Bueso <dave@xxxxxxxxxxxx>
> ---
>
> This topic has been brought up in the past without much resolution.
> But today, I believe a number of semantics and expectations have become
> clearer (per the changelog), which could merit revisiting this.
>
>  Documentation/ABI/stable/sysfs-devices-node |  11 ++
>  drivers/base/node.c                         |   2 +
>  include/linux/swap.h                        |  16 ++
>  mm/vmscan.c                                 | 154 ++++++++++++++++----
>  4 files changed, 156 insertions(+), 27 deletions(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 402af4b2b905..5d69ee956cf9 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -221,3 +221,14 @@ Contact:   Jiaqi Yan <jiaqiyan@xxxxxxxxxx>
>  Description:
>                 Of the raw poisoned pages on a NUMA node, how many pages are
>                 recovered by memory error recovery attempt.
> +
> +What:          /sys/devices/system/node/nodeX/reclaim
> +Date:          September 2024
> +Contact:       Linux Memory Management list <linux-mm@xxxxxxxxx>
> +Description:
> +               This is a write-only, nested-keyed file which accepts the
> +               number of bytes to reclaim as well as the swappiness to use
> +               for this particular operation. Writing induces memory reclaim
> +               on this node. On success, at least the specified amount of
> +               memory will have been reclaimed; -EAGAIN is returned if fewer
> +               bytes were reclaimed than requested.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index eb72580288e6..d8ed19f8565b 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -626,6 +626,7 @@ static int register_node(struct node *node, int num)
>         } else {
>                 hugetlb_register_node(node);
>                 compaction_register_node(node);
> +               reclaim_register_node(node);
>         }
>
>         return error;
> @@ -642,6 +643,7 @@ void unregister_node(struct node *node)
>  {
>         hugetlb_unregister_node(node);
>         compaction_unregister_node(node);
> +       reclaim_unregister_node(node);
>         node_remove_accesses(node);
>         node_remove_caches(node);
>         device_unregister(&node->dev);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 248db1dd7812..456e3aedb964 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -423,6 +423,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
>  long remove_mapping(struct address_space *mapping, struct folio *folio);
>
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +extern int reclaim_register_node(struct node *node);
> +extern void reclaim_unregister_node(struct node *node);
> +
> +#else
> +
> +static inline int reclaim_register_node(struct node *node)
> +{
> +       return 0;
> +}
> +
> +static inline void reclaim_unregister_node(struct node *node)
> +{
> +}
> +#endif /* CONFIG_SYSFS && CONFIG_NUMA */
> +
>  #ifdef CONFIG_NUMA
>  extern int node_reclaim_mode;
>  extern int sysctl_min_unmapped_ratio;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5dc96a843466..56ddf54366e4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -56,6 +56,7 @@
>  #include <linux/khugepaged.h>
>  #include <linux/rculist_nulls.h>
>  #include <linux/random.h>
> +#include <linux/parser.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -92,10 +93,8 @@ struct scan_control {
>         unsigned long   anon_cost;
>         unsigned long   file_cost;
>
> -#ifdef CONFIG_MEMCG
>         /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
>         int *proactive_swappiness;
> -#endif
>
>         /* Can active folios be deactivated as part of reclaim? */
>  #define DEACTIVATE_ANON 1
> @@ -266,6 +265,9 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>
>  static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
>  {
> +       if (sc->proactive && sc->proactive_swappiness)
> +               return *sc->proactive_swappiness;
> +

This code is already upstream, right?

>         return READ_ONCE(vm_swappiness);
>  }
>  #endif
> @@ -7470,36 +7472,28 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
>  /*
>   * Try to free up some pages from this node through reclaim.
>   */
> -static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
> +static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
> +                         unsigned long nr_pages, struct scan_control *sc)
>  {
> -       /* Minimum pages needed in order to stay on node */
> -       const unsigned long nr_pages = 1 << order;
>         struct task_struct *p = current;
>         unsigned int noreclaim_flag;
> -       struct scan_control sc = {
> -               .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> -               .gfp_mask = current_gfp_context(gfp_mask),
> -               .order = order,
> -               .priority = NODE_RECLAIM_PRIORITY,
> -               .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> -               .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> -               .may_swap = 1,
> -               .reclaim_idx = gfp_zone(gfp_mask),
> -       };
>         unsigned long pflags;
>
> -       trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
> -                                          sc.gfp_mask);
> +       trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, sc->order,
> +                                          sc->gfp_mask);
>
>         cond_resched();
> -       psi_memstall_enter(&pflags);
> +
> +       if (!sc->proactive)
> +               psi_memstall_enter(&pflags);
> +
>         delayacct_freepages_start();
> -       fs_reclaim_acquire(sc.gfp_mask);
> +       fs_reclaim_acquire(sc->gfp_mask);
>         /*
>          * We need to be able to allocate from the reserves for RECLAIM_UNMAP
>          */
>         noreclaim_flag = memalloc_noreclaim_save();
> -       set_task_reclaim_state(p, &sc.reclaim_state);
> +       set_task_reclaim_state(p, &sc->reclaim_state);
>
>         if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
>             node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) > pgdat->min_slab_pages) {
> @@ -7508,24 +7502,38 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>                  * priorities until we have enough memory freed.
>                  */
>                 do {
> -                       shrink_node(pgdat, &sc);
> -               } while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
> +                       shrink_node(pgdat, sc);
> +               } while (sc->nr_reclaimed < nr_pages && --sc->priority >= 0);
>         }
>
>         set_task_reclaim_state(p, NULL);
>         memalloc_noreclaim_restore(noreclaim_flag);
> -       fs_reclaim_release(sc.gfp_mask);
> -       psi_memstall_leave(&pflags);
> +       fs_reclaim_release(sc->gfp_mask);
>         delayacct_freepages_end();
>
> -       trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed);
> +       if (!sc->proactive)
> +               psi_memstall_leave(&pflags);
> +
> +       trace_mm_vmscan_node_reclaim_end(sc->nr_reclaimed);
>
> -       return sc.nr_reclaimed >= nr_pages;
> +       return sc->nr_reclaimed;
>  }
>
>  int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>  {
>         int ret;
> +       /* Minimum pages needed in order to stay on node */
> +       const unsigned long nr_pages = 1 << order;
> +       struct scan_control sc = {
> +               .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> +               .gfp_mask = current_gfp_context(gfp_mask),
> +               .order = order,
> +               .priority = NODE_RECLAIM_PRIORITY,
> +               .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> +               .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> +               .may_swap = 1,
> +               .reclaim_idx = gfp_zone(gfp_mask),
> +       };
>
>         /*
>          * Node reclaim reclaims unmapped file backed pages and
> @@ -7560,7 +7568,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>         if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
>                 return NODE_RECLAIM_NOSCAN;
>
> -       ret = __node_reclaim(pgdat, gfp_mask, order);
> +       ret = __node_reclaim(pgdat, gfp_mask, nr_pages, &sc) >= nr_pages;
>         clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
>
>         if (ret)
> @@ -7617,3 +7625,95 @@ void check_move_unevictable_folios(struct folio_batch *fbatch)
>         }
>  }
>  EXPORT_SYMBOL_GPL(check_move_unevictable_folios);
> +
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +
> +enum {
> +       MEMORY_RECLAIM_SWAPPINESS = 0,
> +       MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t tokens = {
> +       { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> +       { MEMORY_RECLAIM_NULL, NULL },
> +};
> +
> +static ssize_t reclaim_store(struct device *dev,
> +                            struct device_attribute *attr,
> +                            const char *buf, size_t count)
> +{
> +       int nid = dev->id;
> +       gfp_t gfp_mask = GFP_KERNEL;
> +       struct pglist_data *pgdat = NODE_DATA(nid);
> +       unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +       unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +       int swappiness = -1;
> +       char *old_buf, *start;
> +       substring_t args[MAX_OPT_ARGS];
> +       struct scan_control sc = {
> +               .gfp_mask = current_gfp_context(gfp_mask),
> +               .reclaim_idx = gfp_zone(gfp_mask),
> +               .priority = DEF_PRIORITY,
> +               .may_writepage = !laptop_mode,
> +               .may_unmap = 1,
> +               .may_swap = 1,
> +               .proactive = 1,
> +       };
> +
> +       buf = strstrip((char *)buf);
> +
> +       old_buf = (char *)buf;
> +       nr_to_reclaim = memparse(buf, (char **)&buf) / PAGE_SIZE;
> +       if (buf == old_buf)
> +               return -EINVAL;
> +
> +       buf = strstrip((char *)buf);
> +
> +       while ((start = strsep((char **)&buf, " ")) != NULL) {
> +               if (!strlen(start))
> +                       continue;
> +               switch (match_token(start, tokens, args)) {
> +               case MEMORY_RECLAIM_SWAPPINESS:
> +                       if (match_int(&args[0], &swappiness))
> +                               return -EINVAL;
> +                       if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
> +                               return -EINVAL;
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +
> +       sc.nr_to_reclaim = max(nr_to_reclaim, SWAP_CLUSTER_MAX);
> +       while (nr_reclaimed < nr_to_reclaim) {
> +               unsigned long reclaimed;
> +
> +               if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
> +                       return -EAGAIN;

Can the PGDAT_RECLAIM_LOCKED check be moved into __node_reclaim()?
It is duplicated in node_reclaim().
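
Something like the below (completely untested sketch) would keep the
lock handling in one place; callers would then have to treat a
negative return as "someone else is already reclaiming on this node",
e.g. node_reclaim() mapping -EAGAIN to NODE_RECLAIM_NOSCAN:

        static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
                                  unsigned long nr_pages, struct scan_control *sc)
        {
                if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
                        return -EAGAIN;

                /* ... existing reclaim body, unchanged ... */

                clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
                return sc->nr_reclaimed;
        }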

> +
> +               /* does cond_resched() */
> +               reclaimed = __node_reclaim(pgdat, gfp_mask,
> +                                          nr_to_reclaim - nr_reclaimed, &sc);
> +
> +               clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
> +
> +               if (!reclaimed && !nr_retries--)
> +                       break;
> +
> +               nr_reclaimed += reclaimed;
> +       }

In the memcg code (i.e. memory_reclaim()) we also check for pending
signals, and drain the LRUs before the last iteration. Do we need this
here as well?
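
If so, the loop here could mirror memory_reclaim() along these lines
(untested):

        while (nr_reclaimed < nr_to_reclaim) {
                unsigned long reclaimed;

                /* bail out if the writer gave up waiting */
                if (signal_pending(current))
                        return -EINTR;

                /*
                 * Final attempt: drain the percpu lru caches in the
                 * hope of exposing more evictable pages.
                 */
                if (!nr_retries)
                        lru_add_drain_all();

                if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
                        return -EAGAIN;

                /* does cond_resched() */
                reclaimed = __node_reclaim(pgdat, gfp_mask,
                                           nr_to_reclaim - nr_reclaimed, &sc);

                clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);

                if (!reclaimed && !nr_retries--)
                        break;

                nr_reclaimed += reclaimed;
        }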

This leads to my next question: there is a lot of common code with
memory_reclaim(). Should we refactor some of it? At least the
argument-parsing part looks almost identical.
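
For instance, the parsing could move into a shared helper along these
lines (hypothetical name, sketch only; the tokens table would need to
be shared as well):

        /*
         * Parse "<bytes> [swappiness=<n>]" as both memory.reclaim and the
         * per-node interface expect. Returns 0 on success, with
         * *nr_to_reclaim in pages and *swappiness set (-1 if absent).
         */
        static int reclaim_parse_args(char *buf, unsigned long *nr_to_reclaim,
                                      int *swappiness)
        {
                substring_t args[MAX_OPT_ARGS];
                char *old_buf, *start;

                buf = strstrip(buf);

                old_buf = buf;
                *nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
                if (buf == old_buf)
                        return -EINVAL;

                *swappiness = -1;
                buf = strstrip(buf);

                while ((start = strsep(&buf, " ")) != NULL) {
                        if (!strlen(start))
                                continue;
                        switch (match_token(start, tokens, args)) {
                        case MEMORY_RECLAIM_SWAPPINESS:
                                if (match_int(&args[0], swappiness))
                                        return -EINVAL;
                                if (*swappiness < MIN_SWAPPINESS ||
                                    *swappiness > MAX_SWAPPINESS)
                                        return -EINVAL;
                                break;
                        default:
                                return -EINVAL;
                        }
                }

                return 0;
        }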

> +
> +       return nr_reclaimed < nr_to_reclaim ? -EAGAIN : count;
> +}
> +
> +static DEVICE_ATTR_WO(reclaim);
> +int reclaim_register_node(struct node *node)
> +{
> +       return device_create_file(&node->dev, &dev_attr_reclaim);
> +}
> +
> +void reclaim_unregister_node(struct node *node)
> +{
> +       return device_remove_file(&node->dev, &dev_attr_reclaim);
> +}
> +#endif
> --
> 2.39.2
>