On Wed, Sep 4, 2024 at 9:28 AM Davidlohr Bueso <dave@xxxxxxxxxxxx> wrote:
>
> This adds support for allowing proactive reclaim in general on a
> NUMA system. A per-node interface extends support beyond the
> memcg-specific interface, respecting the current semantics of
> memory.reclaim: honoring the aging LRU and not supporting
> artificially triggered eviction on nodes belonging to non-bottom
> tiers.
>
> This patch allows userspace to do:
>
>     echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim
>
> One of the premises for this is to align semantically as closely as
> possible with memory.reclaim. For a brief time memcg did support a
> nodemask, until 55ab834a86a9 (Revert "mm: add nodes= arg to
> memory.reclaim"), for which the semantics around reclaim (eviction)
> vs demotion were not clear, leaving charging expectations broken.
>
> With this approach:
>
> 1. Users who do not use memcg can benefit from proactive reclaim.
>
> 2. Proactive reclaim on top tiers will trigger demotion, for which
> memory is still byte-addressable. Reclaiming on the bottom nodes
> will trigger eviction to swap (the traditional sense of reclaim).
> This follows the semantics of what is today part of the aging process
> on tiered memory, mirroring what every other form of reclaim does
> (reactive and memcg proactive reclaim). Furthermore, per-node
> proactive reclaim is not as susceptible to the memcg charging
> problem mentioned above.
>
> 3. Unlike memcg, there should be no surprises for callers expecting
> reclaim but getting a demotion instead. This essentially relies on
> the behavior of shrink_folio_list() after 6b426d071419 (mm: disable
> top-tier fallback to reclaim on proactive reclaim), without the
> expectations of try_to_free_mem_cgroup_pages().
>
> 4. Unlike the nodes= arg, this interface avoids confusing semantics,
> such as what exactly the user wants when mixing top-tier and
> low-tier nodes in the nodemask. Further, the per-node interface is
> less exposed to "free up memory in my container" use cases, where
> eviction is intended.
>
> 5. Users that *really* want to free up memory can use proactive
> reclaim on nodes known to be on the bottom tiers to force eviction
> in a natural way - higher access latencies are still better than
> swap. If compelled, while there are no guarantees and it is perhaps
> not worth the effort, users could also potentially follow a
> ladder-like approach to eventually free up the memory.
> Alternatively, perhaps an 'evict' option could be added to the
> parameters for both memory.reclaim and the per-node interface to
> force this action unconditionally.
>
> Signed-off-by: Davidlohr Bueso <dave@xxxxxxxxxxxx>
> ---
>
> This topic has been brought up in the past without much resolution.
> But today, I believe a number of semantics and expectations have
> become clearer (per the changelog), which could merit revisiting
> this.
>
>  Documentation/ABI/stable/sysfs-devices-node |  11 ++
>  drivers/base/node.c                         |   2 +
>  include/linux/swap.h                        |  16 ++
>  mm/vmscan.c                                 | 154 ++++++++++++++++----
>  4 files changed, 156 insertions(+), 27 deletions(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 402af4b2b905..5d69ee956cf9 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -221,3 +221,14 @@ Contact: Jiaqi Yan <jiaqiyan@xxxxxxxxxx>
>  Description:
>          Of the raw poisoned pages on a NUMA node, how many pages are
>          recovered by memory error recovery attempt.
> +
> +What:          /sys/devices/system/node/nodeX/reclaim
> +Date:          September 2024
> +Contact:       Linux Memory Management list <linux-mm@xxxxxxxxx>
> +Description:
> +        This is a write-only nested-keyed file which accepts the
> +        number of bytes to reclaim as well as the swappiness for
> +        this particular operation. Writing the amount of bytes
> +        induces memory reclaim on this node. On success, the
> +        specified amount of memory or more will have been reclaimed;
> +        -EAGAIN is returned if fewer bytes were reclaimed than
> +        requested.
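The described semantics read fine from userspace. For illustration, a
minimal consumer would look something like the below (untested sketch;
node0 and the 512M request are just examples, and EAGAIN only means
less than the requested amount could be reclaimed):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* Example: ask node 0 to reclaim 512M at swappiness 10. */
        const char *req = "512M swappiness=10";
        int fd = open("/sys/devices/system/node/node0/reclaim", O_WRONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, req, strlen(req)) < 0) {
                if (errno == EAGAIN)
                        fprintf(stderr, "reclaimed less than requested\n");
                else
                        perror("write");
                close(fd);
                return 1;
        }
        close(fd);
        return 0;
}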
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index eb72580288e6..d8ed19f8565b 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -626,6 +626,7 @@ static int register_node(struct node *node, int num)
>          } else {
>                  hugetlb_register_node(node);
>                  compaction_register_node(node);
> +                reclaim_register_node(node);
>          }
>
>          return error;
> @@ -642,6 +643,7 @@ void unregister_node(struct node *node)
>  {
>          hugetlb_unregister_node(node);
>          compaction_unregister_node(node);
> +        reclaim_unregister_node(node);
>          node_remove_accesses(node);
>          node_remove_caches(node);
>          device_unregister(&node->dev);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 248db1dd7812..456e3aedb964 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -423,6 +423,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
>  long remove_mapping(struct address_space *mapping, struct folio *folio);
>
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +extern int reclaim_register_node(struct node *node);
> +extern void reclaim_unregister_node(struct node *node);
> +
> +#else
> +
> +static inline int reclaim_register_node(struct node *node)
> +{
> +        return 0;
> +}
> +
> +static inline void reclaim_unregister_node(struct node *node)
> +{
> +}
> +#endif /* CONFIG_SYSFS && CONFIG_NUMA */
> +
>  #ifdef CONFIG_NUMA
>  extern int node_reclaim_mode;
>  extern int sysctl_min_unmapped_ratio;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5dc96a843466..56ddf54366e4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -56,6 +56,7 @@
>  #include <linux/khugepaged.h>
>  #include <linux/rculist_nulls.h>
>  #include <linux/random.h>
> +#include <linux/parser.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -92,10 +93,8 @@ struct scan_control {
>          unsigned long anon_cost;
>          unsigned long file_cost;
>
> -#ifdef CONFIG_MEMCG
>          /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
>          int *proactive_swappiness;
> -#endif
>
>          /* Can active folios be deactivated as part of reclaim? */
>  #define DEACTIVATE_ANON 1
> @@ -266,6 +265,9 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>
>  static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
>  {
> +        if (sc->proactive && sc->proactive_swappiness)
> +                return *sc->proactive_swappiness;
> +

This code is already upstream, right?
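IIRC the CONFIG_MEMCG flavor gained this exact check when the
swappiness= argument was added to memory.reclaim; quoting from memory,
it reads roughly like the below, so this hunk would just be mirroring
it for the !CONFIG_MEMCG stub:

/* From memory: the upstream CONFIG_MEMCG side of sc_swappiness(). */
static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
{
        if (sc->proactive && sc->proactive_swappiness)
                return *sc->proactive_swappiness;
        return mem_cgroup_swappiness(memcg);
}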
>          return READ_ONCE(vm_swappiness);
>  }
>  #endif
> @@ -7470,36 +7472,28 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
>  /*
>   * Try to free up some pages from this node through reclaim.
>   */
> -static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
> +static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
> +                          unsigned long nr_pages, struct scan_control *sc)
>  {
> -        /* Minimum pages needed in order to stay on node */
> -        const unsigned long nr_pages = 1 << order;
>          struct task_struct *p = current;
>          unsigned int noreclaim_flag;
> -        struct scan_control sc = {
> -                .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> -                .gfp_mask = current_gfp_context(gfp_mask),
> -                .order = order,
> -                .priority = NODE_RECLAIM_PRIORITY,
> -                .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> -                .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> -                .may_swap = 1,
> -                .reclaim_idx = gfp_zone(gfp_mask),
> -        };
>          unsigned long pflags;
>
> -        trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
> -                                           sc.gfp_mask);
> +        trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, sc->order,
> +                                           sc->gfp_mask);
>
>          cond_resched();
> -        psi_memstall_enter(&pflags);
> +
> +        if (!sc->proactive)
> +                psi_memstall_enter(&pflags);
> +
>          delayacct_freepages_start();
> -        fs_reclaim_acquire(sc.gfp_mask);
> +        fs_reclaim_acquire(sc->gfp_mask);
>          /*
>           * We need to be able to allocate from the reserves for RECLAIM_UNMAP
>           */
>          noreclaim_flag = memalloc_noreclaim_save();
> -        set_task_reclaim_state(p, &sc.reclaim_state);
> +        set_task_reclaim_state(p, &sc->reclaim_state);
>
>          if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
>              node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) > pgdat->min_slab_pages) {
> @@ -7508,24 +7502,38 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>                   * priorities until we have enough memory freed.
>                   */
>                  do {
> -                        shrink_node(pgdat, &sc);
> -                } while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
> +                        shrink_node(pgdat, sc);
> +                } while (sc->nr_reclaimed < nr_pages && --sc->priority >= 0);
>          }
>
>          set_task_reclaim_state(p, NULL);
>          memalloc_noreclaim_restore(noreclaim_flag);
> -        fs_reclaim_release(sc.gfp_mask);
> -        psi_memstall_leave(&pflags);
> +        fs_reclaim_release(sc->gfp_mask);
>          delayacct_freepages_end();
>
> -        trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed);
> +        if (!sc->proactive)
> +                psi_memstall_leave(&pflags);
> +
> +        trace_mm_vmscan_node_reclaim_end(sc->nr_reclaimed);
>
> -        return sc.nr_reclaimed >= nr_pages;
> +        return sc->nr_reclaimed;
>  }
>
>  int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>  {
>          int ret;
> +        /* Minimum pages needed in order to stay on node */
> +        const unsigned long nr_pages = 1 << order;
> +        struct scan_control sc = {
> +                .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> +                .gfp_mask = current_gfp_context(gfp_mask),
> +                .order = order,
> +                .priority = NODE_RECLAIM_PRIORITY,
> +                .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> +                .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> +                .may_swap = 1,
> +                .reclaim_idx = gfp_zone(gfp_mask),
> +        };
>
>          /*
>           * Node reclaim reclaims unmapped file backed pages and
> @@ -7560,7 +7568,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>          if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
>                  return NODE_RECLAIM_NOSCAN;
>
> -        ret = __node_reclaim(pgdat, gfp_mask, order);
> +        ret = __node_reclaim(pgdat, gfp_mask, nr_pages, &sc) >= nr_pages;
>          clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
>
>          if (ret)
> @@ -7617,3 +7625,95 @@ void check_move_unevictable_folios(struct folio_batch *fbatch)
>          }
>  }
>
>  EXPORT_SYMBOL_GPL(check_move_unevictable_folios);
> +
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +
> +enum {
> +        MEMORY_RECLAIM_SWAPPINESS = 0,
> +        MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t tokens = {
> +        { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> +        { MEMORY_RECLAIM_NULL, NULL },
> +};
> +
> +static ssize_t reclaim_store(struct device *dev,
> +                             struct device_attribute *attr,
> +                             const char *buf, size_t count)
> +{
> +        int nid = dev->id;
> +        gfp_t gfp_mask = GFP_KERNEL;
> +        struct pglist_data *pgdat = NODE_DATA(nid);
> +        unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +        unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +        int swappiness = -1;
> +        char *old_buf, *start;
> +        substring_t args[MAX_OPT_ARGS];
> +        struct scan_control sc = {
> +                .gfp_mask = current_gfp_context(gfp_mask),
> +                .reclaim_idx = gfp_zone(gfp_mask),
> +                .priority = DEF_PRIORITY,
> +                .may_writepage = !laptop_mode,
> +                .may_unmap = 1,
> +                .may_swap = 1,
> +                .proactive = 1,
> +        };
> +
> +        buf = strstrip((char *)buf);
> +
> +        old_buf = (char *)buf;
> +        nr_to_reclaim = memparse(buf, (char **)&buf) / PAGE_SIZE;
> +        if (buf == old_buf)
> +                return -EINVAL;
> +
> +        buf = strstrip((char *)buf);
> +
> +        while ((start = strsep((char **)&buf, " ")) != NULL) {
> +                if (!strlen(start))
> +                        continue;
> +                switch (match_token(start, tokens, args)) {
> +                case MEMORY_RECLAIM_SWAPPINESS:
> +                        if (match_int(&args[0], &swappiness))
> +                                return -EINVAL;
> +                        if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
> +                                return -EINVAL;
> +                        break;
> +                default:
> +                        return -EINVAL;
> +                }
> +        }
> +
> +        sc.nr_to_reclaim = max(nr_to_reclaim, SWAP_CLUSTER_MAX);
> +        while (nr_reclaimed < nr_to_reclaim) {
> +                unsigned long reclaimed;
> +
> +                if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
> +                        return -EAGAIN;

Can the PGDAT_RECLAIM_LOCKED check be moved into __node_reclaim()? It
is duplicated in node_reclaim().
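Something like this completely untested sketch, perhaps (-EBUSY is an
arbitrary pick so that the two callers can translate contention to
NODE_RECLAIM_NOSCAN and -EAGAIN respectively):

static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
                          unsigned long nr_pages, struct scan_control *sc)
{
        /* Serialize reclaimers on this node in a single place. */
        if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
                return -EBUSY;

        /* ... existing __node_reclaim() body ... */

        clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
        return sc->nr_reclaimed;
}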
> +
> +                /* does cond_resched() */
> +                reclaimed = __node_reclaim(pgdat, gfp_mask,
> +                                           nr_to_reclaim - nr_reclaimed, &sc);
> +
> +                clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
> +
> +                if (!reclaimed && !nr_retries--)
> +                        break;
> +
> +                nr_reclaimed += reclaimed;
> +        }

In the memcg code (i.e. memory_reclaim()) we also check for pending
signals, and drain the LRUs before the last iteration. Do we need this
here as well?

This leads to my next question: there is a lot of common code with
memory_reclaim(). Should we refactor some of it? At least the arguments
parsing part looks almost identical.

> +
> +        return nr_reclaimed < nr_to_reclaim ? -EAGAIN : count;
> +}
> +
> +static DEVICE_ATTR_WO(reclaim);
> +int reclaim_register_node(struct node *node)
> +{
> +        return device_create_file(&node->dev, &dev_attr_reclaim);
> +}
> +
> +void reclaim_unregister_node(struct node *node)
> +{
> +        return device_remove_file(&node->dev, &dev_attr_reclaim);
> +}
> +#endif
> --
> 2.39.2
>
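To be concrete on the signals/draining point above: adapting what
memory_reclaim() does, the write loop here would become something like
the below (untested, reusing this patch's variables):

        while (nr_reclaimed < nr_to_reclaim) {
                unsigned long reclaimed;

                if (signal_pending(current))
                        return -EINTR;

                /*
                 * Final attempt: drain the percpu lru caches in the
                 * hope of surfacing more evictable folios.
                 */
                if (!nr_retries)
                        lru_add_drain_all();

                if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
                        return -EAGAIN;

                /* does cond_resched() */
                reclaimed = __node_reclaim(pgdat, gfp_mask,
                                           nr_to_reclaim - nr_reclaimed, &sc);
                clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);

                if (!reclaimed && !nr_retries--)
                        break;

                nr_reclaimed += reclaimed;
        }

If it ends up looking like that, the argument parsing and this retry
loop seem like good candidates for a helper shared with
memory_reclaim(), with the actual reclaim attempt supplied by the
caller.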