[PATCH -next] mm: introduce per-node proactive reclaim interface

This adds support for proactive reclaim in general on a NUMA
system. A per-node interface extends support beyond the
memcg-specific one, while respecting the current semantics of
memory.reclaim: honoring the aging LRU and not artificially
triggering eviction on nodes belonging to non-bottom tiers.

This patch allows userspace to do:

     echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim
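
For illustration, a minimal userspace sketch driving the new file
(not part of this patch; node0 and the request string are
arbitrary). Per the interface semantics, reclaiming fewer bytes
than requested surfaces as -EAGAIN:

     #include <errno.h>
     #include <fcntl.h>
     #include <stdio.h>
     #include <string.h>
     #include <unistd.h>

     int main(void)
     {
             const char req[] = "512M swappiness=10";
             int fd = open("/sys/devices/system/node/node0/reclaim",
                           O_WRONLY);

             if (fd < 0) {
                     perror("open");
                     return 1;
             }
             /* a short reclaim (fewer bytes than asked) is -EAGAIN */
             if (write(fd, req, strlen(req)) < 0) {
                     if (errno == EAGAIN)
                             fprintf(stderr, "reclaimed less than 512M\n");
                     else
                             perror("write");
             }
             close(fd);
             return 0;
     }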

One of the premises for this is to align semantically as closely
as possible with memory.reclaim. For a brief time memcg did
support a nodemask, until 55ab834a86a9 (Revert "mm: add nodes=
arg to memory.reclaim"), because the semantics around reclaim
(eviction) vs demotion were not clear, breaking charging
expectations.

With this approach:

1. Users who do not use memcg can benefit from proactive reclaim.

2. Proactive reclaim on top tiers will trigger demotion, after
which the memory is still byte-addressable. Reclaiming on the
bottom nodes will trigger eviction to swap (the traditional sense
of reclaim). This follows the semantics of what is today part of
the aging process on tiered memory, mirroring what every other
form of reclaim does (reactive and memcg proactive reclaim).
Furthermore, per-node proactive reclaim is not as susceptible to
the memcg charging problem mentioned above.

3. Unlike memcg, there should be no surprise of a caller
expecting reclaim but instead getting a demotion. This
essentially relies on the behavior of shrink_folio_list() after
6b426d071419 (mm: disable top-tier fallback to reclaim on
proactive reclaim), without the expectations of
try_to_free_mem_cgroup_pages().

4. Unlike the nodes= arg, this interface avoids confusing
semantics, such as what exactly the user wants when mixing
top-tier and low-tier nodes in the nodemask. Further, a per-node
interface is less exposed to "free up memory in my container"
usecases, where eviction is intended.

5. Users that *really* want to free up memory can use proactive
reclaim on nodes known to be on the bottom tiers to force
eviction in a natural way - higher access latencies are still
better than swap. If compelled, while with no guarantees and
perhaps not worth the effort, users could also potentially follow
a ladder-like approach to eventually free up the memory, as
sketched below. Alternatively, perhaps an 'evict' option could be
added to the parameters of both memory.reclaim and the per-node
interface to force this action unconditionally.
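
For illustration only, a hypothetical sketch of that ladder-like
approach (the node IDs and their tier ordering are assumptions
the user would have to derive from platform knowledge; nothing
here is exported by the new interface, and there is no guarantee
the demoted pages are the ones eventually evicted):

     #include <fcntl.h>
     #include <stdio.h>
     #include <string.h>
     #include <unistd.h>

     static void reclaim_node(int nid, const char *req)
     {
             char path[64];
             int fd;

             snprintf(path, sizeof(path),
                      "/sys/devices/system/node/node%d/reclaim", nid);
             fd = open(path, O_WRONLY);
             if (fd < 0)
                     return;
             /* -EAGAIN means only part of the request was reclaimed */
             if (write(fd, req, strlen(req)) < 0)
                     perror(path);
             close(fd);
     }

     int main(void)
     {
             /* assumed ordering: node 0 (top tier) ... node 2 (bottom) */
             const int ladder[] = { 0, 1, 2 };
             unsigned int i;

             /* demote down the ladder; the last write can evict to swap */
             for (i = 0; i < sizeof(ladder) / sizeof(ladder[0]); i++)
                     reclaim_node(ladder[i], "512M");
             return 0;
     }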

Signed-off-by: Davidlohr Bueso <dave@xxxxxxxxxxxx>
---

This topic has been brought up in the past without much resolution.
But today, I believe a number of semantics and expectations have become
clearer (per the changelog), which could merit revisiting this.

 Documentation/ABI/stable/sysfs-devices-node |  11 ++
 drivers/base/node.c                         |   2 +
 include/linux/swap.h                        |  16 ++
 mm/vmscan.c                                 | 154 ++++++++++++++++----
 4 files changed, 156 insertions(+), 27 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 402af4b2b905..5d69ee956cf9 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -221,3 +221,14 @@ Contact:	Jiaqi Yan <jiaqiyan@xxxxxxxxxx>
 Description:
 		Of the raw poisoned pages on a NUMA node, how many pages are
 		recovered by memory error recovery attempt.
+
+What:		/sys/devices/system/node/nodeX/reclaim
+Date:		September 2024
+Contact:	Linux Memory Management list <linux-mm@xxxxxxxxx>
+Description:
+		This is a write-only, nested-keyed file that accepts the number
+		of bytes to reclaim, as well as the swappiness for this
+		particular operation. Writing induces memory reclaim on this
+		node. Upon success, the specified amount of memory or more will
+		have been reclaimed; -EAGAIN is returned if fewer bytes were
+		reclaimed than specified.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index eb72580288e6..d8ed19f8565b 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -626,6 +626,7 @@ static int register_node(struct node *node, int num)
 	} else {
 		hugetlb_register_node(node);
 		compaction_register_node(node);
+		reclaim_register_node(node);
 	}
 
 	return error;
@@ -642,6 +643,7 @@ void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node);
 	compaction_unregister_node(node);
+	reclaim_unregister_node(node);
 	node_remove_accesses(node);
 	node_remove_caches(node);
 	device_unregister(&node->dev);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 248db1dd7812..456e3aedb964 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -423,6 +423,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 long remove_mapping(struct address_space *mapping, struct folio *folio);
 
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+extern int reclaim_register_node(struct node *node);
+extern void reclaim_unregister_node(struct node *node);
+
+#else
+
+static inline int reclaim_register_node(struct node *node)
+{
+	return 0;
+}
+
+static inline void reclaim_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_SYSFS && CONFIG_NUMA */
+
 #ifdef CONFIG_NUMA
 extern int node_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5dc96a843466..56ddf54366e4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,7 @@
 #include <linux/khugepaged.h>
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
+#include <linux/parser.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -92,10 +93,8 @@ struct scan_control {
 	unsigned long	anon_cost;
 	unsigned long	file_cost;
 
-#ifdef CONFIG_MEMCG
 	/* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
 	int *proactive_swappiness;
-#endif
 
 	/* Can active folios be deactivated as part of reclaim? */
 #define DEACTIVATE_ANON 1
@@ -266,6 +265,9 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 
 static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
 {
+	if (sc->proactive && sc->proactive_swappiness)
+		return *sc->proactive_swappiness;
+
 	return READ_ONCE(vm_swappiness);
 }
 #endif
@@ -7470,36 +7472,28 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
 /*
  * Try to free up some pages from this node through reclaim.
  */
-static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
+static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
+			  unsigned long nr_pages, struct scan_control *sc)
 {
-	/* Minimum pages needed in order to stay on node */
-	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	unsigned int noreclaim_flag;
-	struct scan_control sc = {
-		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = current_gfp_context(gfp_mask),
-		.order = order,
-		.priority = NODE_RECLAIM_PRIORITY,
-		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
-		.may_swap = 1,
-		.reclaim_idx = gfp_zone(gfp_mask),
-	};
 	unsigned long pflags;
 
-	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
-					   sc.gfp_mask);
+	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, sc->order,
+					   sc->gfp_mask);
 
 	cond_resched();
-	psi_memstall_enter(&pflags);
+
+	if (!sc->proactive)
+		psi_memstall_enter(&pflags);
+
 	delayacct_freepages_start();
-	fs_reclaim_acquire(sc.gfp_mask);
+	fs_reclaim_acquire(sc->gfp_mask);
 	/*
 	 * We need to be able to allocate from the reserves for RECLAIM_UNMAP
 	 */
 	noreclaim_flag = memalloc_noreclaim_save();
-	set_task_reclaim_state(p, &sc.reclaim_state);
+	set_task_reclaim_state(p, &sc->reclaim_state);
 
 	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
 	    node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) > pgdat->min_slab_pages) {
@@ -7508,24 +7502,38 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		 * priorities until we have enough memory freed.
 		 */
 		do {
-			shrink_node(pgdat, &sc);
-		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
+			shrink_node(pgdat, sc);
+		} while (sc->nr_reclaimed < nr_pages && --sc->priority >= 0);
 	}
 
 	set_task_reclaim_state(p, NULL);
 	memalloc_noreclaim_restore(noreclaim_flag);
-	fs_reclaim_release(sc.gfp_mask);
-	psi_memstall_leave(&pflags);
+	fs_reclaim_release(sc->gfp_mask);
 	delayacct_freepages_end();
 
-	trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed);
+	if (!sc->proactive)
+		psi_memstall_leave(&pflags);
+
+	trace_mm_vmscan_node_reclaim_end(sc->nr_reclaimed);
 
-	return sc.nr_reclaimed >= nr_pages;
+	return sc->nr_reclaimed;
 }
 
 int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 {
 	int ret;
+	/* Minimum pages needed in order to stay on node */
+	const unsigned long nr_pages = 1 << order;
+	struct scan_control sc = {
+		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
+		.gfp_mask = current_gfp_context(gfp_mask),
+		.order = order,
+		.priority = NODE_RECLAIM_PRIORITY,
+		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
+		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
+		.may_swap = 1,
+		.reclaim_idx = gfp_zone(gfp_mask),
+	};
 
 	/*
 	 * Node reclaim reclaims unmapped file backed pages and
@@ -7560,7 +7568,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
 		return NODE_RECLAIM_NOSCAN;
 
-	ret = __node_reclaim(pgdat, gfp_mask, order);
+	ret = __node_reclaim(pgdat, gfp_mask, nr_pages, &sc) >= nr_pages;
 	clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
 
 	if (ret)
@@ -7617,3 +7625,95 @@ void check_move_unevictable_folios(struct folio_batch *fbatch)
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_folios);
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+
+enum {
+	MEMORY_RECLAIM_SWAPPINESS = 0,
+	MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t tokens = {
+	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
+static ssize_t reclaim_store(struct device *dev,
+			     struct device_attribute *attr,
+			     const char *buf, size_t count)
+{
+	int nid = dev->id;
+	gfp_t gfp_mask = GFP_KERNEL;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	unsigned long nr_to_reclaim, nr_reclaimed = 0;
+	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
+	int swappiness = -1;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
+	struct scan_control sc = {
+		.gfp_mask = current_gfp_context(gfp_mask),
+		.reclaim_idx = gfp_zone(gfp_mask),
+		.priority = DEF_PRIORITY,
+		.may_writepage = !laptop_mode,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.proactive = 1,
+	};
+
+	buf = strstrip((char *)buf);
+
+	old_buf = (char *)buf;
+	nr_to_reclaim = memparse(buf, (char **)&buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip((char *)buf);
+
+	while ((start = strsep((char **)&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		switch (match_token(start, tokens, args)) {
+		case MEMORY_RECLAIM_SWAPPINESS:
+			if (match_int(&args[0], &swappiness))
+				return -EINVAL;
+			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+
+	/* swappiness == -1 means "not specified": fall back to vm_swappiness */
+	sc.proactive_swappiness = swappiness == -1 ? NULL : &swappiness;
+	sc.nr_to_reclaim = max(nr_to_reclaim, SWAP_CLUSTER_MAX);
+	while (nr_reclaimed < nr_to_reclaim) {
+		unsigned long reclaimed;
+
+		if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
+			return -EAGAIN;
+
+		/* does cond_resched() */
+		reclaimed = __node_reclaim(pgdat, gfp_mask,
+					   nr_to_reclaim - nr_reclaimed, &sc);
+
+		clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
+
+		if (!reclaimed && !nr_retries--)
+			break;
+
+		nr_reclaimed += reclaimed;
+	}
+
+	return nr_reclaimed < nr_to_reclaim ? -EAGAIN : count;
+}
+
+static DEVICE_ATTR_WO(reclaim);
+int reclaim_register_node(struct node *node)
+{
+	return device_create_file(&node->dev, &dev_attr_reclaim);
+}
+
+void reclaim_unregister_node(struct node *node)
+{
+	return device_remove_file(&node->dev, &dev_attr_reclaim);
+}
+#endif
-- 
2.39.2