Re: [nacked] mm-add-swapiness=-arg-to-memoryreclaim.patch removed from -mm tree

On Tue 25-06-24 14:03:33, Andrew Morton wrote:
> 
> The quilt patch titled
>      Subject: mm: add swappiness= arg to memory.reclaim
> has been removed from the -mm tree.  Its filename was
>      mm-add-swapiness=-arg-to-memoryreclaim.patch
> 
> This patch was dropped because it was nacked

I do not see this being nacked in the original email thread.

> ------------------------------------------------------
> From: Dan Schatzberg <schatzberg.dan@xxxxxxxxx>
> Subject: mm: add swappiness= arg to memory.reclaim
> Date: Wed, 3 Jan 2024 08:48:37 -0800
> 
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim.  This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
> 
> For example:
> 
> echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
> 
> will perform reclaim on the rootcg with a swappiness setting of 0 (no
> swap) regardless of the vm.swappiness sysctl setting.
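> 
> For illustration, the same request can be driven from C by a
> proactive reclaimer.  A minimal sketch (the cgroup path and request
> size here are made-up examples, not part of the patch):
> 
> 	#include <fcntl.h>
> 	#include <stdio.h>
> 	#include <string.h>
> 	#include <unistd.h>
> 
> 	int main(void)
> 	{
> 		/* Example path; point this at the target cgroup. */
> 		const char *path = "/sys/fs/cgroup/workload/memory.reclaim";
> 		/* Reclaim 256M, file-backed memory only (swappiness=0). */
> 		const char req[] = "256M swappiness=0";
> 		int fd = open(path, O_WRONLY);
> 
> 		if (fd < 0) {
> 			perror("open");
> 			return 1;
> 		}
> 		/* errno EAGAIN: the kernel reclaimed less than requested;
> 		 * errno EINVAL: malformed request string. */
> 		if (write(fd, req, strlen(req)) < 0)
> 			perror("write");
> 		close(fd);
> 		return 0;
> 	}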
> 
> Userspace proactive reclaimers use the memory.reclaim interface to trigger
> reclaim.  The memory.reclaim interface does not allow for any way to
> affect the balance of file vs anon during proactive reclaim.  The only
> approach is to adjust the vm.swappiness setting.  However, there are a few
> reasons we look to control the balance of file vs anon during proactive
> reclaim, separately from reactive reclaim:
> 
> * Swapout should be limited to manage SSD write endurance.  In near-OOM
>   situations we are fine with lots of swap-out to avoid OOMs.  As these
>   are typically rare events, they have relatively little impact on write
>   endurance.  However, proactive reclaim runs continuously and so its
>   impact on SSD write endurance is more significant.  Therefore it is
>   desirable to control swap-out for proactive reclaim separately from
>   reactive reclaim.
> 
> * Some userspace OOM killers like systemd-oomd[1] support OOM killing on
>   swap exhaustion.  This makes sense if the swap exhaustion is triggered
>   due to reactive reclaim but less so if it is triggered due to proactive
>   reclaim (e.g.  one could see OOMs when free memory is ample but anon is
>   just particularly cold).  Therefore, it's desirable to have proactive
>   reclaim reduce or stop swap-out before the threshold at which OOM
>   killing occurs.
> 
> In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness
> before writes to memory.reclaim[2].  This has been in production for
> nearly two years and has addressed our needs to control proactive vs
> reactive reclaim behavior but is still not ideal for a number of reasons:
> 
> * vm.swappiness is a global setting, adjusting it can race/interfere
>   with other system administration that wishes to control vm.swappiness. 
>   In our case, we need to disable Senpai before adjusting vm.swappiness.
> 
> * vm.swappiness is stateful, so a crash or restart of Senpai can leave
>   a misconfigured setting.  This requires some additional management to
>   record the "desired" setting and ensure Senpai always adjusts to it.
> 
> With this patch, we avoid these downsides of adjusting vm.swappiness
> globally.
> 
> [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
> 
> Link: https://lkml.kernel.org/r/20240103164841.2800183-3-schatzberg.dan@xxxxxxxxx
> Signed-off-by: Dan Schatzberg <schatzberg.dan@xxxxxxxxx>
> Suggested-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Acked-by: Michal Hocko <mhocko@xxxxxxxx>
> Acked-by: David Rientjes <rientjes@xxxxxxxxxx>
> Acked-by: Chris Li <chrisl@xxxxxxxxxx>
> Cc: David Hildenbrand <david@xxxxxxxxxx>
> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Jonathan Corbet <corbet@xxxxxxx>
> Cc: Kefeng Wang <wangkefeng.wang@xxxxxxxxxx>
> Cc: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> Cc: Muchun Song <muchun.song@xxxxxxxxx>
> Cc: Roman Gushchin <roman.gushchin@xxxxxxxxx>
> Cc: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
> Cc: Tejun Heo <tj@xxxxxxxxxx>
> Cc: Yue Zhao <findns94@xxxxxxxxx>
> Cc: Zefan Li <lizefan.x@xxxxxxxxxxxxx>
> Cc: Nhat Pham <nphamcs@xxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> ---
> 
>  Documentation/admin-guide/cgroup-v2.rst |   18 ++++--
>  include/linux/swap.h                    |    3 -
>  mm/memcontrol.c                         |   57 +++++++++++++++++-----
>  mm/vmscan.c                             |   25 ++++++++-
>  4 files changed, 81 insertions(+), 22 deletions(-)
> 
> --- a/Documentation/admin-guide/cgroup-v2.rst~mm-add-swapiness=-arg-to-memoryreclaim
> +++ a/Documentation/admin-guide/cgroup-v2.rst
> @@ -1299,17 +1299,10 @@ PAGE_SIZE multiple when read back.
>  	This is a simple interface to trigger memory reclaim in the
>  	target cgroup.
>  
> -	This file accepts a single key, the number of bytes to reclaim.
> -	No nested keys are currently supported.
> -
>  	Example::
>  
>  	  echo "1G" > memory.reclaim
>  
> -	The interface can be later extended with nested keys to
> -	configure the reclaim behavior. For example, specify the
> -	type of memory to reclaim from (anon, file, ..).
> -
>  	Please note that the kernel can over or under reclaim from
>  	the target cgroup. If less bytes are reclaimed than the
>  	specified amount, -EAGAIN is returned.
> @@ -1321,6 +1314,17 @@ PAGE_SIZE multiple when read back.
>  	This means that the networking layer will not adapt based on
>  	reclaim induced by memory.reclaim.
>  
> +The following nested keys are defined.
> +
> +	  ==========            ================================
> +	  swappiness            Swappiness value to reclaim with
> +	  ==========            ================================
> +
> +	Specifying a swappiness value instructs the kernel to perform
> +	the reclaim with that swappiness value. Note that this has the
> +	same semantics as vm.swappiness applied to memcg reclaim with
> +	all the existing limitations and potential future extensions.
> +
>    memory.peak
>  	A read-only single value file which exists on non-root
>  	cgroups.
> --- a/include/linux/swap.h~mm-add-swapiness=-arg-to-memoryreclaim
> +++ a/include/linux/swap.h
> @@ -410,7 +410,8 @@ extern unsigned long try_to_free_pages(s
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  						  unsigned long nr_pages,
>  						  gfp_t gfp_mask,
> -						  unsigned int reclaim_options);
> +						  unsigned int reclaim_options,
> +						  int *swappiness);
>  extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>  						gfp_t gfp_mask, bool noswap,
>  						pg_data_t *pgdat,
> --- a/mm/memcontrol.c~mm-add-swapiness=-arg-to-memoryreclaim
> +++ a/mm/memcontrol.c
> @@ -53,6 +53,7 @@
>  #include <linux/sort.h>
>  #include <linux/fs.h>
>  #include <linux/seq_file.h>
> +#include <linux/parser.h>
>  #include <linux/vmpressure.h>
>  #include <linux/memremap.h>
>  #include <linux/mm_inline.h>
> @@ -2636,7 +2637,8 @@ static unsigned long reclaim_high(struct
>  		psi_memstall_enter(&pflags);
>  		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
>  							gfp_mask,
> -							MEMCG_RECLAIM_MAY_SWAP);
> +							MEMCG_RECLAIM_MAY_SWAP,
> +							NULL);
>  		psi_memstall_leave(&pflags);
>  	} while ((memcg = parent_mem_cgroup(memcg)) &&
>  		 !mem_cgroup_is_root(memcg));
> @@ -2942,7 +2944,7 @@ retry:
>  
>  	psi_memstall_enter(&pflags);
>  	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> -						    gfp_mask, reclaim_options);
> +						    gfp_mask, reclaim_options, NULL);
>  	psi_memstall_leave(&pflags);
>  
>  	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> @@ -3911,7 +3913,7 @@ static int mem_cgroup_resize_max(struct
>  		}
>  
>  		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
> +					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
>  			ret = -EBUSY;
>  			break;
>  		}
> @@ -4025,7 +4027,7 @@ static int mem_cgroup_force_empty(struct
>  			return -EINTR;
>  
>  		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -						  MEMCG_RECLAIM_MAY_SWAP))
> +						  MEMCG_RECLAIM_MAY_SWAP, NULL))
>  			nr_retries--;
>  	}
>  
> @@ -7000,7 +7002,7 @@ static ssize_t memory_high_write(struct
>  		}
>  
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
> +					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
>  
>  		if (!reclaimed && !nr_retries--)
>  			break;
> @@ -7049,7 +7051,7 @@ static ssize_t memory_max_write(struct k
>  
>  		if (nr_reclaims) {
>  			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
> -					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
> +					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
>  				nr_reclaims--;
>  			continue;
>  		}
> @@ -7179,19 +7181,50 @@ static ssize_t memory_oom_group_write(st
>  	return nbytes;
>  }
>  
> +enum {
> +	MEMORY_RECLAIM_SWAPPINESS = 0,
> +	MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t tokens = {
> +	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> +	{ MEMORY_RECLAIM_NULL, NULL },
> +};
> +
>  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>  			      size_t nbytes, loff_t off)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
>  	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>  	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	int swappiness = -1;
>  	unsigned int reclaim_options;
> -	int err;
> +	char *old_buf, *start;
> +	substring_t args[MAX_OPT_ARGS];
>  
>  	buf = strstrip(buf);
> -	err = page_counter_memparse(buf, "", &nr_to_reclaim);
> -	if (err)
> -		return err;
> +
> +	old_buf = buf;
> +	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> +	if (buf == old_buf)
> +		return -EINVAL;
> +
> +	buf = strstrip(buf);
> +
> +	while ((start = strsep(&buf, " ")) != NULL) {
> +		if (!strlen(start))
> +			continue;
> +		switch (match_token(start, tokens, args)) {
> +		case MEMORY_RECLAIM_SWAPPINESS:
> +			if (match_int(&args[0], &swappiness))
> +				return -EINVAL;
> +			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
> +				return -EINVAL;
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +	}
>  
>  	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
>  	while (nr_reclaimed < nr_to_reclaim) {
> @@ -7211,7 +7244,9 @@ static ssize_t memory_reclaim(struct ker
>  			lru_add_drain_all();
>  
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -					batch_size, GFP_KERNEL, reclaim_options);
> +					batch_size, GFP_KERNEL,
> +					reclaim_options,
> +					swappiness == -1 ? NULL : &swappiness);
>  
>  		if (!reclaimed && !nr_retries--)
>  			return -EAGAIN;
> --- a/mm/vmscan.c~mm-add-swapiness=-arg-to-memoryreclaim
> +++ a/mm/vmscan.c
> @@ -92,6 +92,11 @@ struct scan_control {
>  	unsigned long	anon_cost;
>  	unsigned long	file_cost;
>  
> +#ifdef CONFIG_MEMCG
> +	/* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> +	int *proactive_swappiness;
> +#endif
> +
>  	/* Can active folios be deactivated as part of reclaim? */
>  #define DEACTIVATE_ANON 1
>  #define DEACTIVATE_FILE 2
> @@ -236,6 +241,13 @@ static bool writeback_throttling_sane(st
>  #endif
>  	return false;
>  }
> +
> +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> +{
> +	if (sc->proactive && sc->proactive_swappiness)
> +		return *sc->proactive_swappiness;
> +	return mem_cgroup_swappiness(memcg);
> +}
>  #else
>  static bool cgroup_reclaim(struct scan_control *sc)
>  {
> @@ -251,6 +263,11 @@ static bool writeback_throttling_sane(st
>  {
>  	return true;
>  }
> +
> +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> +{
> +	return READ_ONCE(vm_swappiness);
> +}
>  #endif
>  
>  static void set_task_reclaim_state(struct task_struct *task,
> @@ -2351,7 +2368,7 @@ static void get_scan_count(struct lruvec
>  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>  	unsigned long anon_cost, file_cost, total_cost;
> -	int swappiness = mem_cgroup_swappiness(memcg);
> +	int swappiness = sc_swappiness(sc, memcg);
>  	u64 fraction[ANON_AND_FILE];
>  	u64 denominator = 0;	/* gcc */
>  	enum scan_balance scan_balance;
> @@ -2632,7 +2649,7 @@ static int get_swappiness(struct lruvec
>  	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
>  		return 0;
>  
> -	return mem_cgroup_swappiness(memcg);
> +	return sc_swappiness(sc, memcg);
>  }
>  
>  static int get_nr_gens(struct lruvec *lruvec, int type)
> @@ -6549,12 +6566,14 @@ unsigned long mem_cgroup_shrink_node(str
>  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  					   unsigned long nr_pages,
>  					   gfp_t gfp_mask,
> -					   unsigned int reclaim_options)
> +					   unsigned int reclaim_options,
> +					   int *swappiness)
>  {
>  	unsigned long nr_reclaimed;
>  	unsigned int noreclaim_flag;
>  	struct scan_control sc = {
>  		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> +		.proactive_swappiness = swappiness,
>  		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
>  				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
>  		.reclaim_idx = MAX_NR_ZONES - 1,
> _
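> 
> For reference, the request grammar memory_reclaim() accepts above is
> "<size> [swappiness=<n>]".  A userspace validator mirroring the
> match_token()/strsep() parsing might look like the sketch below; the
> 0..200 bounds stand in for MIN_SWAPPINESS/MAX_SWAPPINESS, which are
> defined elsewhere in this series, and lowercase or larger size
> suffixes are omitted for brevity:
> 
> 	#include <stdbool.h>
> 	#include <stdio.h>
> 	#include <stdlib.h>
> 	#include <string.h>
> 
> 	/* Modifies 'req' in place, as strsep() does in the kernel. */
> 	static bool valid_reclaim_request(char *req)
> 	{
> 		char *end, *tok;
> 
> 		/* Leading byte count, optional suffix, as memparse(). */
> 		(void)strtoull(req, &end, 0);
> 		if (end == req)
> 			return false;
> 		if (*end == 'K' || *end == 'M' || *end == 'G')
> 			end++;
> 
> 		/* The only nested key defined is "swappiness=<0..200>". */
> 		for (tok = strtok(end, " "); tok; tok = strtok(NULL, " ")) {
> 			int val;
> 			char extra;
> 
> 			if (sscanf(tok, "swappiness=%d%c", &val, &extra) != 1)
> 				return false;
> 			if (val < 0 || val > 200)
> 				return false;
> 		}
> 		return true;
> 	}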
> 
> Patches currently in -mm which might be from schatzberg.dan@xxxxxxxxx are
> 
> 

-- 
Michal Hocko
SUSE Labs



