On Wed, Jan 3, 2024 at 9:49 AM Dan Schatzberg <schatzberg.dan@xxxxxxxxx> wrote: > > Allow proactive reclaimers to submit an additional swappiness=<val> > argument to memory.reclaim. This overrides the global or per-memcg > swappiness setting for that reclaim attempt. > > For example: > > echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim > > will perform reclaim on the rootcg with a swappiness setting of 0 (no > swap) regardless of the vm.swappiness sysctl setting. > > Userspace proactive reclaimers use the memory.reclaim interface to > trigger reclaim. The memory.reclaim interface does not allow for any way > to effect the balance of file vs anon during proactive reclaim. The only > approach is to adjust the vm.swappiness setting. However, there are a > few reasons we look to control the balance of file vs anon during > proactive reclaim, separately from reactive reclaim: > > * Swapout should be limited to manage SSD write endurance. In near-OOM > situations we are fine with lots of swap-out to avoid OOMs. As these are > typically rare events, they have relatively little impact on write > endurance. However, proactive reclaim runs continuously and so its > impact on SSD write endurance is more significant. Therefore it is > desireable to control swap-out for proactive reclaim separately from > reactive reclaim > > * Some userspace OOM killers like systemd-oomd[1] support OOM killing on > swap exhaustion. This makes sense if the swap exhaustion is triggered > due to reactive reclaim but less so if it is triggered due to proactive > reclaim (e.g. one could see OOMs when free memory is ample but anon is > just particularly cold). Therefore, it's desireable to have proactive > reclaim reduce or stop swap-out before the threshold at which OOM > killing occurs. > > In the case of Meta's Senpai proactive reclaimer, we adjust > vm.swappiness before writes to memory.reclaim[2]. This has been in > production for nearly two years and has addressed our needs to control > proactive vs reactive reclaim behavior but is still not ideal for a > number of reasons: > > * vm.swappiness is a global setting, adjusting it can race/interfere > with other system administration that wishes to control vm.swappiness. > In our case, we need to disable Senpai before adjusting vm.swappiness. > > * vm.swappiness is stateful - so a crash or restart of Senpai can leave > a misconfigured setting. This requires some additional management to > record the "desired" setting and ensure Senpai always adjusts to it. > > With this patch, we avoid these downsides of adjusting vm.swappiness > globally. > > [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html > [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598 > > Signed-off-by: Dan Schatzberg <schatzberg.dan@xxxxxxxxx> > Suggested-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx> > Acked-by: Michal Hocko <mhocko@xxxxxxxx> > Acked-by: David Rientjes <rientjes@xxxxxxxxxx> > Acked-by: Chris Li <chrisl@xxxxxxxxxx> > --- > Documentation/admin-guide/cgroup-v2.rst | 18 ++++---- > include/linux/swap.h | 3 +- > mm/memcontrol.c | 56 ++++++++++++++++++++----- > mm/vmscan.c | 25 +++++++++-- > 4 files changed, 80 insertions(+), 22 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index 3f85254f3cef..ee42f74e0765 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1282,17 +1282,10 @@ PAGE_SIZE multiple when read back. > This is a simple interface to trigger memory reclaim in the > target cgroup. > > - This file accepts a single key, the number of bytes to reclaim. > - No nested keys are currently supported. > - > Example:: > > echo "1G" > memory.reclaim > > - The interface can be later extended with nested keys to > - configure the reclaim behavior. For example, specify the > - type of memory to reclaim from (anon, file, ..). > - > Please note that the kernel can over or under reclaim from > the target cgroup. If less bytes are reclaimed than the > specified amount, -EAGAIN is returned. > @@ -1304,6 +1297,17 @@ PAGE_SIZE multiple when read back. > This means that the networking layer will not adapt based on > reclaim induced by memory.reclaim. > > +The following nested keys are defined. > + > + ========== ================================ > + swappiness Swappiness value to reclaim with > + ========== ================================ > + > + Specifying a swappiness value instructs the kernel to perform > + the reclaim with that swappiness value. Note that this has the > + same semantics as vm.swappiness applied to memcg reclaim with > + all the existing limitations and potential future extensions. > + > memory.peak > A read-only single value file which exists on non-root > cgroups. > diff --git a/include/linux/swap.h b/include/linux/swap.h > index e2ab76c25b4a..8afdec40efe3 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -412,7 +412,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > unsigned long nr_pages, > gfp_t gfp_mask, > - unsigned int reclaim_options); > + unsigned int reclaim_options, > + int *swappiness); > extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, > gfp_t gfp_mask, bool noswap, > pg_data_t *pgdat, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index fbe9f02dd206..6d627a754851 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -52,6 +52,7 @@ > #include <linux/sort.h> > #include <linux/fs.h> > #include <linux/seq_file.h> > +#include <linux/parser.h> > #include <linux/vmpressure.h> > #include <linux/memremap.h> > #include <linux/mm_inline.h> > @@ -2449,7 +2450,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg, > psi_memstall_enter(&pflags); > nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, > gfp_mask, > - MEMCG_RECLAIM_MAY_SWAP); > + MEMCG_RECLAIM_MAY_SWAP, > + NULL); > psi_memstall_leave(&pflags); > } while ((memcg = parent_mem_cgroup(memcg)) && > !mem_cgroup_is_root(memcg)); > @@ -2740,7 +2742,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > > psi_memstall_enter(&pflags); > nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, > - gfp_mask, reclaim_options); > + gfp_mask, reclaim_options, NULL); > psi_memstall_leave(&pflags); > > if (mem_cgroup_margin(mem_over_limit) >= nr_pages) > @@ -3660,7 +3662,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg, > } > > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, > - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) { > + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) { > ret = -EBUSY; > break; > } > @@ -3774,7 +3776,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) > return -EINTR; > > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, > - MEMCG_RECLAIM_MAY_SWAP)) > + MEMCG_RECLAIM_MAY_SWAP, NULL)) > nr_retries--; > } > > @@ -6720,7 +6722,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > } > > reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP); > + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL); > > if (!reclaimed && !nr_retries--) > break; > @@ -6769,7 +6771,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, > > if (nr_reclaims) { > if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, > - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP)) > + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL)) > nr_reclaims--; > continue; > } > @@ -6895,19 +6897,50 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > return nbytes; > } > > +enum { > + MEMORY_RECLAIM_SWAPPINESS = 0, > + MEMORY_RECLAIM_NULL, > +}; > + > +static const match_table_t tokens = { > + { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"}, > + { MEMORY_RECLAIM_NULL, NULL }, > +}; > + > static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > size_t nbytes, loff_t off) > { > struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > unsigned int nr_retries = MAX_RECLAIM_RETRIES; > unsigned long nr_to_reclaim, nr_reclaimed = 0; > + int swappiness = -1; > unsigned int reclaim_options; > - int err; > + char *old_buf, *start; > + substring_t args[MAX_OPT_ARGS]; > > buf = strstrip(buf); > - err = page_counter_memparse(buf, "", &nr_to_reclaim); > - if (err) > - return err; > + > + old_buf = buf; > + nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE; > + if (buf == old_buf) > + return -EINVAL; > + > + buf = strstrip(buf); > + > + while ((start = strsep(&buf, " ")) != NULL) { > + if (!strlen(start)) > + continue; > + switch (match_token(start, tokens, args)) { > + case MEMORY_RECLAIM_SWAPPINESS: > + if (match_int(&args[0], &swappiness)) > + return -EINVAL; > + if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS) > + return -EINVAL; > + break; > + default: > + return -EINVAL; > + } > + } > > reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; > while (nr_reclaimed < nr_to_reclaim) { > @@ -6926,7 +6959,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > reclaimed = try_to_free_mem_cgroup_pages(memcg, > min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX), > - GFP_KERNEL, reclaim_options); > + GFP_KERNEL, reclaim_options, > + swappiness == -1 ? NULL : &swappiness); > > if (!reclaimed && !nr_retries--) > return -EAGAIN; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index d91963e2d47f..394e0dd46b2e 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -92,6 +92,11 @@ struct scan_control { > unsigned long anon_cost; > unsigned long file_cost; > > +#ifdef CONFIG_MEMCG > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */ > + int *proactive_swappiness; > +#endif Why is proactive_swappiness still a pointer? The whole point of the previous conversation is that sc->proactive can tell whether sc->swappiness is valid or not, and that's less awkward than using a pointer. Also why the #ifdef here? I don't see the point for a small stack variable. Otherwise wouldn't we want to do this for sc->proactive as well? If you really want it to be explicit, you could do struct scan_control { ... struct { bool is_set; int swappiness; } proactive; }; But I think even this is too much.