On Wed, Dec 20, 2023 at 8:27 AM Dan Schatzberg <schatzberg.dan@xxxxxxxxx> wrote: > > Allow proactive reclaimers to submit an additional swappiness=<val> > argument to memory.reclaim. This overrides the global or per-memcg > swappiness setting for that reclaim attempt. > > For example: > > echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim > > will perform reclaim on the rootcg with a swappiness setting of 0 (no > swap) regardless of the vm.swappiness sysctl setting. > > Userspace proactive reclaimers use the memory.reclaim interface to > trigger reclaim. The memory.reclaim interface does not allow for any way > to effect the balance of file vs anon during proactive reclaim. The only > approach is to adjust the vm.swappiness setting. However, there are a > few reasons we look to control the balance of file vs anon during > proactive reclaim, separately from reactive reclaim: > > * Swapout should be limited to manage SSD write endurance. In near-OOM > situations we are fine with lots of swap-out to avoid OOMs. As these are > typically rare events, they have relatively little impact on write > endurance. However, proactive reclaim runs continuously and so its > impact on SSD write endurance is more significant. Therefore it is > desireable to control swap-out for proactive reclaim separately from > reactive reclaim > > * Some userspace OOM killers like systemd-oomd[1] support OOM killing on > swap exhaustion. This makes sense if the swap exhaustion is triggered > due to reactive reclaim but less so if it is triggered due to proactive > reclaim (e.g. one could see OOMs when free memory is ample but anon is > just particularly cold). Therefore, it's desireable to have proactive > reclaim reduce or stop swap-out before the threshold at which OOM > killing occurs. > > In the case of Meta's Senpai proactive reclaimer, we adjust > vm.swappiness before writes to memory.reclaim[2]. This has been in > production for nearly two years and has addressed our needs to control > proactive vs reactive reclaim behavior but is still not ideal for a > number of reasons: > > * vm.swappiness is a global setting, adjusting it can race/interfere > with other system administration that wishes to control vm.swappiness. > In our case, we need to disable Senpai before adjusting vm.swappiness. > > * vm.swappiness is stateful - so a crash or restart of Senpai can leave > a misconfigured setting. This requires some additional management to > record the "desired" setting and ensure Senpai always adjusts to it. > > With this patch, we avoid these downsides of adjusting vm.swappiness > globally. > > [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html > [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598 > > Signed-off-by: Dan Schatzberg <schatzberg.dan@xxxxxxxxx> The cover letter says: "Previously, this exact interface addition was proposed by Yosry[3]." So I think it should be acknowledged with a Suggested-by, based on: "A Suggested-by: tag indicates that the patch idea is suggested by the person named and ensures credit to the person for the idea." from https://docs.kernel.org/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes > Documentation/admin-guide/cgroup-v2.rst | 18 ++++---- > include/linux/swap.h | 3 +- > mm/memcontrol.c | 56 ++++++++++++++++++++----- > mm/vmscan.c | 13 +++++- > 4 files changed, 69 insertions(+), 21 deletions(-) ... > diff --git a/mm/vmscan.c b/mm/vmscan.c > index d91963e2d47f..aa5666842c49 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -92,6 +92,9 @@ struct scan_control { > unsigned long anon_cost; > unsigned long file_cost; > > + /* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */ > + int *swappiness; Using a pointer to indicate whether the type it points to is overridden isn't really a good practice. A better alternative was suggested during the v2: "Perhaps the negative to avoid unnecessary dereferences." https://lore.kernel.org/linux-mm/dhhjw4h22q4ngwtxmhuyifv32zjd6z2relrcjgnxsw6zys3mod@o6dh5dy53ae3/ Since only proactive reclaim can override swappiness, meaning it only happens if sc->proactive is true, I think the best way to make it work without spending much effort is create a helper as Michal suggest but it should look like: sc_swappiness() { return sc->proactive ? sc->swappiness : mem_cgroup_swappiness(memcg); } In this patchset, sc->swappiness really means sc->proactive_swappiness. So it should be renamed accordingly.