On Thu, Nov 2, 2023 at 4:42 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote: > > During our experiment with zswap, we sometimes observe swap IOs due to > occasional zswap store failures and writebacks-to-swap. These swapping > IOs prevent many users who cannot tolerate swapping from adopting zswap > to save memory and improve performance where possible. > > This patch adds the option to disable this behavior entirely: do not > writeback to backing swapping device when a zswap store attempt fail, > and do not write pages in the zswap pool back to the backing swap > device (both when the pool is full, and when the new zswap shrinker is > called). > > This new behavior can be opted-in/out on a per-cgroup basis via a new > cgroup file. By default, writebacks to swap device is enabled, which is > the previous behavior. > > Note that this is subtly different from setting memory.swap.max to 0, as > it still allows for pages to be stored in the zswap pool (which itself > consumes swap space in its current form). > > Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx> > Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx> > --- > Documentation/admin-guide/cgroup-v2.rst | 11 +++++++ > Documentation/admin-guide/mm/zswap.rst | 6 ++++ > include/linux/memcontrol.h | 12 ++++++++ > include/linux/zswap.h | 6 ++++ > mm/memcontrol.c | 38 +++++++++++++++++++++++++ > mm/page_io.c | 6 ++++ > mm/shmem.c | 3 +- > mm/zswap.c | 14 +++++++++ > 8 files changed, 94 insertions(+), 2 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index 606b2e0eac4b..18c4171392ea 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1672,6 +1672,17 @@ PAGE_SIZE multiple when read back. > limit, it will refuse to take any more stores before existing > entries fault back in or are written out to disk. > > + memory.zswap.writeback > + A read-write single value file which exists on non-root > + cgroups. The default value is "1". > + > + When this is set to 0, all swapping attempts to swapping devices > + are disabled. This included both zswap writebacks, and swapping due > + to zswap store failure. > + > + Note that this is subtly different from setting memory.swap.max to > + 0, as it still allows for pages to be written to the zswap pool. > + > memory.pressure > A read-only nested-keyed file. > > diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst > index 522ae22ccb84..b987e58edb70 100644 > --- a/Documentation/admin-guide/mm/zswap.rst > +++ b/Documentation/admin-guide/mm/zswap.rst > @@ -153,6 +153,12 @@ attribute, e. g.:: > > Setting this parameter to 100 will disable the hysteresis. > > +Some users cannot tolerate the swapping that comes with zswap store failures > +and zswap writebacks. Swapping can be disabled entirely (without disabling > +zswap itself) on a cgroup-basis as follows: > + > + echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback > + > When there is a sizable amount of cold memory residing in the zswap pool, it > can be advantageous to proactively write these cold pages to swap and reclaim > the memory for other use cases. By default, the zswap shrinker is disabled. > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 95f6c9e60ed1..e51eafdf2a15 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -219,6 +219,12 @@ struct mem_cgroup { > > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > unsigned long zswap_max; > + > + /* > + * Prevent pages from this memcg from being written back from zswap to > + * swap, and from being swapped out on zswap store failures. > + */ > + bool zswap_writeback; > #endif > > unsigned long soft_limit; > @@ -1931,6 +1937,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg, > bool obj_cgroup_may_zswap(struct obj_cgroup *objcg); > void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size); > void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size); > +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg); > #else > static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg) > { > @@ -1944,6 +1951,11 @@ static inline void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, > size_t size) > { > } > +static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg) > +{ > + /* if zswap is disabled, do not block pages going to the swapping device */ > + return true; > +} > #endif > > #endif /* _LINUX_MEMCONTROL_H */ > diff --git a/include/linux/zswap.h b/include/linux/zswap.h > index cbd373ba88d2..b4997e27a74b 100644 > --- a/include/linux/zswap.h > +++ b/include/linux/zswap.h > @@ -35,6 +35,7 @@ void zswap_swapoff(int type); > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg); > void zswap_lruvec_state_init(struct lruvec *lruvec); > void zswap_lruvec_swapin(struct page *page); > +bool is_zswap_enabled(void); > #else > > struct zswap_lruvec_state {}; > @@ -55,6 +56,11 @@ static inline void zswap_swapoff(int type) {} > static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {} > static inline void zswap_lruvec_init(struct lruvec *lruvec) {} > static inline void zswap_lruvec_swapin(struct page *page) {} > + > +static inline bool is_zswap_enabled(void) > +{ > + return false; > +} > #endif > > #endif /* _LINUX_ZSWAP_H */ > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index e43b5aba8efc..8a6aadcc103c 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5545,6 +5545,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) > WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX); > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > memcg->zswap_max = PAGE_COUNTER_MAX; > + WRITE_ONCE(memcg->zswap_writeback, true); Generally LGTM, just one question. Would it be more convenient if the initial value is inherited from the parent (the root starts with true)? I can see this being useful if we want to set it to false on the entire machine or one a parent cgroup, we can set it before creating any children instead of setting it to 0 every time we create a new cgroup.