On Thu, Nov 2, 2023 at 1:28 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote: > > On Thu, Nov 2, 2023 at 1:02 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote: > > > > During our experiment with zswap, we sometimes observe swap IOs due to > > occasional zswap store failures and writebacks-to-swap. These swapping > > IOs prevent many users who cannot tolerate swapping from adopting zswap > > to save memory and improve performance where possible. > > > > This patch adds the option to disable this behavior entirely: do not > > writeback to backing swapping device when a zswap store attempt fail, > > and do not write pages in the zswap pool back to the backing swap > > device (both when the pool is full, and when the new zswap shrinker is > > called). > > > > This new behavior can be opted-in/out on a per-cgroup basis via a new > > cgroup file. By default, writebacks to swap device is enabled, which is > > the previous behavior. > > > > Note that this is subtly different from setting memory.swap.max to 0, as > > it still allows for pages to be stored in the zswap pool (which itself > > consumes swap space in its current form). > > > > Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx> > > Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx> > > --- > > Documentation/admin-guide/cgroup-v2.rst | 11 +++++++ > > Documentation/admin-guide/mm/zswap.rst | 6 ++++ > > include/linux/memcontrol.h | 17 +++++++++++ > > mm/memcontrol.c | 38 +++++++++++++++++++++++++ > > mm/page_io.c | 6 ++++ > > mm/shmem.c | 3 +- > > mm/zswap.c | 9 ++++++ > > 7 files changed, 88 insertions(+), 2 deletions(-) > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > > index 606b2e0eac4b..18c4171392ea 100644 > > --- a/Documentation/admin-guide/cgroup-v2.rst > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > @@ -1672,6 +1672,17 @@ PAGE_SIZE multiple when read back. > > limit, it will refuse to take any more stores before existing > > entries fault back in or are written out to disk. > > > > + memory.zswap.writeback > > + A read-write single value file which exists on non-root > > + cgroups. The default value is "1". > > + > > + When this is set to 0, all swapping attempts to swapping devices > > + are disabled. This included both zswap writebacks, and swapping due > > + to zswap store failure. > > + > > + Note that this is subtly different from setting memory.swap.max to > > + 0, as it still allows for pages to be written to the zswap pool. > > + > > memory.pressure > > A read-only nested-keyed file. > > > > diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst > > index 522ae22ccb84..b987e58edb70 100644 > > --- a/Documentation/admin-guide/mm/zswap.rst > > +++ b/Documentation/admin-guide/mm/zswap.rst > > @@ -153,6 +153,12 @@ attribute, e. g.:: > > > > Setting this parameter to 100 will disable the hysteresis. > > > > +Some users cannot tolerate the swapping that comes with zswap store failures > > +and zswap writebacks. Swapping can be disabled entirely (without disabling > > +zswap itself) on a cgroup-basis as follows: > > + > > + echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback > > + > > When there is a sizable amount of cold memory residing in the zswap pool, it > > can be advantageous to proactively write these cold pages to swap and reclaim > > the memory for other use cases. By default, the zswap shrinker is disabled. > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 95f6c9e60ed1..e3a3a06727dc 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -219,6 +219,12 @@ struct mem_cgroup { > > > > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > > unsigned long zswap_max; > > + > > + /* > > + * Prevent pages from this memcg from being written back from zswap to > > + * swap, and from being swapped out on zswap store failures. > > + */ > > + bool zswap_writeback; > > #endif > > > > unsigned long soft_limit; > > @@ -1615,6 +1621,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, > > { > > return 0; > > } > > + > > +static inline bool mem_cgroup_swap_disk_enabled(struct mem_cgroup *memcg) > > +{ > > + return false; > > +} > > + > > This seems to be a leftover from a prior version. 100%! Thanks for picking that up. Not sure why I didn't see it with my grep and ctrl F. I'll send a fixlet/new version to remove this later. It nominally depends on the shrinker series (which is currently under cleanups - thanks for the suggestions over there too, Yosry) so no rush in merging this in. Just wanna send this out early to hear people's feedback about the implementation + naming. > > > #endif /* CONFIG_MEMCG */ > > > > static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx) > > @@ -1931,6 +1943,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg, > > bool obj_cgroup_may_zswap(struct obj_cgroup *objcg); > > void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size); > > void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size); > > +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg); > > #else > > static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg) > > { > > @@ -1944,6 +1957,10 @@ static inline void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, > > size_t size) > > { > > } > > +static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg) > > +{ > > + return false; > > +} > > #endif > > > > #endif /* _LINUX_MEMCONTROL_H */ > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index e43b5aba8efc..b68c613c23a9 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -5545,6 +5545,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) > > WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX); > > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > > memcg->zswap_max = PAGE_COUNTER_MAX; > > + WRITE_ONCE(memcg->zswap_writeback, true); > > #endif > > page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX); > > if (parent) { > > @@ -8177,6 +8178,12 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size) > > rcu_read_unlock(); > > } > > > > +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg) > > +{ > > + return cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg > > + && READ_ONCE(memcg->zswap_writeback); > > +} > > + > > static u64 zswap_current_read(struct cgroup_subsys_state *css, > > struct cftype *cft) > > { > > @@ -8209,6 +8216,31 @@ static ssize_t zswap_max_write(struct kernfs_open_file *of, > > return nbytes; > > } > > > > +static int zswap_writeback_show(struct seq_file *m, void *v) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); > > + > > + seq_printf(m, "%d\n", READ_ONCE(memcg->zswap_writeback)); > > + return 0; > > +} > > + > > +static ssize_t zswap_writeback_write(struct kernfs_open_file *of, > > + char *buf, size_t nbytes, loff_t off) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > + int zswap_writeback; > > + ssize_t parse_ret = kstrtoint(strstrip(buf), 0, &zswap_writeback); > > + > > + if (parse_ret) > > + return parse_ret; > > + > > + if (zswap_writeback != 0 && zswap_writeback != 1) > > + return -EINVAL; > > + > > + WRITE_ONCE(memcg->zswap_writeback, zswap_writeback); > > + return nbytes; > > +} > > + > > static struct cftype zswap_files[] = { > > { > > .name = "zswap.current", > > @@ -8221,6 +8253,12 @@ static struct cftype zswap_files[] = { > > .seq_show = zswap_max_show, > > .write = zswap_max_write, > > }, > > + { > > + .name = "zswap.writeback", > > + .flags = CFTYPE_NOT_ON_ROOT, > > + .seq_show = zswap_writeback_show, > > + .write = zswap_writeback_write, > > + }, > > { } /* terminate */ > > }; > > #endif /* CONFIG_MEMCG_KMEM && CONFIG_ZSWAP */ > > diff --git a/mm/page_io.c b/mm/page_io.c > > index cb559ae324c6..5e606f1aa2f6 100644 > > --- a/mm/page_io.c > > +++ b/mm/page_io.c > > @@ -201,6 +201,12 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) > > folio_end_writeback(folio); > > return 0; > > } > > + > > + if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) { > > + folio_mark_dirty(folio); > > + return AOP_WRITEPAGE_ACTIVATE; > > + } > > + > > I am not a fan of this, because it will disable using disk swap if > "zswap_writeback" is disabled, even if zswap is disabled or the page > was never in zswap. The term zswap_writeback makes no sense here tbh. > > I am still hoping someone else will suggest better semantics, because > honestly I can't think of anything. Perhaps something like > memory.swap.zswap_only or memory.swap.types which accepts a string > (e.g. "zswap"/"all",..). > > Don't take my suggestions strongly because I am not very fond of them. > > Can anyone else come back with better naming/semantics for "use zswap > but nothing else when swapping"? I discussed this a little bit with Johannes, and decided to go with memory.zswap.writeback because zswap store failure is quite rare in practice (and will become even rarer once I have the time to get the storing compressed objects to happen). But still, I'm happy to hear any other naming suggestions! This (+ the shrinker dependency) is why I send this as an RFC anyway. > > > __swap_writepage(&folio->page, wbc); > > return 0; > > } > > diff --git a/mm/shmem.c b/mm/shmem.c > > index cab053831fea..e5044678de8b 100644 > > --- a/mm/shmem.c > > +++ b/mm/shmem.c > > @@ -1514,8 +1514,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) > > > > mutex_unlock(&shmem_swaplist_mutex); > > BUG_ON(folio_mapped(folio)); > > - swap_writepage(&folio->page, wbc); > > - return 0; > > + return swap_writepage(&folio->page, wbc); > > } > > > > mutex_unlock(&shmem_swaplist_mutex); > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 260e01180ee0..42a478d1a21f 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -590,6 +590,9 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, > > struct zswap_pool *pool = shrinker->private_data; > > bool encountered_page_in_swapcache = false; > > > > + if (!mem_cgroup_zswap_writeback_enabled(sc->memcg)) > > + return SHRINK_STOP; > > + > > nr_protected = > > atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected); > > lru_size = list_lru_shrink_count(&pool->list_lru, sc); > > @@ -620,6 +623,9 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker, > > struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid)); > > unsigned long nr_backing, nr_stored, nr_freeable, nr_protected; > > > > + if (!mem_cgroup_zswap_writeback_enabled(memcg)) > > + return 0; > > + > > #ifdef CONFIG_MEMCG_KMEM > > cgroup_rstat_flush(memcg->css.cgroup); > > nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT; > > @@ -935,6 +941,9 @@ static int shrink_memcg(struct mem_cgroup *memcg) > > struct zswap_pool *pool; > > int nid, shrunk = 0; > > > > + if (!mem_cgroup_zswap_writeback_enabled(memcg)) > > + return -EINVAL; > > + > > /* > > * Skip zombies because their LRUs are reparented and we would be > > * reclaiming from the parent instead of the dead memcg. > > -- > > 2.34.1