On Tue, May 30, 2023 at 11:27 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Tue, May 30, 2023 at 9:53 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Tue, May 30, 2023 at 9:22 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > >
> > > Before storing a page, zswap first checks if the number of stored pages
> > > exceeds the limit specified by memory.zswap.max, for each cgroup in the
> > > hierarchy. If this limit is reached or exceeded, zswap shrinking is
> > > triggered and the store attempt is short-circuited.
> > >
> > > However, if memory.zswap.max = 0 for a cgroup, no amount of writeback
> > > will allow future store attempts from processes in this cgroup to
> > > succeed. Furthermore, this creates pathological behavior in a system
> > > where some cgroups have memory.zswap.max = 0 and some do not: the
> > > processes in the former cgroups, under memory pressure, will continually
> > > evict pages stored by the latter, until the need for swap ceases or the
> > > pool becomes empty.
> > >
> > > As a result of this, we observe a disproportionate amount of zswap
> > > writeback and a perpetually small zswap pool in our experiments, even
> > > though the pool limit is never hit.
> > >
> > > This patch fixes the issue by rejecting the zswap store attempt without
> > > shrinking the pool when memory.zswap.max is 0.
> > >
> > > Fixes: f4840ccfca25 ("zswap: memcg accounting")
> > > Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
> > > ---
> > >  include/linux/memcontrol.h | 6 +++---
> > >  mm/memcontrol.c            | 8 ++++----
> > >  mm/zswap.c                 | 9 +++++++--
> > >  3 files changed, 14 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 222d7370134c..507bed3a28b0 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -1899,13 +1899,13 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
> > >  #endif /* CONFIG_MEMCG_KMEM */
> > >
> > >  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > >  void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
> > >  void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> > >  #else
> > > -static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > +static inline int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > >  {
> > > -	return true;
> > > +	return 0;
> > >  }
> > >  static inline void obj_cgroup_charge_zswap(struct obj_cgroup *objcg,
> > >  					   size_t size)
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 4b27e245a055..09aad0e6f2ea 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -7783,10 +7783,10 @@ static struct cftype memsw_files[] = {
> > >   * spending cycles on compression when there is already no room left
> > >   * or zswap is disabled altogether somewhere in the hierarchy.
> > >   */
> > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > >  {
> > >  	struct mem_cgroup *memcg, *original_memcg;
> > > -	bool ret = true;
> > > +	int ret = 0;
> > >
> > >  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > >  		return true;
> > > @@ -7800,7 +7800,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > >  		if (max == PAGE_COUNTER_MAX)
> > >  			continue;
> > >  		if (max == 0) {
> > > -			ret = false;
> > > +			ret = -ENODEV;
> > >  			break;
> > >  		}
> > >
> > > @@ -7808,7 +7808,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > >  		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
> > >  		if (pages < max)
> > >  			continue;
> > > -		ret = false;
> > > +		ret = -ENOMEM;
> > >  		break;
> > >  	}
> > >  	mem_cgroup_put(original_memcg);
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 59da2a415fbb..7b13dc865438 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > >  	}
> > >
> > >  	objcg = get_obj_cgroup_from_page(page);
> > > -	if (objcg && !obj_cgroup_may_zswap(objcg))
> > > -		goto shrink;
> > > +	if (objcg) {
> > > +		ret = obj_cgroup_may_zswap(objcg);
> > > +		if (ret == -ENODEV)
> > > +			goto reject;
> > > +		if (ret == -ENOMEM)
> > > +			goto shrink;
> > > +	}
> >
> > I wonder if we should just make this:
> >
> > if (objcg && !obj_cgroup_may_zswap(objcg))
> > 	goto reject;
> >
> > Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
> > zswap pool will only help if we happen to write back a page from the
> > same memcg that hit its limit. Keep in mind that we will only write
> > back one page every time we observe that the limit is hit (even with
> > Domenico's patch, because zswap_can_accept() should be true).
> >
> > On a system with a handful of memcgs, it seems likely that we
> > wrongfully write back pages from other memcgs because of this,
> > achieving nothing for this memcg while hurting others. OTOH, without
> > invoking writeback when the limit is hit, the memcg will just not be
> > able to use zswap until some pages are faulted back in or invalidated.
> >
> > I am not sure which is better, just thinking out loud.
> >
> > Seems like this can be solved by having per-memcg LRUs, or at least
> > by providing an argument to the shrinker for which memcg to reclaim
> > from. This would only be possible when the LRU is moved to zswap.
>
> I totally agree! This seems like the logical next step in zswap's
> evolution. I actually proposed this fix with this future development in
> mind - with a per-memcg LRU, we can trigger memcg-specific shrinking in
> place of this indiscriminate writeback. It seems like a less drastic
> change than removing shrinking here now, then reintroducing it later.

As I stated in my reply to Johannes, I am just not sure that we will
need to special-case memory.zswap.max == 0 when we have proper
writeback. WDYT?
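
To make the per-memcg idea more concrete, a memcg-aware shrink entry
point could look roughly like the below. This is just a sketch that
assumes the LRU has already been moved into zswap - zswap_memcg_lru(),
the entry's lru field, pool->lru_lock, and this zswap_writeback_entry()
signature are all made up for illustration:

static int zswap_shrink_memcg(struct zswap_pool *pool,
			      struct mem_cgroup *memcg)
{
	/* hypothetical per-(pool, memcg) LRU instead of one global list */
	struct list_head *lru = zswap_memcg_lru(pool, memcg);
	struct zswap_entry *entry;
	int ret = -ENOENT;

	spin_lock(&pool->lru_lock);
	if (!list_empty(lru)) {
		/* evict this memcg's coldest entry, not someone else's */
		entry = list_last_entry(lru, struct zswap_entry, lru);
		ret = zswap_writeback_entry(entry);
	}
	spin_unlock(&pool->lru_lock);
	return ret;
}

The store path could then call zswap_shrink_memcg() when
obj_cgroup_may_zswap() fails with -ENOMEM, so a memcg that hits its own
limit only writes back its own pages.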

>
> Thanks for the feedback, Yosry!
>
> > >
> > >
> > >  	/* reclaim space if needed */
> > >  	if (zswap_is_full()) {
> > > --
> > > 2.34.1
> > >
> > >