Re: [PATCH] zswap: do not shrink when memory.zswap.max is 0

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Tue, 30 May 2023 11:41:32 -0700

On Tue, May 30, 2023 at 11:00 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Tue, May 30, 2023 at 09:52:36AM -0700, Yosry Ahmed wrote:
> > On Tue, May 30, 2023 at 9:22 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > >
> > > Before storing a page, zswap first checks if the number of stored pages
> > > exceeds the limit specified by memory.zswap.max, for each cgroup in the
> > > hierarchy. If this limit is reached or exceeded, then zswap shrinking is
> > > triggered and short-circuits the store attempt.
> > >
> > > However, if memory.zswap.max = 0 for a cgroup, no amount of writeback
> > > will allow future store attempts from processes in this cgroup to
> > > succeed. Furthermore, this create a pathological behavior in a system
> > > where some cgroups have memory.zswap.max = 0 and some do not: the
> > > processes in the former cgroups, under memory pressure, will evict pages
> > > stored by the latter continually, until the need for swap ceases or the
> > > pool becomes empty.
> > >
> > > As a result of this, we observe a disproportionate amount of zswap
> > > writeback and a perpetually small zswap pool in our experiments, even
> > > though the pool limit is never hit.
> > >
> > > This patch fixes the issue by rejecting zswap store attempt without
> > > shrinking the pool when memory.zswap.max is 0.
> > >
> > > Fixes: f4840ccfca25 ("zswap: memcg accounting")
> > > Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
> > > ---
> > >  include/linux/memcontrol.h | 6 +++---
> > >  mm/memcontrol.c            | 8 ++++----
> > >  mm/zswap.c                 | 9 +++++++--
> > >  3 files changed, 14 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 222d7370134c..507bed3a28b0 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -1899,13 +1899,13 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
> > >  #endif /* CONFIG_MEMCG_KMEM */
> > >
> > >  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > >  void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
> > >  void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> > >  #else
> > > -static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > +static inline int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > >  {
> > > -       return true;
> > > +       return 0;
> > >  }
> > >  static inline void obj_cgroup_charge_zswap(struct obj_cgroup *objcg,
> > >                                            size_t size)
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 4b27e245a055..09aad0e6f2ea 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -7783,10 +7783,10 @@ static struct cftype memsw_files[] = {
> > >   * spending cycles on compression when there is already no room left
> > >   * or zswap is disabled altogether somewhere in the hierarchy.
> > >   */
> > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > >  {
> > >         struct mem_cgroup *memcg, *original_memcg;
> > > -       bool ret = true;
> > > +       int ret = 0;
> > >
> > >         if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > >                 return true;
> > > @@ -7800,7 +7800,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > >                 if (max == PAGE_COUNTER_MAX)
> > >                         continue;
> > >                 if (max == 0) {
> > > -                       ret = false;
> > > +                       ret = -ENODEV;
> > >                         break;
> > >                 }
> > >
> > > @@ -7808,7 +7808,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > >                 pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
> > >                 if (pages < max)
> > >                         continue;
> > > -               ret = false;
> > > +               ret = -ENOMEM;
> > >                 break;
> > >         }
> > >         mem_cgroup_put(original_memcg);
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 59da2a415fbb..7b13dc865438 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > >         }
> > >
> > >         objcg = get_obj_cgroup_from_page(page);
> > > -       if (objcg && !obj_cgroup_may_zswap(objcg))
> > > -               goto shrink;
> > > +       if (objcg) {
> > > +               ret = obj_cgroup_may_zswap(objcg);
> > > +               if (ret == -ENODEV)
> > > +                       goto reject;
> > > +               if (ret == -ENOMEM)
> > > +                       goto shrink;
> > > +       }
> >
> > I wonder if we should just make this:
> >
> > if (objcg && !obj_cgroup_may_zswap(objcg))
> >         goto reject;
> >
> > Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
> > zswap pool will only help if we happen to writeback a page from the
> > same memcg that hit its limit. Keep in mind that we will only
> > writeback one page every time we observe that the limit is hit (even
> > with Domenico's patch, because zswap_can_accept() should be true).
> >
> > On a system with a handful of memcgs,
> > it seems likely that we wrongfully writeback pages from other memcgs
> > because of this. Achieving nothing for this memcg, while hurting
> > others. OTOH, without invoking writeback when the limit is hit, the
> > memcg will just not be able to use zswap until some pages are
> > faulted back in or invalidated.
> >
> > I am not sure which is better, just thinking out loud.
>
> You're absolutely right.
>
> Currently the choice is writing back either everybody or nobody,
> meaning between writeback and cgroup containment. They're both so poor
> that I can't say I strongly prefer one over the other.
>
> However, I have a lame argument in favor of this patch:
>
> The last few fixes from Nhat and Domenico around writeback show that
> few people, if anybody, are actually using writeback. So it might not
> actually matter that much in practice which way we go with this patch.
> Per-memcg LRUs will be necessary for it to work right.
>
> However, what Nhat is proposing is how we want the behavior down the
> line. So between two equally poor choices, I figure we might as well
> go with the one that doesn't require another code change later on.
>
> Doesn't that fill you with radiant enthusiasm?

If we have per-memcg LRUs, and memory.zswap.max == 0, then we should
be in one of two situations:

(a) memory.zswap.max has always been 0, so the LRU for this memcg is
empty, so we don't really need the special case for memory.zswap.max
== 0.

(b) memory.zswap.max was reduced to 0 at some point, and some pages
are already in zswap. In this case, I don't think shrinking the memcg
is such a bad idea, we would be lazily enforcing the limit.

In that sense I am not sure that this change won't require another
code change. It feels like special casing memory.zswap.max == 0 is
only needed now due to the lack of per-memcg LRUs.

>
> > Seems like this can be solved by having per-memcg LRUs, or at least
> > providing an argument to the shrinker of which memcg to reclaim from.
> > This would only be possible when the LRU is moved to zswap.
>
> +1