RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Sent: Wednesday, September 25, 2024 11:31 AM
> To: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>; linux-
> kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; nphamcs@xxxxxxxxx;
> chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> shakeel.butt@xxxxxxxxx; ryan.roberts@xxxxxxx; Huang, Ying
> <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> [..]
> > > > +       /*
> > > > +        * Check cgroup limits:
> > > > +        *
> > > > +        * The cgroup zswap limit check is done once at the beginning of an
> > > > +        * mTHP store, and not within zswap_store_page() for each page
> > > > +        * in the mTHP. We do however check the zswap pool limits at the
> > > > +        * start of zswap_store_page(). What this means is, the cgroup
> > > > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > > +        * However, the per-store-page zswap pool limits check should
> > > > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > > > +        * reclaim implemented in the shrinker. If this assumption holds,
> > > > +        * the cgroup exceeding the zswap limits could potentially be
> > > > +        * resolved before the next zswap_store, and if it is not, the next
> > > > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > > > +        */
> > >
> > > I do not really like this. Allowing going one page above the limit is
> > > one thing, but one THP above the limit seems too much. I also don't
> > > like relying on the repeated limit checking in zswap_store_page(), if
> > > anything I think that should be batched too.
> > >
> > > Is it too unreasonable to maintain the average compression ratio and
> > > use that to estimate limit checking for both memcg and global limits?
> > > Johannes, Nhat, any thoughts on this?
> >
> > I honestly don't think it's much of an issue. The global limit is
> > huge, and the cgroup limit is to the best of my knowledge only used as
> > a binary switch. Setting a non-binary limit - global or cgroup - seems
> > like a bit of an obscure use case to me, because in the vast majority
> > of cases it's preferable to keep compressing over declaring OOM.
> >
> > And even if you do have some granular limit, the workload size scales
> > with it. It's not like you have a thousand THPs in a 10M cgroup.
> 
> The memcg limit and zswap limit can be disproportionate, although that
> shouldn't be common.
> 
> >
> > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > scheme: check the limit up front for fast-path failure if we're
> > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > atomically charge against zswap.max and unwind the store if we raced.
> >
> > For now, I would just keep the simple version we currently have: check
> > once in zswap_store() and then just go ahead for the whole folio.
> 
> I am not totally against this but I feel like this is too optimistic.
> I think we can keep it simple-ish by maintaining an ewma for the
> compression ratio, we already have primitives for this (see
> DECLARE_EWMA).
> 
> Then in zswap_store(), we can use the ewma to estimate the compressed
> size and use it to do the memcg and global limit checks once, like we
> do today. Instead of just checking if we are below the limits, we
> check if we have enough headroom for the estimated compressed size.
> Then we call zswap_store_page() to do the per-page stuff, then do
> batched charging and stats updates.
> 
> If you think that's overkill we can keep doing the limit checks as we
> do today, but I would still like to see batching of all the limit
> checks, charging, and stats updates. It makes little sense otherwise.

Thanks Johannes and Yosry for these suggestions and pointers.
I think there is general agreement on batched charging and batched
zswap_stored_pages/stats updates. Yosry, does "batching of limit
checks" mean a single check for being over the cgroup limit at the
start of zswap_store(), with no per-page check in zswap_store_page()?
Does this also imply a zswap_pool_get_many()? Would appreciate it if
you can help clarify.
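
To make sure I understand the batching suggestion, here is a rough,
untested sketch of what I have in mind. zswap_batch_charge() and
zswap_pool_get_many()/zswap_pool_put_many() are hypothetical names,
and it assumes obj_cgroup_charge_zswap() is reworked per Johannes's
fastpath-slowpath idea to charge atomically and return an error if
we raced past zswap.max (today it returns void):

static bool zswap_batch_charge(struct zswap_pool *pool,
			       struct obj_cgroup *objcg,
			       long nr_pages, size_t total_bytes)
{
	/* One pool reference per stored page, taken in one shot. */
	if (!zswap_pool_get_many(pool, nr_pages))
		return false;

	/* Single atomic charge for the whole folio; unwind on race. */
	if (objcg && obj_cgroup_charge_zswap(objcg, total_bytes)) {
		zswap_pool_put_many(pool, nr_pages);
		return false;
	}

	/* Batched stats, assuming zswap_stored_pages is an atomic_long_t. */
	atomic_long_add(nr_pages, &zswap_stored_pages);
	count_vm_events(ZSWPOUT, nr_pages);
	return true;
}

zswap_store() would call this once after the per-page compressions
succeed, with total_bytes being the sum of the compressed lengths,
and free the entries if it fails.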

The main question in my mind about the EWMA approach is whether it
would add overhead to the normal zswap reclaim path; and if so,
whether the simple limit check at the start of zswap_store()
suggested by Johannes would suffice. I can run a few experiments to
quantify this overhead, and maybe we can revisit this?
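
For reference, here is roughly how I would wire up DECLARE_EWMA for
the ratio estimate. This is only a sketch: the function names are
made up, and the EWMA parameters and the 10-bit fixed-point scaling
are arbitrary choices that would need tuning:

#include <linux/average.h>

DECLARE_EWMA(zswap_ratio, 10, 8)

/* ewma_zswap_ratio_init(&zswap_comp_ratio) at pool creation. */
static struct ewma_zswap_ratio zswap_comp_ratio;

/* Record the ratio after each successful compression of dlen bytes. */
static void zswap_record_ratio(unsigned int dlen)
{
	ewma_zswap_ratio_add(&zswap_comp_ratio,
			     ((unsigned long)dlen << 10) / PAGE_SIZE);
}

/* Estimated compressed size of an nr_pages folio, for headroom checks. */
static u64 zswap_estimated_size(long nr_pages)
{
	return ((u64)nr_pages * PAGE_SIZE *
		ewma_zswap_ratio_read(&zswap_comp_ratio)) >> 10;
}

zswap_store() could then check that both the cgroup and global limits
have at least zswap_estimated_size(nr_pages) of headroom before
proceeding, as Yosry describes.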

Thanks,
Kanchana




