> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Sent: Wednesday, September 25, 2024 3:29 PM
> To: Yosry Ahmed <yosryahmed@xxxxxxxxxx>; Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; nphamcs@xxxxxxxxx;
> chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx; shakeel.butt@xxxxxxxxx;
> ryan.roberts@xxxxxxx; Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>; Sridhar,
> Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Subject: RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > Sent: Wednesday, September 25, 2024 2:06 PM
> > To: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> > linux-mm@xxxxxxxxx; nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> > usamaarif642@xxxxxxxxx; shakeel.butt@xxxxxxxxx; ryan.roberts@xxxxxxx;
> > Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@linux-foundation.org;
> > Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> > Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
> >
> > On Wed, Sep 25, 2024 at 1:13 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > >
> > > On Wed, Sep 25, 2024 at 12:39:02PM -0700, Yosry Ahmed wrote:
> > > > On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> > > > > > Johannes wrote:
> > > > > > > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > > > > > > scheme: check the limit up front for fast-path failure if we're
> > > > > > > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > > > > > > atomically charge against zswap.max and unwind the store if we raced.
> > > > > > >
> > > > > > > For now, I would just keep the simple version we currently have: check
> > > > > > > once in zswap_store() and then just go ahead for the whole folio.
> > > > > >
> > > > > > I am not totally against this, but I feel like it is too optimistic.
> > > > > > I think we can keep it simple-ish by maintaining an ewma for the
> > > > > > compression ratio; we already have primitives for this (see
> > > > > > DECLARE_EWMA).
> > > > > >
> > > > > > Then in zswap_store(), we can use the ewma to estimate the compressed
> > > > > > size and use it to do the memcg and global limit checks once, like we
> > > > > > do today. Instead of just checking if we are below the limits, we
> > > > > > check if we have enough headroom for the estimated compressed size.
> > > > > > Then we call zswap_store_page() to do the per-page stuff, then do
> > > > > > batched charging and stats updates.
> > > > >
> > > > > I'm not sure what you gain from making a non-atomic check precise. You
> > > > > can get a hundred threads determining precisely that *their* store
> > > > > will fit exactly into the last 800kB before the limit.
> > > >
> > > > We just get to avoid overshooting in cases where we know we probably
> > > > can't fit it anyway. If we have 4KB left and we are trying to compress
> > > > a 2MB THP, for example.
> > > > It just makes the upfront check to avoid pointless compression a
> > > > little bit more meaningful.
> > >
> > > I think I'm missing something. It's not just an upfront check, it's
> > > the only check. The charge down the line doesn't limit anything, it
> > > just counts. So if this check passes, we WILL store the folio. There
> > > is no pointless compression.
> >
> > I got confused by what you said about the fast-slow path; I thought
> > you were suggesting we do this now, so I was saying it's better to use
> > an estimate of the compressed size in the fast path to avoid pointless
> > compression.
> >
> > I missed the second paragraph.
> >
> > > We might overshoot the limit by about one folio in a single-threaded
> > > scenario. But that is negligible in comparison to the overshoot we can
> > > get due to race conditions.
> > >
> > > Again, I see no practical, meaningful difference in outcome by making
> > > that limit check any more precise. Just keep it as-is.
> > >
> > > Sorry to be blunt, but "precision" in a non-atomic check like this
> > > makes no sense. The fact that it's not too expensive is irrelevant.
> > > The discussion around this has honestly gone off the rails.
> >
> > Yeah, I thought we were talking about the version where we roll back
> > compressions if we overshoot, my bad. We discussed quite a few things
> > and I managed to confuse myself.
> >
> > > Just leave the limit checks exactly as they are. Check limits and
> > > cgroup_may_zswap() once up front. Compress the subpages. Acquire
> > > references and bump all stats in batches of folio_nr_pages(). You can
> > > add up the subpage compressed bytes in the for-loop and do the
> > > obj_cgroup_charge_zswap() in a single call at the end as well.
> >
> > We can keep the limit checks as they are for now, and revisit as needed.
>
> Thanks Johannes and Yosry for the discussion! I will proceed as suggested.

One thing I realized while reworking the patches for the batched checks is
that within zswap_store_page(), we set entry->objcg and entry->pool before
adding the entry to the xarray. Given this, wouldn't it be safer to take
the objcg and pool references per sub-page, locally in zswap_store_page(),
rather than obtaining batched references at the end once the whole folio
has been stored successfully? If we want zswap_store_page() to be
self-contained and correct with respect to creating the entry and adding
it to the xarray, that seems like the right thing to do.

I am a bit apprehensive about an entry being present in the xarray without
references held on its objcg and pool, because any page faults or
writeback that occur on sub-pages already added to the xarray, before the
entire folio has been stored, would run into issues. Just wanted to run
this by you. The rest of the batched charging, atomic updates and stats
updates should be OK.

Thanks,
Kanchana

>
> Thanks,
> Kanchana
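
P.S.: To make the above plan concrete, here is a rough sketch of the
zswap_store() structure I have in mind. This is illustrative only, not the
final patch: the unwind of already-stored sub-pages and the same-filled
handling are omitted, zswap_check_limits() is just a stand-in for the
existing global/acceptance limit checks, and helpers such as
zswap_store_page(), count_objcg_events() and the atomic_long_t
zswap_stored_pages are the ones being added/changed in this series, so the
final signatures may differ.

bool zswap_store(struct folio *folio)
{
        long nr_pages = folio_nr_pages(folio);
        struct obj_cgroup *objcg;
        struct zswap_pool *pool;
        size_t compressed_bytes = 0;
        bool ret = false;
        long i;

        VM_WARN_ON_ONCE(!folio_test_locked(folio));
        VM_WARN_ON_ONCE(!folio_test_swapcache(folio));

        if (!zswap_enabled)
                return false;

        /* Cgroup and global limit checks: once, up front, exactly as today. */
        objcg = get_obj_cgroup_from_folio(folio);
        if (objcg && !obj_cgroup_may_zswap(objcg))
                goto put_objcg;
        if (zswap_check_limits())       /* stand-in for the existing checks */
                goto put_objcg;

        pool = zswap_pool_current_get();
        if (!pool)
                goto put_objcg;

        /*
         * Compress and store each sub-page.  zswap_store_page() takes its
         * own objcg/pool references for the entry before publishing it in
         * the xarray, and returns the compressed size (or < 0 on error).
         */
        for (i = 0; i < nr_pages; i++) {
                ssize_t bytes = zswap_store_page(folio_page(folio, i),
                                                 objcg, pool);

                if (bytes < 0)
                        goto put_pool;  /* unwind of stored sub-pages omitted */
                compressed_bytes += bytes;
        }

        /* Batched charging and stats for the whole folio. */
        if (objcg) {
                obj_cgroup_charge_zswap(objcg, compressed_bytes);
                count_objcg_events(objcg, ZSWPOUT, nr_pages);
        }
        count_vm_events(ZSWPOUT, nr_pages);
        atomic_long_add(nr_pages, &zswap_stored_pages);

        ret = true;
put_pool:
        zswap_pool_put(pool);
put_objcg:
        obj_cgroup_put(objcg);
        return ret;
}

With this structure, the up-front gates stay exactly as they are today, and
everything that scales with folio_nr_pages() (the zswap.max charge, the
vmstats and zswap_stored_pages) happens once per folio.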
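
And this is roughly what I mean by zswap_store_page() being self-contained
with respect to the objcg and pool references: the per-entry references are
taken before the entry can become visible in the xarray, so a fault or
writeback racing with the rest of the folio's store never finds an entry
whose objcg/pool are unreferenced. Again, only a sketch under the same
caveats: the LRU insertion, incompressible-page handling and freeing of the
compressed object on the error path are omitted, and the exact
pool-reference helper is illustrative.

static ssize_t zswap_store_page(struct page *page, struct obj_cgroup *objcg,
                                struct zswap_pool *pool)
{
        swp_entry_t swp = page_swap_entry(page);
        struct zswap_entry *entry, *old;

        entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
        if (!entry)
                return -ENOMEM;

        /*
         * Take the per-entry objcg and pool references here, locally,
         * before the entry can be published in the xarray.  The caller
         * already holds references on both, so these cannot fail.
         */
        zswap_pool_get(pool);           /* illustrative: percpu_ref get on the pool */
        entry->pool = pool;
        if (objcg)
                obj_cgroup_get(objcg);
        entry->objcg = objcg;

        if (!zswap_compress(page, entry))
                goto put_refs;

        entry->swpentry = swp;
        old = xa_store(swap_zswap_tree(swp), swp_offset(swp), entry, GFP_KERNEL);
        if (xa_is_err(old))
                goto put_refs;
        if (old)
                zswap_entry_free(old);

        /* Per-folio charging and stats are done by the caller. */
        return entry->length;

put_refs:
        obj_cgroup_put(objcg);
        zswap_pool_put(pool);
        zswap_entry_cache_free(entry);
        return -EINVAL;
}

Charging and stats still happen once per folio in zswap_store(); only the
lifetime of the per-entry references moves into the per-page helper.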