Re: [PATCH v2 mm-hotfixes] mm/zswap: fix inconsistent charging when zswap_store_page() fails

On Wed, Jan 29, 2025 at 4:19 AM Yosry Ahmed <yosry.ahmed@xxxxxxxxx> wrote:
>
> On Tue, Jan 28, 2025 at 07:09:05PM +0000, Sridhar, Kanchana P wrote:
> > Hi Hyeonggon,
> >
> > > -----Original Message-----
> > > From: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx>
> > > Sent: Tuesday, January 28, 2025 10:55 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>; Johannes Weiner
> > > <hannes@xxxxxxxxxxx>; Yosry Ahmed <yosryahmed@xxxxxxxxxx>; Nhat
> > > Pham <nphamcs@xxxxxxxxx>; Chengming Zhou
> > > <chengming.zhou@xxxxxxxxx>; Andrew Morton <akpm@linux-foundation.org>
> > > Cc: linux-mm@xxxxxxxxx; Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx>;
> > > stable@xxxxxxxxxxxxxxx
> > > Subject: [PATCH v2 mm-hotfixes] mm/zswap: fix inconsistent charging when
> > > zswap_store_page() fails
> > >
> > > Commit b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
> > > skips charging any zswapped base pages when it fails to zswap the entire
> > > folio.
> > >
> > > However, when some base pages are zswapped but zswapping the entire
> > > folio fails, the zswap operation is rolled back.
> > > When freeing zswap entries for those pages, zswap_entry_free() uncharges
> > > pages that were never charged, causing zswap charging to become
> > > inconsistent.
> > >
> > > This inconsistency triggers two warnings with the following steps:
> > >   # On a machine with 64GiB of RAM and 36GiB of zswap
> > >   $ stress-ng --bigheap 2 # wait until the OOM-killer kills stress-ng
> > >   $ sudo reboot
> > >
> > >   The two warnings are:
> > >     in mm/memcontrol.c:163, function obj_cgroup_release():
> > >       WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
> > >
> > >     in mm/page_counter.c:60, function page_counter_cancel():
> > >       if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n",
> > >                     new, nr_pages))
> > >
> > > While objcg events should only be accounted when the entire folio is
> > > zswapped, objcg charging should be performed regardless.
> > > Fix accordingly.
> > >
> > > After resolving the inconsistency, these warnings disappear.
> > >
> > > Fixes: b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
> > > Cc: stable@xxxxxxxxxxxxxxx
> > > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx>
> > > ---
> > >
> > > v1->v2:
> > >
> > >  Fixed objcg events being accounted for on zswap failure.
> > >
> > >  Fixed the incorrect description. I had misunderstood that the base pages
> > >  would remain stored in zswap; in fact their zswap entries are freed
> > >  immediately.
> > >
> > >  Added a comment on why it charges pages that are going to be removed
> > >  from zswap.
> > >
> > >  mm/zswap.c | 14 ++++++++++----
> > >  1 file changed, 10 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 6504174fbc6a..10b30ac46deb 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1568,20 +1568,26 @@ bool zswap_store(struct folio *folio)
> > >
> > >             bytes = zswap_store_page(page, objcg, pool);
> > >             if (bytes < 0)
> > > -                   goto put_pool;
> > > +                   goto charge_zswap;
> > >             compressed_bytes += bytes;
> > >     }
> > >
> > > -   if (objcg) {
> > > -           obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > > +   if (objcg)
> > >             count_objcg_events(objcg, ZSWPOUT, nr_pages);
> > > -   }
> > >
> > >     atomic_long_add(nr_pages, &zswap_stored_pages);
> > >     count_vm_events(ZSWPOUT, nr_pages);
> > >
> > >     ret = true;
> > >
> > > +charge_zswap:
> > > +   /*
> > > +    * Charge zswapped pages even when it failed to zswap the entire folio,
> > > +    * because zswap_entry_free() will uncharge them anyway.
> > > +    * Otherwise zswap charging will become inconsistent.
> > > +    */
> > > +   if (objcg)
> > > +           obj_cgroup_charge_zswap(objcg, compressed_bytes);
> >
> > Thanks for finding this bug! I am thinking it might make sense to charge
> > and increment the zswap_stored_pages counter in zswap_store_page().
> > Something like:
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index b84c20d889b1..fd2a72598a8a 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1504,11 +1504,14 @@ static ssize_t zswap_store_page(struct page *page,
> >       entry->pool = pool;
> >       entry->swpentry = page_swpentry;
> >       entry->objcg = objcg;
> > +     if (objcg)
> > +             obj_cgroup_charge_zswap(objcg, entry->length);
> >       entry->referenced = true;
> >       if (entry->length) {
> >               INIT_LIST_HEAD(&entry->lru);
> >               zswap_lru_add(&zswap_list_lru, entry);
> >       }
> > +     atomic_long_inc(&zswap_stored_pages);
> >
> >       return entry->length;
> >
> > @@ -1526,7 +1529,6 @@ bool zswap_store(struct folio *folio)
> >       struct obj_cgroup *objcg = NULL;
> >       struct mem_cgroup *memcg = NULL;
> >       struct zswap_pool *pool;
> > -     size_t compressed_bytes = 0;
> >       bool ret = false;
> >       long index;
> >
> > @@ -1569,15 +1571,11 @@ bool zswap_store(struct folio *folio)
> >               bytes = zswap_store_page(page, objcg, pool);
> >               if (bytes < 0)
> >                       goto put_pool;
> > -             compressed_bytes += bytes;
> >       }
> >
> > -     if (objcg) {
> > -             obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > +     if (objcg)
> >               count_objcg_events(objcg, ZSWPOUT, nr_pages);
> > -     }
> >
> > -     atomic_long_add(nr_pages, &zswap_stored_pages);
> >       count_vm_events(ZSWPOUT, nr_pages);
> >
> >       ret = true;
> >
> > What do you think?
> >
> > Yosry, Nhat, Johannes, please let me know if this would be a cleaner
> > approach. If so, I don't think we would be losing a lot of performance
> > by not doing the one-time charge per folio, but please let me know
> > your thoughts as well.
>
> This is certainly cleaner, and thanks for catching that the
> zswap_stored_pages cleanup is also wrong.

Oh, yeah. Thanks for catching zswap_stored_pages.
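
For anyone following along, the free path undoes both sides of the
accounting unconditionally. Roughly (paraphrased from mm/zswap.c, so
the details may differ between trees):

	static void zswap_entry_free(struct zswap_entry *entry)
	{
		zswap_lru_del(&zswap_list_lru, entry);
		zpool_free(entry->pool->zpool, entry->handle);
		zswap_pool_put(entry->pool);
		if (entry->objcg) {
			/* uncharges even if zswap_store() bailed before charging */
			obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
			obj_cgroup_put(entry->objcg);
		}
		zswap_entry_cache_free(entry);
		/* decrements even for entries freed during a rollback */
		atomic_long_dec(&zswap_stored_pages);
	}

So whatever zswap_store() does on the failure path, it has to leave each
already-created entry both charged and counted.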

> I am not sure if this has meaningful impact on performance, but it seems
> like we are doing a bit more work in the common success case to avoid
> the work in the uncommon failure case.

Right, but at the same time I think we need some evaluation before
sacrificing readability for performance. No need to rush optimization
when fixing a bug, I think.

> Moving the charge (and atomic addition) above the zswap_store_page()
> loop would be doing the opposite, albeit less clean.

I think that wouldn't work, because zswap won't uncharge pages that
zswap_store_page() fails to store, and we don't know in advance how
many pages will end up stored in zswap.
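
To illustrate (a hypothetical ordering, not from any version of the
patch), charging before the loop would look roughly like:

	/*
	 * Hypothetical: charge up front. The compressed size is not
	 * known until after compression, and a mid-loop failure leaves
	 * the never-stored pages charged with no entry to uncharge them.
	 */
	if (objcg)
		obj_cgroup_charge_zswap(objcg, compressed_bytes); /* unknown yet */

	for (index = 0; index < nr_pages; ++index) {
		bytes = zswap_store_page(page, objcg, pool);
		if (bytes < 0)
			goto put_pool;	/* rest of the folio: charged, never stored */
		compressed_bytes += bytes;
	}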

IMO v2 of this patch is efficient and it works, while charging pages
just to uncharge them right away doesn't look great.

> I don't feel strongly either way, but I slightly prefer the latter.

Since this is a hotfix, I'd prefer the more readable code.
Let me post v3 with Sridhar's feedback (the former approach) applied!
