Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Mon, 30 Sep 2024 16:19:57 -0700

On Mon, Sep 30, 2024 at 4:11 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Mon, Sep 30, 2024 at 3:12 PM Kanchana P Sridhar
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
> >
> > zswap_store() will store large folios by compressing them page by page.
> >
> > This patch provides a sequential implementation of storing a large folio
> > in zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > zswap_store() calls the newly added zswap_store_page() function for each
> > page in the folio. zswap_store_page() handles compressing and storing each
> > page.
> >
> > We check the global and per-cgroup limits once at the beginning of
> > zswap_store(), and only check that the limit is not reached yet. This is
> > racy and inaccurate, but it should be sufficient for now. We also obtain
> > initial references to the relevant objcg and pool to guarantee that
> > subsequent references can be acquired by zswap_store_page(). A new function
> > zswap_pool_get() is added to facilitate this.
> >
> > If these one-time checks pass, we compress the pages of the folio, while
> > maintaining a running count of compressed bytes for all the folio's pages.
> > If all pages are successfully compressed and stored, we do the cgroup
> > zswap charging with the total compressed bytes, and batch update the
> > zswap_stored_pages atomic/zswpout event stats with folio_nr_pages() once,
> > before returning from zswap_store().
> >
> > If an error is encountered during the store of any page in the folio,
> > all pages in that folio currently stored in zswap will be invalidated.
> > Thus, a folio is either entirely stored in zswap, or entirely not stored
> > in zswap.
> >
> > The most important value provided by this patch is it enables swapping out
> > large folios to zswap without splitting them. Furthermore, it batches some
> > operations while doing so (cgroup charging, stats updates).
> >
> > This patch also forms the basis for building compress batching of pages in
> > a large folio in zswap_store() by compressing up to say, 8 pages of the
> > folio in parallel in hardware using the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@xxxxxxx/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
> > ---
> >  mm/zswap.c | 220 +++++++++++++++++++++++++++++++++++++----------------
> >  1 file changed, 153 insertions(+), 67 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 2b8da50f6322..b74c8de99646 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -411,6 +411,12 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
> >         return percpu_ref_tryget(&pool->ref);
> >  }
> >
> > +/* The caller must already have a reference. */
> > +static void zswap_pool_get(struct zswap_pool *pool)
> > +{
> > +       percpu_ref_get(&pool->ref);
> > +}
> > +
> >  static void zswap_pool_put(struct zswap_pool *pool)
> >  {
> >         percpu_ref_put(&pool->ref);
> > @@ -1402,68 +1408,52 @@ static void shrink_worker(struct work_struct *w)
> >  /*********************************
> >  * main API
> >  **********************************/
> > -bool zswap_store(struct folio *folio)
> > +
> > +/*
> > + * Stores the page at specified "index" in a folio.
> > + *
> > + * @page:  The page to store in zswap.
> > + * @objcg: The folio's objcg. Caller has a reference.
> > + * @pool:  The zswap_pool to store the compressed data for the page.
> > + *         The caller should have obtained a reference to a valid
> > + *         zswap_pool by calling zswap_pool_tryget(), to pass as this
> > + *         argument.
> > + * @tree:  The xarray for the @page's folio's swap.
> > + * @compressed_bytes: The compressed entry->length value is added
> > + *                    to this, so that the caller can get the total
> > + *                    compressed lengths of all sub-pages in a folio.
> > + */
> > +static bool zswap_store_page(struct page *page,
> > +                            struct obj_cgroup *objcg,
> > +                            struct zswap_pool *pool,
> > +                            struct xarray *tree,
> > +                            size_t *compressed_bytes)
> >  {
> > -       swp_entry_t swp = folio->swap;
> > -       pgoff_t offset = swp_offset(swp);
> > -       struct xarray *tree = swap_zswap_tree(swp);
> >         struct zswap_entry *entry, *old;
> > -       struct obj_cgroup *objcg = NULL;
> > -       struct mem_cgroup *memcg = NULL;
> > -
> > -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > -
> > -       /* Large folios aren't supported */
> > -       if (folio_test_large(folio))
> > -               return false;
> > -
> > -       if (!zswap_enabled)
> > -               goto check_old;
> > -
> > -       /* Check cgroup limits */
> > -       objcg = get_obj_cgroup_from_folio(folio);
> > -       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > -               memcg = get_mem_cgroup_from_objcg(objcg);
> > -               if (shrink_memcg(memcg)) {
> > -                       mem_cgroup_put(memcg);
> > -                       goto reject;
> > -               }
> > -               mem_cgroup_put(memcg);
> > -       }
> > -
> > -       if (zswap_check_limits())
> > -               goto reject;
> >
> >         /* allocate entry */
> > -       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > +       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
> >         if (!entry) {
> >                 zswap_reject_kmemcache_fail++;
> >                 goto reject;
> >         }
> >
> > -       /* if entry is successfully added, it keeps the reference */
> > -       entry->pool = zswap_pool_current_get();
> > -       if (!entry->pool)
> > -               goto freepage;
> > +       /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> > +       if (objcg)
> > +               obj_cgroup_get(objcg);
> > +       zswap_pool_get(pool);
>
> Should we also batch-get references to the pool as well? i.e add a
> helper function:
>
> /* The caller must already have a reference. */
> static void zswap_pool_get_many(struct zswap_pool *pool, unsigned long nr)
> {
>        percpu_ref_get_many(&pool->ref, nr);
> }
>
> then do it in one fell swoop after you're done storing all individual subpages
> (near atomic_long_add(nr_pages, &zswap_stored_pages)).
>
> Do double check that it is safe - I think it should be, since we have
> the folio locked in swapcache, so there should not be any shenanigans
> (e.g. no race with concurrent free or writeback).
>
> Perhaps a fixlet suffices?

I suggested this in a previous version, and Kanchana ran into some
complications implementing it:
https://lore.kernel.org/lkml/SJ0PR11MB56785027ED6FCF673A84CEE6C96A2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

Basically, if we batch-get the refs after the store, I think it's not
safe: once an entry is published and visible to writeback, it can be
written back and freed, and freeing it would drop a ref that we never
acquired.
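
To make the ordering problem concrete, here is a rough userspace model of
it. This is only a sketch: struct model_pool and model_store_then_get() are
made-up names, a plain counter stands in for the percpu_ref, and "writeback
frees an entry" is reduced to a single decrement.

```c
#include <assert.h>

/* Hypothetical stand-in for a zswap pool; not the kernel structure. */
struct model_pool {
	long refs;	/* plain counter standing in for pool->ref */
};

/*
 * Deferred scheme: nr entries are published first without taking any
 * per-entry pool ref, writeback frees nr_freed of them (each free drops
 * one pool ref), and only afterwards are nr refs taken in one batch.
 * Returns the lowest counter value seen before the batched get.
 */
static long model_store_then_get(struct model_pool *pool, long nr,
				 long nr_freed)
{
	long lowest = pool->refs;
	long i;

	for (i = 0; i < nr_freed; i++) {
		/* Entry freed by writeback: drops a ref nobody took yet. */
		pool->refs--;
		if (pool->refs < lowest)
			lowest = pool->refs;
	}

	/* The late, batched get. */
	pool->refs += nr;
	return lowest;
}
```

Starting from the caller's single ref, publishing 4 entries and letting 2
of them be freed before the batched get takes the counter below the one
ref the caller actually holds, even though the final total comes out right.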

Getting refs before the store would work, but then if the store fails
at an arbitrary page, we need to drop refs on the pool only for the
pages that were not added to the tree, since the cleanup loop with
zswap_entry_free() at the end of zswap_store() already drops the ref
for those that were added to the tree.
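
The bookkeeping for that variant would look roughly like the userspace
sketch below. It is not zswap code: struct model_pool and
model_store_with_pre_get() are hypothetical, the counter stands in for the
pool's percpu_ref, and a failure is simulated by the fail_at index.

```c
#include <assert.h>

/* Hypothetical stand-in for a zswap pool; not the kernel structure. */
struct model_pool {
	long refs;	/* plain counter standing in for pool->ref */
};

/*
 * Take nr refs up front, then "store" pages one at a time. The store
 * fails at page index fail_at; pass fail_at >= nr for a full success.
 * Returns the number of pages left stored (nr on success, 0 on failure).
 */
static long model_store_with_pre_get(struct model_pool *pool, long nr,
				     long fail_at)
{
	long stored = 0;
	long i;

	/* Batched get before any entry becomes visible. */
	pool->refs += nr;

	for (i = 0; i < nr && i != fail_at; i++)
		stored++;	/* entry added to the tree, owns one ref */

	if (stored == nr)
		return stored;

	/* Cleanup loop: freeing each entry in the tree drops its ref. */
	for (i = 0; i < stored; i++)
		pool->refs--;

	/* Refs taken for pages that never reached the tree. */
	pool->refs -= nr - stored;
	return 0;
}
```

With the caller holding one ref, a store of 8 pages that fails at index 3
ends with the counter back at 1: three refs are dropped by the cleanup
loop and the other five are dropped explicitly. Dropping all 8 in the
error path on top of the cleanup loop would over-put by three.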

We agreed to (potentially) do the batching for refcounts as a followup.