Re: [PATCH v1 2/2] mm: zswap: zswap_store_pages() simplifications for batching.

On Mon, Dec 2, 2024 at 8:19 PM Sridhar, Kanchana P
<kanchana.p.sridhar@xxxxxxxxx> wrote:
>
>
> > -----Original Message-----
> > From: Chengming Zhou <chengming.zhou@xxxxxxxxx>
> > Sent: Monday, December 2, 2024 7:06 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>; Yosry Ahmed
> > <yosryahmed@xxxxxxxxxx>
> > Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> > hannes@xxxxxxxxxxx; nphamcs@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> > ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> > Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
> > <vinodh.gopal@xxxxxxxxx>
> > Subject: Re: [PATCH v1 2/2] mm: zswap: zswap_store_pages() simplifications
> > for batching.
> >
> > On 2024/12/3 09:01, Sridhar, Kanchana P wrote:
> > > Hi Chengming, Yosry,
> > >
> > >> -----Original Message-----
> > >> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > >> Sent: Monday, December 2, 2024 11:33 AM
> > >> To: Chengming Zhou <chengming.zhou@xxxxxxxxx>
> > >> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>; linux-
> > >> kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> > >> nphamcs@xxxxxxxxx; usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx;
> > >> 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; Feghali, Wajdi K
> > >> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > >> Subject: Re: [PATCH v1 2/2] mm: zswap: zswap_store_pages() simplifications
> > >> for batching.
> > >>
> > >> On Wed, Nov 27, 2024 at 11:00 PM Chengming Zhou
> > >> <chengming.zhou@xxxxxxxxx> wrote:
> > >>>
> > >>> On 2024/11/28 06:53, Kanchana P Sridhar wrote:
> > >>>> In order to set up zswap_store_pages() to enable a clean batching
> > >>>> implementation in [1], this patch implements the following changes:
> > >>>>
> > >>>> 1) Addition of zswap_alloc_entries() which will allocate zswap entries for
> > >>>>      all pages in the specified range for the folio, upfront. If this fails,
> > >>>>      we return an error status to zswap_store().
> > >>>>
> > >>>> 2) Addition of zswap_compress_pages() that calls zswap_compress() for
> > >>>>      each page, and returns false if any zswap_compress() fails, so
> > >>>>      zswap_store_pages() can clean up allocated resources and return an
> > >>>>      error status to zswap_store().
> > >>>>
> > >>>> 3) A "store_pages_failed" label that is a catch-all for all failure points
> > >>>>      in zswap_store_pages(). This facilitates cleaner error handling within
> > >>>>      zswap_store_pages(), which will become important for IAA compress
> > >>>>      batching in [1].
> > >>>>
> > >>>> [1]: https://patchwork.kernel.org/project/linux-mm/list/?series=911935
> > >>>>
> > >>>> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
> > >>>> ---
> > >>>>    mm/zswap.c | 93 +++++++++++++++++++++++++++++++++++++++++-------------
> > >>>>    1 file changed, 71 insertions(+), 22 deletions(-)
> > >>>>
> > >>>> diff --git a/mm/zswap.c b/mm/zswap.c
> > >>>> index b09d1023e775..db80c66e2205 100644
> > >>>> --- a/mm/zswap.c
> > >>>> +++ b/mm/zswap.c
> > >>>> @@ -1409,9 +1409,56 @@ static void shrink_worker(struct work_struct *w)
> > >>>>    * main API
> > >>>>    **********************************/
> > >>>>
> > >>>> +static bool zswap_compress_pages(struct page *pages[],
> > >>>> +                              struct zswap_entry *entries[],
> > >>>> +                              u8 nr_pages,
> > >>>> +                              struct zswap_pool *pool)
> > >>>> +{
> > >>>> +     u8 i;
> > >>>> +
> > >>>> +     for (i = 0; i < nr_pages; ++i) {
> > >>>> +             if (!zswap_compress(pages[i], entries[i], pool))
> > >>>> +                     return false;
> > >>>> +     }
> > >>>> +
> > >>>> +     return true;
> > >>>> +}
> > >>>
> > >>> How about introducing a `zswap_compress_folio()` interface which
> > >>> can be used by `zswap_store()`?
> > >>> ```
> > >>> zswap_store()
> > >>>          nr_pages = folio_nr_pages(folio)
> > >>>
> > >>>          entries = zswap_alloc_entries(nr_pages)
> > >>>
> > >>>          ret = zswap_compress_folio(folio, entries, pool)
> > >>>
> > >>>          // store entries into xarray and LRU list
> > >>> ```
> > >>>
> > >>> And this version `zswap_compress_folio()` is very simple for now:
> > >>> ```
> > >>> zswap_compress_folio()
> > >>>          nr_pages = folio_nr_pages(folio)
> > >>>
> > >>>          for (index = 0; index < nr_pages; ++index) {
> > >>>                  struct page *page = folio_page(folio, index);
> > >>>
> > >>>                  if (!zswap_compress(page, &entries[index], pool))
> > >>>                          return false;
> > >>>          }
> > >>>
> > >>>          return true;
> > >>> ```
> > >>> This can be easily extended to support your "batched" version.
> > >>>
> > >>> Then the old `zswap_store_page()` could be removed.
> > >>>
> > >>> The good point is simplicity: we don't need to slice the folio
> > >>> into multiple batches and then repeat the common operations for
> > >>> each batch, like preparing entries and storing into the xarray
> > >>> and LRU list...
> > >>
> > >> +1
> > >
> > > Thanks for the code review comments. One question though: would
> > > it make sense to trade off the memory footprint cost against the
> > > code simplification? For instance, let's say we want to store a 64k
> > > folio. We would allocate memory for 16 zswap entries upfront, and
> > > if one of the compressions fails, we would deallocate memory for
> > > all 16 zswap entries. Could this lead to zswap_entry kmem_cache
> > > starvation and subsequent zswap_store() failures in multi-process
> > > scenarios?
> >
> > Ah, I see your concern. But that's the unlikely case, right?
> >
> > If the case you mentioned above happens a lot, then yes, I think we
> > should optimize its memory footprint to avoid the allocation and
> > deallocation.
>
> Thanks Chengming. I see your point. Let me gather performance data
> for the two options, and share.

Yeah, I think we shouldn't optimize for the uncommon case, at least
not until there's a real problem that needs fixing.

>
> >
> > On the other hand, we should consider that a folio will compress
> > successfully in most cases, so we have to allocate all the entries
> > eventually anyway.
> >
> > Based on that consideration, I think your way is OK too, although
> > I think patch 2/2 should be dropped, since it hides the pages loop
> > in smaller functions, as Yosry mentioned too.
>
> My main intent with patch 2/2 was to set up a common, simpler error
> handling path, whether errors are encountered during compression,
> zpool_malloc(), or the xarray store. Hence, I initialize each allocated
> zswap_entry's handle in zswap_alloc_entries() to ERR_PTR(-EINVAL), so
> it is easy for the common error handling code in patch 2 to determine
> whether the handle was allocated (and hence needs to be freed). This
> benefits the batching code by eliminating the need to track which
> stage of zswap_store_pages() saw an error in order to decide which
> resources need to be freed.
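
(For illustration only: a minimal sketch of the sentinel-handle pattern
described above. The helper names, the nid parameter, and the
pool->zpool field are assumptions based on the current zswap code, not
the actual patch.)

```
/*
 * Pre-mark each handle as invalid at allocation time, so a single
 * cleanup path can tell whether zpool memory was ever allocated for
 * an entry, no matter which stage of the store failed.
 */
static bool zswap_alloc_entries(struct zswap_entry *entries[],
				u8 nr_pages, int nid)
{
	u8 i;

	for (i = 0; i < nr_pages; ++i) {
		entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, nid);
		if (!entries[i]) {
			while (i--)
				zswap_entry_cache_free(entries[i]);
			return false;
		}
		entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
	}

	return true;
}

/* Common cleanup: correct regardless of where the failure happened. */
static void zswap_cleanup_entries(struct zswap_entry *entries[],
				  u8 nr_pages, struct zswap_pool *pool)
{
	u8 i;

	for (i = 0; i < nr_pages; ++i) {
		/* Free zpool memory only if zswap_compress() set a handle. */
		if (!IS_ERR_VALUE(entries[i]->handle))
			zpool_free(pool->zpool, entries[i]->handle);
		zswap_entry_cache_free(entries[i]);
	}
}
```
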
>
> My key consideration is to keep the batching error handling code simple,
> hence these changes in patch 2. The changes described above would
> help batching, and should not impact the non-batching case, as indicated
> by the regression testing data I've included in the cover letter.
>
> I don't mind inlining the implementation of the helper functions, as I
> mentioned in my response to Yosry. I am hoping the error handling
> simplifications are acceptable, since they will help the batching code.

I still think having the loops open-coded is better than separate
helpers. But I understand not wanting the loops directly in
zswap_store(), as the error handling is simpler if we do it in a
separate function like zswap_store_pages().

How about we just move the loop from zswap_store() to
zswap_store_page() and call it zswap_store_folio()? When batching is
added, I imagine we may need to split the loop into two loops, before
and after zswap_compress_folio(), which isn't very neat but is
probably fine.
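
A minimal sketch of that shape, assuming zswap_compress_folio() as
suggested above and the existing helpers (swap_zswap_tree(),
zswap_lru_add(), etc.); refcounts, objcg charging, and error unwinding
are elided:

```
static bool zswap_store_folio(struct folio *folio,
			      struct zswap_entry *entries[],
			      struct zswap_pool *pool)
{
	long nr_pages = folio_nr_pages(folio);
	long i;

	/* Loop 1: compress every page -- the future batching point. */
	if (!zswap_compress_folio(folio, entries, pool))
		return false;

	/* Loop 2: publish the entries to the xarray and the LRU. */
	for (i = 0; i < nr_pages; ++i) {
		swp_entry_t swp = page_swap_entry(folio_page(folio, i));
		struct zswap_entry *old;

		entries[i]->swpentry = swp;
		old = xa_store(swap_zswap_tree(swp), swp_offset(swp),
			       entries[i], GFP_KERNEL);
		if (xa_is_err(old))
			return false;
		/* A duplicate store replaces the old entry. */
		if (old)
			zswap_entry_free(old);
		zswap_lru_add(&zswap_list_lru, entries[i]);
	}

	return true;
}
```
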




