RE: [RFC PATCH v1 00/13] zswap IAA compress batching

"Sridhar, Kanchana P" <kanchana.p.sridhar@xxxxxxxxx> · Wed, 23 Oct 2024 20:34:05 +0000

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Sent: Wednesday, October 23, 2024 11:16 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; Huang, Ying
> <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> linux-crypto@xxxxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx;
> davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> <kristen.c.accardi@xxxxxxxxx>; zanussi@xxxxxxxxxx; viro@xxxxxxxxxxxxxxxxxx;
> brauner@xxxxxxxxxx; jack@xxxxxxx; mcgrof@xxxxxxxxxx; kees@xxxxxxxxxx;
> joel.granados@xxxxxxxxxx; bfoster@xxxxxxxxxx; willy@xxxxxxxxxxxxx; linux-
> fsdevel@xxxxxxxxxxxxxxx; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal,
> Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [RFC PATCH v1 00/13] zswap IAA compress batching
> 
> On Tue, Oct 22, 2024 at 7:53 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
> >
> > Hi Yosry,
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > > Sent: Tuesday, October 22, 2024 5:57 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > > Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> > > hannes@xxxxxxxxxxx; nphamcs@xxxxxxxxx;
> chengming.zhou@xxxxxxxxx;
> > > usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; Huang, Ying
> > > <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@linux-
> foundation.org;
> > > linux-crypto@xxxxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx;
> > > davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> > > ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> > > <kristen.c.accardi@xxxxxxxxx>; zanussi@xxxxxxxxxx;
> viro@xxxxxxxxxxxxxxxxxx;
> > > brauner@xxxxxxxxxx; jack@xxxxxxx; mcgrof@xxxxxxxxxx;
> kees@xxxxxxxxxx;
> > > joel.granados@xxxxxxxxxx; bfoster@xxxxxxxxxx; willy@xxxxxxxxxxxxx;
> linux-
> > > fsdevel@xxxxxxxxxxxxxxx; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> Gopal,
> > > Vinodh <vinodh.gopal@xxxxxxxxx>
> > > Subject: Re: [RFC PATCH v1 00/13] zswap IAA compress batching
> > >
> > > On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> > > <kanchana.p.sridhar@xxxxxxxxx> wrote:
> > > >
> > > >
> > > > IAA Compression Batching:
> > > > =========================
> > > >
> > > > This RFC patch-series introduces the use of the Intel Analytics
> Accelerator
> > > > (IAA) for parallel compression of pages in a folio, and for batched reclaim
> > > > of hybrid any-order batches of folios in shrink_folio_list().
> > > >
> > > > The patch-series is organized as follows:
> > > >
> > > >  1) iaa_crypto driver enablers for batching: Relevant patches are tagged
> > > >     with "crypto:" in the subject:
> > > >
> > > >     a) async poll crypto_acomp interface without interrupts.
> > > >     b) crypto testmgr acomp poll support.
> > > >     c) Modifying the default sync_mode to "async" and disabling
> > > >        verify_compress by default, to facilitate users to run IAA easily for
> > > >        comparison with software compressors.
> > > >     d) Changing the cpu-to-iaa mappings to more evenly balance cores to
> IAA
> > > >        devices.
> > > >     e) Addition of a "global_wq" per IAA, which can be used as a global
> > > >        resource for the socket. If the user configures 2WQs per IAA device,
> > > >        the driver will distribute compress jobs from all cores on the
> > > >        socket to the "global_wqs" of all the IAA devices on that socket, in
> > > >        a round-robin manner. This can be used to improve compression
> > > >        throughput for workloads that see a lot of swapout activity.
> > > >
> > > >  2) Migrating zswap to use async poll in
> zswap_compress()/decompress().
> > > >  3) A centralized batch compression API that can be used by swap
> modules.
> > > >  4) IAA compress batching within large folio zswap stores.
> > > >  5) IAA compress batching of any-order hybrid folios in
> > > >     shrink_folio_list(). The newly added "sysctl vm.compress-batchsize"
> > > >     parameter can be used to configure the number of folios in [1, 32] to
> > > >     be reclaimed using compress batching.
> > >
> > > I am still digesting this series but I have some high level questions
> > > that I left on some patches. My intuition though is that we should
> > > drop (5) from the initial proposal as it's most controversial.
> > > Batching reclaim of unrelated folios through zswap *might* make sense,
> > > but it needs a broader conversation and it needs justification on its
> > > own merit, without the rest of the series.
> >
> > Thanks for these suggestions!  Sure, I can drop (5) from the initial patch-set.
> > Agree also, this needs a broader discussion.
> >
> > I believe the 4K folios usemem30 data in this patchset does bring across
> > the batching reclaim benefits to provide justification on its own merit. I
> added
> > the data on batching reclaim with kernel compilation as part of the 4K folios
> > experiments in the IAA decompression batching patch-series [1].
> > Listing it here as well. I will make sure to add this data in subsequent revs.
> >
> > --------------------------------------------------------------------------
> >  Kernel compilation in tmpfs/allmodconfig, 2G max memory:
> >
> >  No large folios          mm-unstable-10-16-2024       shrink_folio_list()
> >                                                        batching of folios
> >  --------------------------------------------------------------------------
> >  zswap compressor         zstd       deflate-iaa       deflate-iaa
> >  vm.compress-batchsize     n/a               n/a                32
> >  vm.page-cluster             3                 3                 3
> >  --------------------------------------------------------------------------
> >  real_sec               783.87            761.69            747.32
> >  user_sec            15,750.07         15,716.69         15,728.39
> >  sys_sec              6,522.32          5,725.28          5,399.44
> >  Max_RSS_KB          1,872,640         1,870,848         1,874,432
> >
> >  zswpout            82,364,991        97,739,600       102,780,612
> >  zswpin             21,303,393        27,684,166        29,016,252
> >  pswpout                    13               222               213
> >  pswpin                     12               209               202
> >  pgmajfault         17,114,339        22,421,211        23,378,161
> >  swap_ra             4,596,035         5,840,082         6,231,646
> >  swap_ra_hit         2,903,249         3,682,444         3,940,420
> >  --------------------------------------------------------------------------
> >
> > The performance improvements seen does depend on compression batching
> in
> > the swap modules (zswap). The implementation in patch 12 in the compress
> > batching series sets up this zswap compression pipeline, that takes an array
> of
> > folios and processes them in batches of 8 pages compressed in parallel in
> hardware.
> > That being said, we do see latency improvements even with reclaim
> batching
> > combined with zswap compress batching with zstd/lzo-rle/etc. I haven't
> done a
> > lot of analysis of this, but I am guessing fewer calls from the swap layer
> > (swap_writepage()) into zswap could have something to do with this. If we
> believe
> > that batching can be the right thing to do even for the software
> compressors,
> > I can gather batching data with zstd for v2.
> 
> Thanks for sharing the data. What I meant is, I think we should focus
> on supporting large folio compression batching for this series, and
> only present figures for this support to avoid confusion.
> 
> Once this lands, we can discuss support for batching the compression
> of different unrelated folios separately, as it spans areas beyond
> just zswap and will need broader discussion.

Absolutely, this makes sense, thanks Yosry! I will address this in v2.

Thanks,
Kanchana