RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Hi Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@xxxxxxxxx>
> Sent: Wednesday, August 21, 2024 7:43 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; ryan.roberts@xxxxxxx;
> Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> On Sun, Aug 18, 2024 at 10:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/4 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@xxxxxxx/T/#u
> >
> > Additionally, there is an attempt to modularize some of the functionality
> > in zswap_store(), to make it more amenable to supporting any-order
> > mTHPs.
> >
> > For instance, the determination of whether a folio is same-filled is
> > based on mapping an index into the folio to derive the page. Likewise,
> > there is a function "zswap_store_entry" added to store a zswap_entry in
> > the xarray.
> >
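
To illustrate the structure, here is a rough sketch of the per-folio flow
(not the literal patch code; apart from zswap_store_entry(), the helper
names are made up for illustration, and entry allocation details, objcg
charging and error unwinding are omitted):

    static bool zswap_store_folio_sketch(struct folio *folio)
    {
            long nr_pages = folio_nr_pages(folio);
            pgoff_t offset = swp_offset(folio->swap);
            long i;

            /* cgroup and global pool limits are checked once per folio */
            if (!zswap_check_limits(folio))
                    return false;

            for (i = 0; i < nr_pages; i++) {
                    /* map index i into the folio to derive the page */
                    struct page *page = folio_page(folio, i);
                    struct zswap_entry *entry = zswap_entry_alloc(page);

                    if (!entry)
                            goto fail;

                    /* same-filled check and compression are now per page */
                    if (!zswap_is_page_same_filled(page, entry) &&
                        !zswap_compress_page(page, entry))
                            goto fail;

                    /* store the zswap_entry for this page in the xarray */
                    if (zswap_store_entry(offset + i, entry) < 0)
                            goto fail;
            }
            return true;

    fail:
            /* unwind: invalidate any sub-page entries already stored */
            return false;
    }
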
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
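
The counter bump itself is a one-liner on the successful-store path of
zswap_store(); roughly as below, where MTHP_STAT_ZSWPOUT stands for the
new per-order stat enum this series adds (the exact enum name in the
patches may differ):

    /* whole folio stored successfully: count one zswpout of this order */
    count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
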
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on swapin_readahead(),
> > using Intel IAA hardware acceleration, which we would like to submit in
> > subsequent RFC patch-series, with performance improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> >    changes to count_mthp_stat() so that it's always defined, even when THP
> >    is disabled. Barry, I have also made one other change in page_io.c
> >    where count_mthp_stat() is called by count_swpout_vm_event(). I would
> >    appreciate it if you can review this. Thanks!
> >    Hopefully this resolves the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> >    as suggested by Ying Huang. Ying, I would appreciate it if you can
> >    review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> >    the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >    Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided in
> >    Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occurring once
> >      per folio at the beginning of zswap_store().
> >      Nhat, Ryan, please do let me know if the comments convey the summary
> >      from the RFC discussion. Thanks!
> >    - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > and with this patch-series, on a dual-socket Intel Sapphire Rapids
> > server with 56 cores per socket and 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, with a 4G SSD as the backing swap device for
> > ZSWAP. Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed. Following a similar methodology as in Ryan Roberts'
> > "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> > run, each allocating and writing 1G of memory:
> >
> >     usemem --init-time -w -O -n 70 1g
> >
> > Since the 4G SSD constrained how much swapout activity the 70 usemem
> > processes could generate, I ended up using different fixed cgroup
> > memory.high limits for the experiments with 64K mTHP and 2M THP:
> >
> > 64K mTHP experiments: cgroup memory fixed at 60G
> > 2M THP experiments  : cgroup memory fixed at 55G
> >
> > The vm/sysfs stats included after the performance data provide details
> > on the swapout activity to SSD/ZSWAP.
> >
> > Other kernel configuration parameters:
> >
> >     ZSWAP Compressor  : LZ4, DEFLATE-IAA
> >     ZSWAP Allocator   : ZSMALLOC
> >     SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the CRCs
> > returned by the hardware will be compared, and errors will be reported
> > in case of mismatches. Thus "deflate-iaa" helps ensure better data
> > integrity as compared to the software compressors.
> >
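
Conceptually, the verification path looks like the following (an
illustrative sketch only, not the actual iaa_crypto driver code;
hw_compress()/hw_decompress() are made-up names standing in for the
driver's descriptor submission):

    /* compress, then internally decompress the output and compare CRCs */
    ret = hw_compress(src, slen, dst, &dlen, &src_crc);
    if (!ret) {
            ret = hw_decompress(dst, dlen, scratch, &vlen, &verify_crc);
            if (ret || src_crc != verify_crc)
                    ret = -EINVAL;  /* report a data-integrity mismatch */
    }
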
> > Throughput reported by usemem and perf sys time for running the test
> > are as follows, averaged across 3 runs:
> >
> >  64KB mTHP (cgroup memory.high set to 60G):
> >  ==========================================
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |        16% |
> >  |------------------------------------------------------------------|
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> >  |                    |                   |        sec |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |      -158% |
> >   ------------------------------------------------------------------
> 
> Yeah no, this is not good. That throughput regression is concerning...
> 
> Is this tied to lz4 only, or do you observe similar trends in other
> compressors that are not deflate-iaa?

Let me gather data with other software compressors.

> 
> 
> >
> >   -----------------------------------------------------------------------
> >  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> >  |                              |   mainline |       Store |       Store |
> >  |                              |            |         lz4 | deflate-iaa |
> >  |-----------------------------------------------------------------------|
> >  | pswpin                       |          0 |           0 |           0 |
> >  | pswpout                      |    174,432 |           0 |           0 |
> >  | zswpin                       |        703 |         534 |         721 |
> >  | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
> >  |-----------------------------------------------------------------------|
> >  | thp_swpout                   |          0 |           0 |           0 |
> >  | thp_swpout_fallback          |          0 |           0 |           0 |
> >  | pgmajfault                   |      3,364 |       3,650 |       3,431 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
> >   -----------------------------------------------------------------------
> >
> 
> Yeah this is not good. Something fishy is going on, if we see this
> ginormous jump from 175000 (z)swpout pages to almost 1.5 million
> pages. That's a massive jump.
> 
> Either it's:
> 
> 1. Your theory - zswap store keeps banging on the limit (which suggests
> incompatibility between the way zswap currently behaves and our
> reclaim logic)
> 
> 2. The data here is ridiculously incompressible. We need to zswpout
> roughly 8.5 times as many pages, so the per-page saving is 8.5x
> smaller => we only save ~11.76% of memory for each page??? That's
> not right...
> 
> 3. There's an outright bug somewhere.
> 
> Very suspicious.
> 
> >
> >  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 55G):
> >  =======================================================
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |    190,827 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     32,026 |       -83% |
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    203,772 |         7% |
> >  |------------------------------------------------------------------|
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> >  |                    |                   |        sec |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |      27.23 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     156.52 |      -475% |
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     171.45 |      -530% |
> >   ------------------------------------------------------------------
> 
> I'm confused. This is a *regression* right? A massive one that is -
> sys time is *more* than 5 times the old value?
> 
> >
> >   -------------------------------------------------------------------------
> >  | VMSTATS, mTHP ZSWAP/SSD stats  |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> >  |                                |   mainline |       Store |       Store |
> >  |                                |            |         lz4 | deflate-iaa |
> >  |-------------------------------------------------------------------------|
> >  | pswpin                         |          0 |           0 |           0 |
> >  | pswpout                        |    797,184 |           0 |           0 |
> >  | zswpin                         |        690 |         649 |         669 |
> >  | zswpout                        |      1,465 |   1,596,382 |   1,540,766 |
> >  |-------------------------------------------------------------------------|
> >  | thp_swpout                     |      1,557 |           0 |           0 |
> >  | thp_swpout_fallback            |          0 |       3,248 |       3,752 |
> 
> This is also increased, but I suppose we're just doing more
> (z)swapping out in general...
> 
> >  | pgmajfault                     |      3,726 |       6,470 |       5,691 |
> >  |-------------------------------------------------------------------------|
> >  | hugepages-2048kB/stats/zswpout |            |       2,416 |       2,261 |
> >  |-------------------------------------------------------------------------|
> >  | hugepages-2048kB/stats/swpout  |      1,557 |           0 |           0 |
> >   -------------------------------------------------------------------------
> >
> 
> I'm not trying to delay this patch - I fully believe in supporting
> zswap for larger pages (both mTHP and THP - whatever the memory
> reclaim subsystem throws at us).
> 
> But we need to get to the bottom of this :) These are very suspicious
> and concerning data. If this is something urgent, I can live with a
> gate to enable/disable this, but I'd much prefer we understand what's
> going on here.

Thanks for this analysis. I will debug this some more, so we can better
understand these results.

Thanks,
Kanchana



