Re: [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios

Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx> writes:

> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the 
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to v6.11-rc3 in patch 2/4 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@xxxxxxx/T/#u
>
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting any-order
> mTHPs.
>
> For instance, the determination of whether a folio is same-filled is
> based on mapping an index into the folio to derive the page. Likewise,
> there is a function "zswap_store_entry" added to store a zswap_entry in
> the xarray.
>
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon successful zswap_store of
> an mTHP folio:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
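
For anyone who wants to watch these counters during a run, here is a
minimal sketch in Python. It assumes only the sysfs layout quoted above;
the function name and the "root" parameter are hypothetical and exist
just so the reader can point it at a test directory. On a kernel without
this series the zswpout files are absent and the result is empty.

```python
import glob
import os
import re


def read_mthp_zswpout(root="/sys/kernel/mm/transparent_hugepage"):
    """Return {folio_size_kB: count} for each hugepages-*kB/stats/zswpout."""
    counters = {}
    pattern = os.path.join(root, "hugepages-*kB", "stats", "zswpout")
    for path in glob.glob(pattern):
        # Directory names look like "hugepages-64kB"; extract the size.
        m = re.search(r"hugepages-(\d+)kB", path)
        if not m:
            continue
        with open(path) as f:
            counters[int(m.group(1))] = int(f.read().strip())
    return counters


if __name__ == "__main__":
    for size_kb, count in sorted(read_mthp_zswpout().items()):
        print(f"{size_kb:>6} kB: {count}")
```
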
> This patch-series is a precursor to ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> using Intel IAA hardware acceleration, which we would like to submit in
> subsequent RFC patch-series, with performance improvement data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Changes since RFC v1:
> =====================
>
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>    Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
>    Ryan's initial RFC [1]:
>    - Added a comment about the cgroup zswap limit checks occurring once per
>      folio at the beginning of zswap_store().
>      Nhat, Ryan, please do let me know if the comments convey the summary
>      from the RFC discussion. Thanks!
>    - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
> Performance Testing:
> ====================
> Testing of this patch-series was done with the v6.11-rc3 mainline, without
> and with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket 56 cores per socket, 4 IAA devices per socket.
>
> The system has 503 GiB RAM, 176 GiB swap/ZSWAP with ZRAM as the backing
> swap device. Core frequency was fixed at 2500MHz.

I don't think this is a reasonable test configuration: there's no
benefit to using ZSWAP+ZRAM.  We should use a normal SSD as the backing
swap device.

> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 40G. Following a similar methodology as in Ryan Roberts'
> "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> run, each allocating and writing 1G of memory:
>
>     usemem --init-time -w -O -n 70 1g
>
> Other kernel configuration parameters:
>
>     ZSWAP Compressor  : LZ4, DEFLATE-IAA
>     ZSWAP Allocator   : ZSMALLOC
>     ZRAM Compressor   : LZO-RLE
>     SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the crc-s
> returned by the hardware will be compared and errors reported in case of
> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
>
> Throughput reported by usemem and perf sys time for running the test
> are as follows:
>
>  64KB mTHP:
>  ==========
>   ------------------------------------------------------------------
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
>  |                    |                   |       KB/s |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | ZRAM lzo-rle      |    118,928 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |     82,665 |       -30% |

Because the test configuration isn't reasonable, the performance drop
isn't meaningful either.  We should compare zswap+SSD w/o mTHP zswap
against zswap+SSD w/ mTHP zswap.  I think there should be a
performance improvement in that case.

>  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    176,210 |        48% |
>  |------------------------------------------------------------------|
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
>  |                    |                   |        sec |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | ZRAM lzo-rle      |   1,032.20 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |   1,854.51 |       -80% |
>  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     582.71 |        44% |
>   ------------------------------------------------------------------
>
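
For reference, the "Improvement" column above appears to be the relative
change against the baseline row, with the sign flipped for sys time
(lower is better). The sketch below is my reading of the table, not
something stated in the cover letter; the function names are
hypothetical. It reproduces the rounded percentages for the 64KB rows.

```python
def throughput_improvement(baseline, value):
    """Percent change where a higher value is better (KB/s)."""
    return round(100 * (value - baseline) / baseline)


def sys_time_improvement(baseline, value):
    """Percent change where a lower value is better (seconds)."""
    return round(100 * (baseline - value) / baseline)


if __name__ == "__main__":
    # 64KB mTHP rows from the table above.
    base_tput, base_sys = 118_928, 1_032.20
    rows = [
        ("ZSWAP lz4", 82_665, 1_854.51),
        ("ZSWAP deflate-iaa", 176_210, 582.71),
    ]
    for name, tput, sys_time in rows:
        print(name,
              f"throughput {throughput_improvement(base_tput, tput)}%",
              f"sys time {sys_time_improvement(base_sys, sys_time)}%")
```
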
>   -----------------------------------------------------------------------
>  | VMSTATS, mTHP ZSWAP stats,   |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
>  | mTHP ZRAM stats:             |   mainline |       Store |       Store |
>  |                              |            |         lz4 | deflate-iaa |
>  |-----------------------------------------------------------------------|
>  | pswpin                       |         16 |           0 |           0 |
>  | pswpout                      |  7,770,720 |           0 |           0 |
>  | zswpin                       |        547 |         695 |         579 |
>  | zswpout                      |      1,394 |  15,462,778 |   7,284,554 |
>  |-----------------------------------------------------------------------|
>  | thp_swpout                   |          0 |           0 |           0 |
>  | thp_swpout_fallback          |          0 |           0 |           0 |
>  | pgmajfault                   |      3,786 |       3,541 |       3,367 |
>  |-----------------------------------------------------------------------|
>  | hugepages-64kB/stats/zswpout |            |     966,328 |     455,196 |
>  |-----------------------------------------------------------------------|
>  | hugepages-64kB/stats/swpout  |    485,670 |           0 |           0 |
>   -----------------------------------------------------------------------
>
>
>  2MB PMD-THP/2048K mTHP:
>  =======================
>   ------------------------------------------------------------------
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
>  |                    |                   |       KB/s |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | ZRAM lzo-rle      |    177,340 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |     84,030 |       -53% |
>  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    185,691 |         5% |
>  |------------------------------------------------------------------|
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
>  |                    |                   |        sec |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | ZRAM lzo-rle      |     876.29 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |   1,740.55 |       -99% |
>  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     650.33 |        26% |
>   ------------------------------------------------------------------
>
>   ------------------------------------------------------------------------- 
>  | VMSTATS, mTHP ZSWAP stats,     |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
>  | mTHP ZRAM stats:               |   mainline |       Store |       Store |
>  |                                |            |         lz4 | deflate-iaa |
>  |-------------------------------------------------------------------------|
>  | pswpin                         |          0 |           0 |           0 |
>  | pswpout                        |  8,628,224 |           0 |           0 |
>  | zswpin                         |        678 |      22,733 |       1,641 |
>  | zswpout                        |      1,481 |  14,828,597 |   9,404,937 |
>  |-------------------------------------------------------------------------|
>  | thp_swpout                     |     16,852 |           0 |           0 |
>  | thp_swpout_fallback            |          0 |           0 |           0 |
>  | pgmajfault                     |      3,467 |      25,550 |       4,800 |
>  |-------------------------------------------------------------------------|
>  | hugepages-2048kB/stats/zswpout |            |      28,924 |      18,366 |
>  |-------------------------------------------------------------------------|
>  | hugepages-2048kB/stats/swpout  |     16,852 |           0 |           0 |
>   -------------------------------------------------------------------------
>
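
A quick consistency check on the counters in the two stats tables: each
per-order count multiplied by the number of base pages per folio should
be close to the page-level counter. The sketch below assumes 4 KiB base
pages (not stated in the cover letter); the helper name is hypothetical.
The small remainder in the zswap columns is presumably stores that were
not folios of that order, but that interpretation is a guess.

```python
BASE_PAGE_KB = 4  # assumption: 4 KiB base pages on this system


def pages_from_folios(folio_count, folio_kb):
    """Convert a per-order folio count into base pages."""
    return folio_count * (folio_kb // BASE_PAGE_KB)


# (label, folio size in kB, per-order count, page-level counter)
ROWS = [
    ("64K  mainline swpout/pswpout", 64, 485_670, 7_770_720),
    ("64K  lz4 zswpout", 64, 966_328, 15_462_778),
    ("64K  deflate-iaa zswpout", 64, 455_196, 7_284_554),
    ("2M   mainline swpout/pswpout", 2048, 16_852, 8_628_224),
    ("2M   lz4 zswpout", 2048, 28_924, 14_828_597),
    ("2M   deflate-iaa zswpout", 2048, 18_366, 9_404_937),
]

if __name__ == "__main__":
    for label, folio_kb, folios, pages in ROWS:
        expected = pages_from_folios(folios, folio_kb)
        # Remainder: pages in the page-level counter that the
        # per-order counter does not account for.
        print(f"{label:<30} folio pages={expected:>10,} "
              f"counter={pages:>10,} remainder={pages - expected:,}")
```
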
> As expected, in the "Before" experiment, there are relatively fewer
> swapouts because ZRAM utilization is not accounted in the cgroup.
>
> With the introduction of zswap_store mTHP, the "After" data reflects the
> higher swapout activity, and consequent throughput/sys time degradation
> when LZ4 is used as the zswap compressor. However, we observe considerable
> throughput and sys time improvement in the "After" data when DEFLATE-IAA
> is the zswap compressor. This observation holds for 64K mTHP and 2MB THP
> experiments. The fewer swap-outs and major page-faults can be attributed
> to IAA's higher compression ratio and better compress latency, and they
> in turn result in better throughput and sys time.
>
> Our goal is to improve ZSWAP mTHP store performance using batching. With
> Intel IAA compress/decompress batching used in ZSWAP (to be submitted as
> additional RFC series), we are able to demonstrate significant
> performance improvements and memory savings with IAA as compared to
> software compressors.
>
> cgroup zswap kselftest:
> =======================
>
> "Before":
> =========
>   Test run with v6.11-rc3 and no code changes:
>     mTHP 64K set to 'always'
>     zswap compressor set to 'lz4'
>     page-cluster = 3
>
>   zswap shrinker_enabled = N:
>   ---------------------------
>   ok 1 test_zswap_usage
>   ok 2 test_swapin_nozswap
>   # at least 24MB should be brought back from zswap
>   not ok 3 test_zswapin
>   # zswpwb_after is 0 while wb is enabled
>   not ok 4 test_zswap_writeback_enabled
>   # Failed to reclaim all of the requested memory
>   not ok 5 test_zswap_writeback_disabled
>   ok 6 # SKIP test_no_kmem_bypass
>   ok 7 test_no_invasive_cgroup_shrink
>
>   zswap shrinker_enabled = Y:
>   ---------------------------
>   ok 1 test_zswap_usage
>   ok 2 test_swapin_nozswap
>   # at least 24MB should be brought back from zswap
>   not ok 3 test_zswapin
>   # zswpwb_after is 0 while wb is enabled
>   not ok 4 test_zswap_writeback_enabled
>   # Failed to reclaim all of the requested memory
>   not ok 5 test_zswap_writeback_disabled
>   ok 6 # SKIP test_no_kmem_bypass
>   not ok 7 test_no_invasive_cgroup_shrink
>
> "After":
> ========
>   Test run with this patch-series and v6.11-rc3:
>     mTHP 64K set to 'always'
>     zswap compressor set to 'deflate-iaa'
>     page-cluster = 3
>
>   zswap shrinker_enabled = N:
>   ---------------------------
>   ok 1 test_zswap_usage
>   ok 2 test_swapin_nozswap
>   ok 3 test_zswapin
>   ok 4 test_zswap_writeback_enabled
>   ok 5 test_zswap_writeback_disabled
>   ok 6 # SKIP test_no_kmem_bypass
>   ok 7 test_no_invasive_cgroup_shrink
>   
>   zswap shrinker_enabled = Y:
>   ---------------------------
>   ok 1 test_zswap_usage
>   ok 2 test_swapin_nozswap
>   # at least 24MB should be brought back from zswap
>   not ok 3 test_zswapin
>   ok 4 test_zswap_writeback_enabled
>   ok 5 test_zswap_writeback_disabled
>   ok 6 # SKIP test_no_kmem_bypass
>   not ok 7 test_no_invasive_cgroup_shrink
>
> I haven't taken an in-depth look into the cgroup zswap tests, but it
> looks like the results with the patch-series are no worse than without,
> and in some cases better (not exactly sure why, this needs more
> analysis).
>
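
To make the before/after comparison less eyeball-driven, the kselftest
output can be diffed mechanically. This is a rough sketch, not part of
the selftest tooling: the regex and function names are hypothetical, and
it only understands the "ok N name" / "not ok N name" lines shown above.
It uses a search rather than a line-anchored match so that a result
fused onto the end of a diagnostic line is still picked up.

```python
import re

# "ok N name" or "not ok N name", with an optional "# SKIP" marker.
RESULT_RE = re.compile(r"(not ok|ok) (\d+) (?:# SKIP )?(\S+)")


def parse_results(text):
    """Return {test_name: True if ok (or skipped), False if not ok}."""
    results = {}
    for line in text.splitlines():
        m = RESULT_RE.search(line)
        if m:
            results[m.group(3)] = (m.group(1) == "ok")
    return results


def compare(before, after):
    """Return (fixed, regressed) sets for tests present in both runs."""
    common = before.keys() & after.keys()
    fixed = {t for t in common if after[t] and not before[t]}
    regressed = {t for t in common if before[t] and not after[t]}
    return fixed, regressed


# shrinker_enabled = N results, copied from the runs quoted above.
BEFORE_N = """\
ok 1 test_zswap_usage
ok 2 test_swapin_nozswap
# at least 24MB should be brought back from zswap
not ok 3 test_zswapin
# zswpwb_after is 0 while wb is enablednot ok 4 test_zswap_writeback_enabled
# Failed to reclaim all of the requested memory
not ok 5 test_zswap_writeback_disabled
ok 6 # SKIP test_no_kmem_bypass
ok 7 test_no_invasive_cgroup_shrink
"""

AFTER_N = """\
ok 1 test_zswap_usage
ok 2 test_swapin_nozswap
ok 3 test_zswapin
ok 4 test_zswap_writeback_enabled
ok 5 test_zswap_writeback_disabled
ok 6 # SKIP test_no_kmem_bypass
ok 7 test_no_invasive_cgroup_shrink
"""

if __name__ == "__main__":
    fixed, regressed = compare(parse_results(BEFORE_N), parse_results(AFTER_N))
    print("fixed:", sorted(fixed))
    print("regressed:", sorted(regressed))
```

Note that the two runs also differ in compressor (lz4 vs deflate-iaa),
so a change in a test result cannot be attributed to the patch-series
alone.
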
> I would greatly appreciate your code review comments and suggestions!
>
> Thanks,
> Kanchana
>
> [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@xxxxxxx/
>
>
> Kanchana P Sridhar (4):
>   mm: zswap: zswap_is_folio_same_filled() takes an index in the folio.
>   mm: zswap: zswap_store() extended to handle mTHP folios.
>   mm: Add MTHP_STAT_ZSWPOUT to sysfs per-order mthp stats.
>   mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats.
>
>  include/linux/huge_mm.h |   1 +
>  mm/huge_memory.c        |   2 +
>  mm/page_io.c            |   7 ++
>  mm/zswap.c              | 238 +++++++++++++++++++++++++++++-----------
>  4 files changed, 184 insertions(+), 64 deletions(-)

--
Best Regards,
Huang, Ying



