On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
<kanchana.p.sridhar@xxxxxxxxx> wrote:
>
>
> IAA Compression Batching:
> =========================
>
> This RFC patch-series introduces the use of the Intel Analytics Accelerator
> (IAA) for parallel compression of pages in a folio, and for batched reclaim
> of hybrid, any-order batches of folios in shrink_folio_list().
>
> The patch-series is organized as follows:
>
> 1) iaa_crypto driver enablers for batching. Relevant patches are tagged
>    with "crypto:" in the subject:
>
>    a) Async poll crypto_acomp interface without interrupts.
>    b) crypto testmgr acomp poll support.
>    c) Changing the default sync_mode to "async" and disabling
>       verify_compress by default, to make it easy for users to run IAA
>       for comparison with software compressors.
>    d) Changing the cpu-to-iaa mappings to more evenly balance cores
>       across IAA devices.
>    e) Addition of a "global_wq" per IAA, which can be used as a global
>       resource for the socket. If the user configures 2 WQs per IAA
>       device, the driver will distribute compress jobs from all cores on
>       the socket to the "global_wqs" of all the IAA devices on that
>       socket, in a round-robin manner. This can be used to improve
>       compression throughput for workloads that see a lot of swapout
>       activity.
>
> 2) Migrating zswap to use async poll in zswap_compress()/decompress().
> 3) A centralized batch compression API that can be used by swap modules.
> 4) IAA compress batching within large folio zswap stores.
> 5) IAA compress batching of any-order hybrid folios in
>    shrink_folio_list(). The newly added "vm.compress-batchsize" sysctl
>    can be used to configure the number of folios, in [1, 32], to be
>    reclaimed using compress batching.

I am still digesting this series, but I have some high-level questions
that I left on some of the patches. My intuition, though, is that we
should drop (5) from the initial proposal, as it is the most
controversial part. Batching reclaim of unrelated folios through zswap
*might* make sense, but it needs a broader conversation, and it needs
justification on its own merit, without the rest of the series.

>
> IAA compress batching can be enabled only on platforms that have IAA, by
> setting this config variable:
>
> CONFIG_ZSWAP_STORE_BATCHING_ENABLED="y"
>
> The performance testing data with usemem 30 instances shows throughput
> gains of up to 40%, elapsed time reduction of up to 22%, and sys time
> reduction of up to 30% with IAA compression batching.
>
> Our internal validation of IAA compress/decompress batching in highly
> contended Sapphire Rapids server setups, with workloads running on 72
> cores for ~25 minutes under stringent memory limit constraints, has shown
> up to 50% reduction in sys time and 3.5% reduction in workload run time
> as compared to software compressors.
>
>
> System setup for testing:
> =========================
> Testing of this patch-series was done with mm-unstable as of 10-16-2024,
> commit 817952b8be34, without and with this patch-series.
> Data was gathered on an Intel Sapphire Rapids server: dual-socket, 56
> cores per socket, 4 IAA devices per socket, 503 GiB RAM, and a 525G SSD
> disk partition for swap. Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 150G. There is no swap limit set for the cgroup.
> 30 usemem processes were run, each allocating and writing 10G of memory,
> and sleeping for 10 sec before exiting:
>
>   usemem --init-time -w -O -s 10 -n 30 10g
>
> Other kernel configuration parameters:
>
>   zswap compressor : deflate-iaa
>   zswap allocator  : zsmalloc
>   vm.page-cluster  : 2, 4
>
> IAA "compression verification" is disabled and the async poll acomp
> interface is used in the iaa_crypto driver (the defaults with this
> series).
>
>
> Performance testing (usemem30):
> ===============================
>
> 4K folios: deflate-iaa:
> =======================
>
> -------------------------------------------------------------------------------
>                            mm-unstable-    shrink_folio_list()  shrink_folio_list()
>                            10-16-2024      batching of folios   batching of folios
> -------------------------------------------------------------------------------
> zswap compressor           deflate-iaa     deflate-iaa          deflate-iaa
> vm.compress-batchsize      n/a             1                    32
> vm.page-cluster            2               2                    2
> -------------------------------------------------------------------------------
> Total throughput (KB/s)    4,470,466       5,770,824            6,363,045
> Average throughput (KB/s)  149,015         192,360              212,101
> elapsed time (sec)         119.24          100.96               92.99
> sys time (sec)             2,819.29        2,168.08             1,970.79
> -------------------------------------------------------------------------------
> memcg_high                 668,185         646,357              613,421
> memcg_swap_fail            0               0                    0
> zswpout                    62,991,796      58,275,673           53,070,201
> zswpin                     431             415                  396
> pswpout                    0               0                    0
> pswpin                     0               0                    0
> thp_swpout                 0               0                    0
> thp_swpout_fallback        0               0                    0
> pgmajfault                 3,137           3,085                3,440
> swap_ra                    99              100                  95
> swap_ra_hit                42              44                   45
> -------------------------------------------------------------------------------
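A question that spans patches 1, 2 and 4: it would help reviewers if the
intended calling convention of the new poll() op were spelled out in the
cover letter. For what it's worth, the sketch below is how I currently
read the submit-then-poll flow for a single page; crypto_acomp_poll() and
its -EAGAIN return convention are my guesses based on the cover letter,
not necessarily what the patches implement:

#include <crypto/acompress.h>
#include <linux/scatterlist.h>
#include <linux/mm.h>

/*
 * Sketch only: submit a compress request, then busy-poll it to completion
 * instead of sleeping on an interrupt-driven callback.
 */
static int compress_page_polled(struct acomp_req *req, struct page *page,
				void *dst, unsigned int *dlen)
{
	struct scatterlist input, output;
	int err;

	sg_init_table(&input, 1);
	sg_set_page(&input, page, PAGE_SIZE, 0);
	sg_init_one(&output, dst, PAGE_SIZE);
	acomp_request_set_params(req, &input, &output, PAGE_SIZE, PAGE_SIZE);

	err = crypto_acomp_compress(req);
	if (err == -EINPROGRESS) {
		/* In flight on the IAA; poll instead of waiting for an irq. */
		do {
			err = crypto_acomp_poll(req);	/* assumed helper name */
		} while (err == -EAGAIN);
	}
	if (!err)
		*dlen = req->dlen;
	return err;
}

If that is roughly right, it would be good to state explicitly what
poll() is allowed to return, and that the caller must not reuse the
acomp_req until polling reports completion.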
> 16k/32k/64k folios: deflate-iaa:
> ================================
> All three large folio sizes (16k, 32k and 64k) were set to "always".
>
> -------------------------------------------------------------------------------
>                      mm-unstable-  zswap_store()   shrink_folio_list()
>                      10-16-2024    batching of     batching of folios
>                                    pages in
>                                    large folios
> -------------------------------------------------------------------------------
> zswap compr          deflate-iaa   deflate-iaa     deflate-iaa
> vm.compress-         n/a           n/a             4           8           16
>   batchsize
> vm.page-cluster      2             2               2           2           2
> -------------------------------------------------------------------------------
> Total throughput     7,182,198     8,448,994       8,584,728   8,729,643   8,775,944
>   (KB/s)
> Avg throughput       239,406       281,633         286,157     290,988     292,531
>   (KB/s)
> elapsed time (sec)   85.04         77.84           77.03       75.18       74.98
> sys time (sec)       1,730.77      1,527.40        1,528.52    1,473.76    1,465.97
> -------------------------------------------------------------------------------
> memcg_high           648,125       694,188         696,004     699,728     724,887
> memcg_swap_fail      1,550         2,540           1,627       1,577       1,517
> zswpout              57,606,876    56,624,450      56,125,082  55,999,42   57,352,204
> zswpin               421           406             422         400         437
> pswpout              0             0               0           0           0
> pswpin               0             0               0           0           0
> thp_swpout           0             0               0           0           0
> thp_swpout_fallback  0             0               0           0           0
> 16kB-mthp_swpout_    0             0               0           0           0
>   fallback
> 32kB-mthp_swpout_    0             0               0           0           0
>   fallback
> 64kB-mthp_swpout_    1,550         2,539           1,627       1,577       1,517
>   fallback
> pgmajfault           3,102         3,126           3,473       3,454       3,134
> swap_ra              107           144             109         124         181
> swap_ra_hit          51            88              45          66          107
> ZSWPOUT-16kB         2             3               4           4           3
> ZSWPOUT-32kB         0             2               1           1           0
> ZSWPOUT-64kB         3,598,889     3,536,556       3,506,134   3,498,324   3,582,921
> SWPOUT-16kB          0             0               0           0           0
> SWPOUT-32kB          0             0               0           0           0
> SWPOUT-64kB          0             0               0           0           0
> -------------------------------------------------------------------------------
>
>
> 2M folios: deflate-iaa:
> =======================
>
> -------------------------------------------------------------------------------
>                            mm-unstable-10-16-2024  zswap_store() batching of pages
>                                                    in pmd-mappable folios
> -------------------------------------------------------------------------------
> zswap compressor           deflate-iaa             deflate-iaa
> vm.compress-batchsize      n/a                     n/a
> vm.page-cluster            2                       2
> -------------------------------------------------------------------------------
> Total throughput (KB/s)    7,444,592               8,916,349
> Average throughput (KB/s)  248,153                 297,211
> elapsed time (sec)         86.29                   73.44
> sys time (sec)             1,833.21                1,418.58
> -------------------------------------------------------------------------------
> memcg_high                 81,786                  89,905
> memcg_swap_fail            82                      395
> zswpout                    58,874,092              57,721,884
> zswpin                     422                     458
> pswpout                    0                       0
> pswpin                     0                       0
> thp_swpout                 0                       0
> thp_swpout_fallback        82                      394
> pgmajfault                 14,864                  21,544
> swap_ra                    34,953                  53,751
> swap_ra_hit                34,895                  53,660
> ZSWPOUT-2048kB             114,815                 112,269
> SWPOUT-2048kB              0                       0
> -------------------------------------------------------------------------------
>
> Since 4K folios account for ~0.4% of all zswapouts when pmd-mappable folios
> are enabled for usemem30, we cannot expect much improvement from reclaim
> batching.
>
>
> Performance testing (Kernel compilation):
> =========================================
>
> As mentioned earlier, for workloads that see a lot of swapout activity, we
> can benefit from configuring 2 WQs per IAA device, so that compress jobs
> from all same-socket cores are distributed to the wq.1 of all IAAs on the
> socket via the "global_wq" support developed in this patch-series.
>
> Although this data includes IAA decompress batching, which will be
> submitted as a separate RFC patch-series, I am listing it here to quantify
> the benefit of distributing compress jobs among all IAAs.
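On the wq.1 "global_wq" distribution described above: my mental model,
just to confirm I understand the intent, is roughly the sketch below. All
structure and function names here are invented for illustration and are
not the iaa_crypto code; decompress jobs would keep using the local IAA's
wq.0:

#include <linux/atomic.h>

struct idxd_wq;

/*
 * Illustration only: per-socket round-robin selection of a compress wq
 * among the wq.1 of every IAA device on the socket.
 */
struct iaa_socket_wqs {
	struct idxd_wq **global_wqs;	/* wq.1 of each IAA device on the socket */
	unsigned int nr_wqs;
	atomic_t cursor;		/* shared round-robin position */
};

static struct idxd_wq *pick_compress_wq(struct iaa_socket_wqs *s)
{
	unsigned int n = (unsigned int)atomic_inc_return(&s->cursor);

	/* Spread compress jobs from every core across all IAAs on the socket. */
	return s->global_wqs[n % s->nr_wqs];
}

If that is the idea, it would also be consistent with the wq stats
observation below, where wq.1 of every device ends up with nearly the
same number of compress calls.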
> The kernel compilation test with "allmodconfig" is able to quantify this
> well:
>
>
> 4K folios: deflate-iaa: kernel compilation to quantify crypto patches
> =====================================================================
>
> ------------------------------------------------------------------------------
>                          IAA shrink_folio_list() compress batching and
>                          swapin_readahead() decompress batching
>
>                          1WQ                  2WQ (distribute compress jobs)
>
>                          1 local WQ (wq.0)    1 local WQ (wq.0) +
>                          per IAA              1 global WQ (wq.1) per IAA
> ------------------------------------------------------------------------------
> zswap compressor         deflate-iaa          deflate-iaa
> vm.compress-batchsize    32                   32
> vm.page-cluster          4                    4
> ------------------------------------------------------------------------------
> real_sec                 746.77               745.42
> user_sec                 15,732.66            15,738.85
> sys_sec                  5,384.14             5,247.86
> Max_Res_Set_Size_KB      1,874,432            1,872,640
> ------------------------------------------------------------------------------
> zswpout                  101,648,460          104,882,982
> zswpin                   27,418,319           29,428,515
> pswpout                  213                  22
> pswpin                   207                  6
> pgmajfault               21,896,616           23,629,768
> swap_ra                  6,054,409            6,385,080
> swap_ra_hit              3,791,628            3,985,141
> ------------------------------------------------------------------------------
>
> The iaa_crypto wq stats will show almost the same number of compress calls
> for wq.1 of all IAA devices. wq.0 will handle decompress calls exclusively.
> We see a latency reduction of 2.5% by distributing compress jobs among all
> IAA devices on the socket.
>
> I would greatly appreciate code review comments for the iaa_crypto driver
> and mm patches included in this series!
>
> Thanks,
> Kanchana
>
>
> Kanchana P Sridhar (13):
>   crypto: acomp - Add a poll() operation to acomp_alg and acomp_req
>   crypto: iaa - Add support for irq-less crypto async interface
>   crypto: testmgr - Add crypto testmgr acomp poll support.
>   mm: zswap: zswap_compress()/decompress() can submit, then poll an
>     acomp_req.
>   crypto: iaa - Make async mode the default.
>   crypto: iaa - Disable iaa_verify_compress by default.
>   crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to
>     IAAs.
>   crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA
>     node.
>   mm: zswap: Config variable to enable compress batching in
>     zswap_store().
>   mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if
>     platform has IAA.
>   mm: swap: Add IAA batch compression API
>     swap_crypto_acomp_compress_batch().
>   mm: zswap: Compress batching with Intel IAA in zswap_store() of large
>     folios.
>   mm: vmscan, swap, zswap: Compress batching of folios in
>     shrink_folio_list().
>
>  crypto/acompress.c                         |    1 +
>  crypto/testmgr.c                           |   70 +-
>  drivers/crypto/intel/iaa/iaa_crypto_main.c |  467 +++++++++++--
>  include/crypto/acompress.h                 |   18 +
>  include/crypto/internal/acompress.h        |    1 +
>  include/linux/fs.h                         |    2 +
>  include/linux/mm.h                         |    8 +
>  include/linux/writeback.h                  |    5 +
>  include/linux/zswap.h                      |  106 +++
>  kernel/sysctl.c                            |    9 +
>  mm/Kconfig                                 |   12 +
>  mm/page_io.c                               |  152 +++-
>  mm/swap.c                                  |   15 +
>  mm/swap.h                                  |   96 +++
>  mm/swap_state.c                            |  115 +++
>  mm/vmscan.c                                |  154 +++-
>  mm/zswap.c                                 |  771 +++++++++++++++++++--
>  17 files changed, 1870 insertions(+), 132 deletions(-)
>
>
> base-commit: 817952b8be34aad40e07f6832fb9d1fc08961550
> --
> 2.27.0
>
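One last note, on the batching API in (3) and (4): I picture the batch
path as "submit everything, then poll everything", roughly as sketched
below. The function signature, the 32-entry bound (taken from the
vm.compress-batchsize range above) and crypto_acomp_poll() are my
assumptions, not the actual swap_crypto_acomp_compress_batch() interface:

#include <crypto/acompress.h>
#include <linux/kernel.h>

/*
 * Sketch of a submit-all-then-poll-all batch compress.  Everything here is
 * illustrative; buffer setup and per-request error handling are elided.
 */
static int compress_batch_polled(struct acomp_req *reqs[], int nr)
{
	int errs[32];	/* 32 == upper bound of vm.compress-batchsize above */
	int i, pending = 0;

	if (nr > ARRAY_SIZE(errs))
		return -EINVAL;

	/* Stage 1: submit every request without waiting for completions. */
	for (i = 0; i < nr; i++) {
		errs[i] = crypto_acomp_compress(reqs[i]);
		if (errs[i] == -EINPROGRESS)
			pending++;
	}

	/* Stage 2: busy-poll the in-flight requests until all have completed. */
	while (pending) {
		for (i = 0; i < nr; i++) {
			int err;

			if (errs[i] != -EINPROGRESS)
				continue;
			err = crypto_acomp_poll(reqs[i]);	/* assumed helper */
			if (err != -EAGAIN) {
				errs[i] = err;
				pending--;
			}
		}
	}

	/* Report the first failure, if any. */
	for (i = 0; i < nr; i++)
		if (errs[i])
			return errs[i];
	return 0;
}

If that matches the intent, it makes it easy to see where the parallelism
across the IAA compression engines comes from.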