On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
<kanchana.p.sridhar@xxxxxxxxx> wrote:
>
>
> IAA Compression Batching:
> =========================
>
> This RFC patch-series introduces the use of the Intel Analytics Accelerator
> (IAA) for parallel compression of pages in a folio, and for batched reclaim
> of hybrid, any-order batches of folios in shrink_folio_list().
>
> The patch-series is organized as follows:
>
> 1) iaa_crypto driver enablers for batching. Relevant patches are tagged
>    with "crypto:" in the subject:
>
>    a) Async poll crypto_acomp interface without interrupts.
>    b) crypto testmgr acomp poll support.
>    c) Changing the default sync_mode to "async" and disabling
>       verify_compress by default, to make it easy for users to run IAA
>       for comparison with software compressors.
>    d) Changing the cpu-to-iaa mappings to more evenly balance cores
>       across IAA devices.
>    e) Addition of a "global_wq" per IAA, which can be used as a global
>       resource for the socket. If the user configures 2 WQs per IAA
>       device, the driver will distribute compress jobs from all cores on
>       the socket to the "global_wqs" of all the IAA devices on that
>       socket, in a round-robin manner. This can be used to improve
>       compression throughput for workloads that see a lot of swapout
>       activity.
>
> 2) Migrating zswap to use async poll in zswap_compress()/decompress().
> 3) A centralized batch compression API that can be used by swap modules.
> 4) IAA compress batching within large folio zswap stores.
> 5) IAA compress batching of any-order hybrid folios in
>    shrink_folio_list(). The newly added "vm.compress-batchsize" sysctl
>    can be used to configure the number of folios, in [1, 32], to be
>    reclaimed using compress batching.

I am still digesting this series, but I have some high-level questions
that I left on some of the patches. My intuition, though, is that we
should drop (5) from the initial proposal, as it is the most
controversial part. Batching reclaim of unrelated folios through zswap
*might* make sense, but it needs a broader conversation, and it needs
justification on its own merit, without the rest of the series.

>
> IAA compress batching can be enabled only on platforms that have IAA, by
> setting this config variable:
>
> CONFIG_ZSWAP_STORE_BATCHING_ENABLED="y"
>
> The performance testing data with usemem 30 instances shows throughput
> gains of up to 40%, elapsed time reduction of up to 22%, and sys time
> reduction of up to 30% with IAA compression batching.
>
> Our internal validation of IAA compress/decompress batching in highly
> contended Sapphire Rapids server setups, with workloads running on 72
> cores for ~25 minutes under stringent memory limit constraints, has shown
> up to 50% reduction in sys time and 3.5% reduction in workload run time
> as compared to software compressors.
>
>
> System setup for testing:
> =========================
> Testing of this patch-series was done with mm-unstable as of 10-16-2024,
> commit 817952b8be34, without and with this patch-series.
> Data was gathered on an Intel Sapphire Rapids server: dual-socket, 56
> cores per socket, 4 IAA devices per socket, 503 GiB RAM, and a 525G SSD
> disk partition for swap. Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 150G. There is no swap limit set for the cgroup.
> 30 usemem processes were run, each allocating and writing 10G of memory,
> and sleeping for 10 sec before exiting:
>
>   usemem --init-time -w -O -s 10 -n 30 10g
>
> Other kernel configuration parameters:
>
>   zswap compressor : deflate-iaa
>   zswap allocator  : zsmalloc
>   vm.page-cluster  : 2, 4
>
> IAA "compression verification" is disabled and the async poll acomp
> interface is used in the iaa_crypto driver (the defaults with this
> series).
>
>
> Performance testing (usemem30):
> ===============================
>
> 4K folios: deflate-iaa:
> =======================
>
> -------------------------------------------------------------------------------
>                            mm-unstable-    shrink_folio_list()  shrink_folio_list()
>                            10-16-2024      batching of folios   batching of folios
> -------------------------------------------------------------------------------
> zswap compressor           deflate-iaa     deflate-iaa          deflate-iaa
> vm.compress-batchsize      n/a             1                    32
> vm.page-cluster            2               2                    2
> -------------------------------------------------------------------------------
> Total throughput (KB/s)    4,470,466       5,770,824            6,363,045
> Average throughput (KB/s)  149,015         192,360              212,101
> elapsed time (sec)         119.24          100.96               92.99
> sys time (sec)             2,819.29        2,168.08             1,970.79
> -------------------------------------------------------------------------------
> memcg_high                 668,185         646,357              613,421
> memcg_swap_fail            0               0                    0
> zswpout                    62,991,796      58,275,673           53,070,201
> zswpin                     431             415                  396
> pswpout                    0               0                    0
> pswpin                     0               0                    0
> thp_swpout                 0               0                    0
> thp_swpout_fallback        0               0                    0
> pgmajfault                 3,137           3,085                3,440
> swap_ra                    99              100                  95
> swap_ra_hit                42              44                   45
> -------------------------------------------------------------------------------
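A question that spans patches 1, 2 and 4: it would help reviewers if the
intended calling convention of the new poll() op were spelled out in the
cover letter. For what it's worth, the sketch below is how I currently
read the submit-then-poll flow for a single page; crypto_acomp_poll() and
its -EAGAIN return convention are my guesses based on the cover letter,
not necessarily what the patches implement:

#include <crypto/acompress.h>
#include <linux/scatterlist.h>
#include <linux/mm.h>

/*
 * Sketch only: submit a compress request, then busy-poll it to completion
 * instead of sleeping on an interrupt-driven callback.
 */
static int compress_page_polled(struct acomp_req *req, struct page *page,
				void *dst, unsigned int *dlen)
{
	struct scatterlist input, output;
	int err;

	sg_init_table(&input, 1);
	sg_set_page(&input, page, PAGE_SIZE, 0);
	sg_init_one(&output, dst, PAGE_SIZE);
	acomp_request_set_params(req, &input, &output, PAGE_SIZE, PAGE_SIZE);

	err = crypto_acomp_compress(req);
	if (err == -EINPROGRESS) {
		/* In flight on the IAA; poll instead of waiting for an irq. */
		do {
			err = crypto_acomp_poll(req);	/* assumed helper name */
		} while (err == -EAGAIN);
	}
	if (!err)
		*dlen = req->dlen;
	return err;
}

If that is roughly right, it would be good to state explicitly what
poll() is allowed to return, and that the caller must not reuse the
acomp_req until polling reports completion.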
> 16k/32k/64k folios: deflate-iaa:
> ================================
> All three large folio sizes (16k, 32k and 64k) were set to "always".
>
> -------------------------------------------------------------------------------
>                      mm-unstable-  zswap_store()   shrink_folio_list()
>                      10-16-2024    batching of     batching of folios
>                                    pages in
>                                    large folios
> -------------------------------------------------------------------------------
> zswap compr          deflate-iaa   deflate-iaa     deflate-iaa
> vm.compress-         n/a           n/a             4           8           16
>   batchsize
> vm.page-cluster      2             2               2           2           2
> -------------------------------------------------------------------------------
> Total throughput     7,182,198     8,448,994       8,584,728   8,729,643   8,775,944
>   (KB/s)
> Avg throughput       239,406       281,633         286,157     290,988     292,531
>   (KB/s)
> elapsed time (sec)   85.04         77.84           77.03       75.18       74.98
> sys time (sec)       1,730.77      1,527.40        1,528.52    1,473.76    1,465.97
> -------------------------------------------------------------------------------
> memcg_high           648,125       694,188         696,004     699,728     724,887
> memcg_swap_fail      1,550         2,540           1,627       1,577       1,517
> zswpout              57,606,876    56,624,450      56,125,082  55,999,42   57,352,204
> zswpin               421           406             422         400         437
> pswpout              0             0               0           0           0
> pswpin               0             0               0           0           0
> thp_swpout           0             0               0           0           0
> thp_swpout_fallback  0             0               0           0           0
> 16kB-mthp_swpout_    0             0               0           0           0
>   fallback
> 32kB-mthp_swpout_    0             0               0           0           0
>   fallback
> 64kB-mthp_swpout_    1,550         2,539           1,627       1,577       1,517
>   fallback
> pgmajfault           3,102         3,126           3,473       3,454       3,134
> swap_ra              107           144             109         124         181
> swap_ra_hit          51            88              45          66          107
> ZSWPOUT-16kB         2             3               4           4           3
> ZSWPOUT-32kB         0             2               1           1           0
> ZSWPOUT-64kB         3,598,889     3,536,556       3,506,134   3,498,324   3,582,921
> SWPOUT-16kB          0             0               0           0           0
> SWPOUT-32kB          0             0               0           0           0
> SWPOUT-64kB          0             0               0           0           0
> -------------------------------------------------------------------------------
>
>
> 2M folios: deflate-iaa:
> =======================
>
> -------------------------------------------------------------------------------
>                            mm-unstable-10-16-2024  zswap_store() batching of pages
>                                                    in pmd-mappable folios
> -------------------------------------------------------------------------------
> zswap compressor           deflate-iaa             deflate-iaa
> vm.compress-batchsize      n/a                     n/a
> vm.page-cluster            2                       2
> -------------------------------------------------------------------------------
> Total throughput (KB/s)    7,444,592               8,916,349
> Average throughput (KB/s)  248,153                 297,211
> elapsed time (sec)         86.29                   73.44
> sys time (sec)             1,833.21                1,418.58
> -------------------------------------------------------------------------------
> memcg_high                 81,786                  89,905
> memcg_swap_fail            82                      395
> zswpout                    58,874,092              57,721,884
> zswpin                     422                     458
> pswpout                    0                       0
> pswpin                     0                       0
> thp_swpout                 0                       0
> thp_swpout_fallback        82                      394
> pgmajfault                 14,864                  21,544
> swap_ra                    34,953                  53,751
> swap_ra_hit                34,895                  53,660
> ZSWPOUT-2048kB             114,815                 112,269
> SWPOUT-2048kB              0                       0
> -------------------------------------------------------------------------------
>
> Since 4K folios account for ~0.4% of all zswapouts when pmd-mappable folios
> are enabled for usemem30, we cannot expect much improvement from reclaim
> batching.
>
>
> Performance testing (Kernel compilation):
> =========================================
>
> As mentioned earlier, for workloads that see a lot of swapout activity, we
> can benefit from configuring 2 WQs per IAA device, so that compress jobs
> from all same-socket cores are distributed to the wq.1 of all IAAs on the
> socket via the "global_wq" support developed in this patch-series.
>
> Although this data includes IAA decompress batching, which will be
> submitted as a separate RFC patch-series, I am listing it here to quantify
> the benefit of distributing compress jobs among all IAAs.
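On the wq.1 "global_wq" distribution described above: my mental model,
just to confirm I understand the intent, is roughly the sketch below. All
structure and function names here are invented for illustration and are
not the iaa_crypto code; decompress jobs would keep using the local IAA's
wq.0:

#include <linux/atomic.h>

struct idxd_wq;

/*
 * Illustration only: per-socket round-robin selection of a compress wq
 * among the wq.1 of every IAA device on the socket.
 */
struct iaa_socket_wqs {
	struct idxd_wq **global_wqs;	/* wq.1 of each IAA device on the socket */
	unsigned int nr_wqs;
	atomic_t cursor;		/* shared round-robin position */
};

static struct idxd_wq *pick_compress_wq(struct iaa_socket_wqs *s)
{
	unsigned int n = (unsigned int)atomic_inc_return(&s->cursor);

	/* Spread compress jobs from every core across all IAAs on the socket. */
	return s->global_wqs[n % s->nr_wqs];
}

If that is the idea, it would also be consistent with the wq stats
observation below, where wq.1 of every device ends up with nearly the
same number of compress calls.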
> The kernel compilation test with "allmodconfig" is able to quantify this
> well:
>
>
> 4K folios: deflate-iaa: kernel compilation to quantify crypto patches
> =====================================================================
>
> ------------------------------------------------------------------------------
>                          IAA shrink_folio_list() compress batching and
>                          swapin_readahead() decompress batching
>
>                          1WQ                  2WQ (distribute compress jobs)
>
>                          1 local WQ (wq.0)    1 local WQ (wq.0) +
>                          per IAA              1 global WQ (wq.1) per IAA
> ------------------------------------------------------------------------------
> zswap compressor         deflate-iaa          deflate-iaa
> vm.compress-batchsize    32                   32
> vm.page-cluster          4                    4
> ------------------------------------------------------------------------------
> real_sec                 746.77               745.42
> user_sec                 15,732.66            15,738.85
> sys_sec                  5,384.14             5,247.86
> Max_Res_Set_Size_KB      1,874,432            1,872,640
> ------------------------------------------------------------------------------
> zswpout                  101,648,460          104,882,982
> zswpin                   27,418,319           29,428,515
> pswpout                  213                  22
> pswpin                   207                  6
> pgmajfault               21,896,616           23,629,768
> swap_ra                  6,054,409            6,385,080
> swap_ra_hit              3,791,628            3,985,141
> ------------------------------------------------------------------------------
>
> The iaa_crypto wq stats will show almost the same number of compress calls
> for wq.1 of all IAA devices. wq.0 will handle decompress calls exclusively.
> We see a latency reduction of 2.5% by distributing compress jobs among all
> IAA devices on the socket.
>
> I would greatly appreciate code review comments for the iaa_crypto driver
> and mm patches included in this series!
>
> Thanks,
> Kanchana
>
>
> Kanchana P Sridhar (13):
>   crypto: acomp - Add a poll() operation to acomp_alg and acomp_req
>   crypto: iaa - Add support for irq-less crypto async interface
>   crypto: testmgr - Add crypto testmgr acomp poll support.
>   mm: zswap: zswap_compress()/decompress() can submit, then poll an
>     acomp_req.
>   crypto: iaa - Make async mode the default.
>   crypto: iaa - Disable iaa_verify_compress by default.
>   crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to
>     IAAs.
>   crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA
>     node.
>   mm: zswap: Config variable to enable compress batching in
>     zswap_store().
>   mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if
>     platform has IAA.
>   mm: swap: Add IAA batch compression API
>     swap_crypto_acomp_compress_batch().
>   mm: zswap: Compress batching with Intel IAA in zswap_store() of large
>     folios.
>   mm: vmscan, swap, zswap: Compress batching of folios in
>     shrink_folio_list().
>
>  crypto/acompress.c                         |    1 +
>  crypto/testmgr.c                           |   70 +-
>  drivers/crypto/intel/iaa/iaa_crypto_main.c |  467 +++++++++++--
>  include/crypto/acompress.h                 |   18 +
>  include/crypto/internal/acompress.h        |    1 +
>  include/linux/fs.h                         |    2 +
>  include/linux/mm.h                         |    8 +
>  include/linux/writeback.h                  |    5 +
>  include/linux/zswap.h                      |  106 +++
>  kernel/sysctl.c                            |    9 +
>  mm/Kconfig                                 |   12 +
>  mm/page_io.c                               |  152 +++-
>  mm/swap.c                                  |   15 +
>  mm/swap.h                                  |   96 +++
>  mm/swap_state.c                            |  115 +++
>  mm/vmscan.c                                |  154 +++-
>  mm/zswap.c                                 |  771 +++++++++++++++++++--
>  17 files changed, 1870 insertions(+), 132 deletions(-)
>
>
> base-commit: 817952b8be34aad40e07f6832fb9d1fc08961550
> --
> 2.27.0
>
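One last note, on the batching API in (3) and (4): I picture the batch
path as "submit everything, then poll everything", roughly as sketched
below. The function signature, the 32-entry bound (taken from the
vm.compress-batchsize range above) and crypto_acomp_poll() are my
assumptions, not the actual swap_crypto_acomp_compress_batch() interface:

#include <crypto/acompress.h>
#include <linux/kernel.h>

/*
 * Sketch of a submit-all-then-poll-all batch compress.  Everything here is
 * illustrative; buffer setup and per-request error handling are elided.
 */
static int compress_batch_polled(struct acomp_req *reqs[], int nr)
{
	int errs[32];	/* 32 == upper bound of vm.compress-batchsize above */
	int i, pending = 0;

	if (nr > ARRAY_SIZE(errs))
		return -EINVAL;

	/* Stage 1: submit every request without waiting for completions. */
	for (i = 0; i < nr; i++) {
		errs[i] = crypto_acomp_compress(reqs[i]);
		if (errs[i] == -EINPROGRESS)
			pending++;
	}

	/* Stage 2: busy-poll the in-flight requests until all have completed. */
	while (pending) {
		for (i = 0; i < nr; i++) {
			int err;

			if (errs[i] != -EINPROGRESS)
				continue;
			err = crypto_acomp_poll(reqs[i]);	/* assumed helper */
			if (err != -EAGAIN) {
				errs[i] = err;
				pending--;
			}
		}
	}

	/* Report the first failure, if any. */
	for (i = 0; i < nr; i++)
		if (errs[i])
			return errs[i];
	return 0;
}

If that matches the intent, it makes it easy to see where the parallelism
across the IAA compression engines comes from.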