This RFC patch-series enables zswap_store() to accept and store mTHP
folios. The most significant contribution in this series is from the
earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
migrated to v6.10 in patch 3 of this series.

[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
     https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@xxxxxxx/T/#u

Additionally, there is an attempt to modularize some of the
functionality in zswap_store(), to make it more amenable to supporting
any-order mTHPs.

For instance, the same-filled check now takes an index into the folio
and derives the page to examine from that index. Likewise, a function
"zswap_store_entry" is added to store a zswap_entry in the xarray.

For testing purposes, per-mTHP size vmstat zswap_store event counters
are added, and incremented upon successful zswap_store of an mTHP.

This patch-series is a precursor to ZSWAP compress batching of mTHP
swap-out and decompress batching of swap-ins based on
swapin_readahead(), using Intel IAA hardware acceleration, which we
would like to submit in subsequent RFC patch-series, with performance
improvement data.

Performance Testing:
====================
Testing of this patch-series was done with the v6.10 mainline, without
and with this RFC, on an Intel Sapphire Rapids server, dual-socket 56
cores per socket, 4 IAA devices per socket.

The system has 503 GiB RAM, 176 GiB swap/ZSWAP with ZRAM as the backing
swap device. Core frequency was fixed at 2500MHz.

The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 40G.
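As a rough sanity check on why this setup forces swap-out, the sketch
below compares the aggregate usemem demand with the cgroup limit. It
assumes that "G"/"g" mean GiB and that base pages are 4 KiB, and it uses
the 70 x 1G workload described in the next paragraph. The helper names
are hypothetical and are not part of this series. Since memory.high is a
throttling threshold rather than a hard cap, the result is only an
estimate of the minimum amount that has to be reclaimed, not a
prediction of the swap-out counters.

```python
GIB = 1 << 30        # assumption: "G"/"g" in the test setup means GiB
PAGE_SIZE = 4096     # assumption: 4 KiB base pages


def excess_bytes(nr_procs, bytes_per_proc, limit_bytes):
    """Bytes of aggregate demand that do not fit under the limit (0 if it fits)."""
    demand = nr_procs * bytes_per_proc
    return max(0, demand - limit_bytes)


def to_pages(nbytes, page_size=PAGE_SIZE):
    """Number of base pages needed to cover nbytes, rounded up."""
    return -(-nbytes // page_size)


if __name__ == "__main__":
    # 70 usemem processes, 1G each, in a cgroup with memory.high = 40G
    excess = excess_bytes(70, 1 * GIB, 40 * GIB)
    print(f"excess: {excess // GIB} GiB = {to_pages(excess):,} base pages")
```

With these inputs the sketch reports an excess of 30 GiB, which is
7,864,320 base pages.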
Following a similar methodology as in Ryan Roberts' "Swap-out mTHP
without splitting" series [2], 70 usemem processes were run, each
allocating and writing 1G of memory:

    usemem --init-time -w -O -n 70 1g

Other kernel configuration parameters:

    ZSWAP Compressor  : LZ4, DEFLATE-IAA
    ZSWAP Allocator   : ZSMALLOC
    ZRAM Compressor   : LZO-RLE
    SWAP page-cluster : 2

In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
IAA "compression verification" is enabled. Hence each IAA compression
will be decompressed internally by the "iaa_crypto" driver, the crc-s
returned by the hardware will be compared, and errors reported in case
of mismatches. Thus "deflate-iaa" helps ensure better data integrity as
compared to the software compressors.

Throughput reported by usemem and perf sys time for running the test
were measured and averaged across 3 runs:

64KB mTHP:
==========
 --------------------------------------------------------------
|Kernel          | mTHP SWAP-OUT     | Throughput |Improvement |
|                |                   |       KB/s |            |
|----------------|-------------------|------------|------------|
|v6.10 mainline  | ZRAM lzo-rle      |    111,180 |   Baseline |
|zswap-mTHP-RFC  | ZSWAP lz4         |    115,996 |         4% |
|zswap-mTHP-RFC  | ZSWAP deflate-iaa |    166,048 |        49% |
|--------------------------------------------------------------|
|Kernel          | mTHP SWAP-OUT     |   Sys time |Improvement |
|                |                   |        sec |            |
|----------------|-------------------|------------|------------|
|v6.10 mainline  | ZRAM lzo-rle      |   1,049.69 |   Baseline |
|zswap-mTHP-RFC  | ZSWAP lz4         |   1,178.20 |       -12% |
|zswap-mTHP-RFC  | ZSWAP deflate-iaa |     626.12 |        40% |
 --------------------------------------------------------------

 ----------------------------------------------------------
| VMSTATS, mTHP ZSWAP stats,    |      v6.10 |  zswap-mTHP |
| mTHP ZRAM stats:              |   mainline |         RFC |
|-------------------------------|------------|-------------|
| pswpin                        |         16 |           0 |
| pswpout                       |  7,823,984 |           0 |
| zswpin                        |        551 |         647 |
| zswpout                       |      1,410 |  15,175,113 |
|-------------------------------|------------|-------------|
| thp_swpout                    |          0 |           0 |
| thp_swpout_fallback           |          0 |           0 |
| pgmajfault                    |      2,189 |       2,241 |
|-------------------------------|------------|-------------|
| zswpout_4kb_folio             |            |       1,497 |
| mthp_zswpout_64kb             |            |     948,351 |
|-------------------------------|------------|-------------|
| hugepages-64kB/stats/swpout   |    488,999 |           0 |
 ----------------------------------------------------------

2MB PMD-THP/2048K mTHP:
=======================
 --------------------------------------------------------------
|Kernel          | mTHP SWAP-OUT     | Throughput |Improvement |
|                |                   |       KB/s |            |
|----------------|-------------------|------------|------------|
|v6.10 mainline  | ZRAM lzo-rle      |    136,617 |   Baseline |
|zswap-mTHP-RFC  | ZSWAP lz4         |    137,360 |         1% |
|zswap-mTHP-RFC  | ZSWAP deflate-iaa |    179,097 |        31% |
|--------------------------------------------------------------|
|Kernel          | mTHP SWAP-OUT     |   Sys time |Improvement |
|                |                   |        sec |            |
|----------------|-------------------|------------|------------|
|v6.10 mainline  | ZRAM lzo-rle      |   1,044.40 |   Baseline |
|zswap-mTHP-RFC  | ZSWAP lz4         |   1,035.79 |         1% |
|zswap-mTHP-RFC  | ZSWAP deflate-iaa |     571.31 |        45% |
 --------------------------------------------------------------

 ----------------------------------------------------------
| VMSTATS, mTHP ZSWAP stats,    |      v6.10 |  zswap-mTHP |
| mTHP ZRAM stats:              |   mainline |         RFC |
|-------------------------------|------------|-------------|
| pswpin                        |          0 |           0 |
| pswpout                       |  8,630,272 |           0 |
| zswpin                        |        565 |       6,901 |
| zswpout                       |      1,388 |  15,379,163 |
|-------------------------------|------------|-------------|
| thp_swpout                    |     16,856 |           0 |
| thp_swpout_fallback           |          0 |           0 |
| pgmajfault                    |      2,184 |       8,532 |
|-------------------------------|------------|-------------|
| zswpout_4kb_folio             |            |       5,851 |
| mthp_zswpout_2048kb           |            |      30,026 |
| zswpout_pmd_thp_folio         |            |      30,026 |
|-------------------------------|------------|-------------|
| hugepages-2048kB/stats/swpout |     16,856 |           0 |
 ----------------------------------------------------------

As expected in the "Before" experiment (v6.10 mainline), there are
relatively fewer swapouts, because ZRAM utilization is not accounted in
the cgroup. With the introduction of zswap_store mTHP, the "After" data
(zswap-mTHP-RFC) reflects the higher swapout activity, and the
consequent sys time degradation (-12% with lz4 for 64KB mTHP).

Our goal is to improve ZSWAP mTHP store performance using batching. With
Intel IAA compress/decompress batching used in ZSWAP (to be submitted as
additional RFC series), we are able to demonstrate significant
performance improvements with IAA as compared to software compressors.

For instance, with IAA-Canned compression [3] used with batching of
zswap_stores and of zswap_loads, the usemem throughput, averaged over 3
runs, improves to 170,461 KB/s (64KB mTHP) and 188,325 KB/s (2MB THP).

[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@xxxxxxx/
[3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@xxxxxxxxxxxxxxx/

Kanchana P Sridhar (4):
  mm: zswap: zswap_is_folio_same_filled() takes an index in the folio.
  mm: vmstat: Per mTHP-size zswap_store vmstat event counters.
  mm: zswap: zswap_store() extended to handle mTHP folios.
  mm: page_io: Count successful mTHP zswap stores in vmstat.

 include/linux/vm_event_item.h |  15 +++
 mm/page_io.c                  |  44 +++++++
 mm/vmstat.c                   |  15 +++
 mm/zswap.c                    | 223 ++++++++++++++++++++++++----------
 4 files changed, 233 insertions(+), 64 deletions(-)

--
2.27.0