From: Barry Song <v-songbaohua@xxxxxxxx>

When large folios are compressed at a larger granularity, we observe a
notable reduction in CPU usage and a significant improvement in
compression ratios.

mTHP's ability to be swapped out without splitting and swapped back in
as a whole allows compression and decompression at larger granularities.

This patchset enhances zsmalloc and zram by adding support for dividing
large folios into multi-page blocks, typically configured with an
order-2 granularity. Without this patchset, a large folio is always
divided into `nr_pages` 4KiB blocks.

The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER`
setting, where the default of 2 allows all anonymous THP to benefit.

Examples include:
* A 16KiB large folio will be compressed and stored as a single 16KiB
  block.
* A 64KiB large folio will be compressed and stored as four 16KiB
  blocks.

For example, swapping out and swapping in 100MiB of typical anonymous
data 100 times (with 16KiB mTHP enabled) using zstd yields the following
results:

                        w/o patches        w/ patches
swap-out time(ms)       68711              49908
swap-in time(ms)        30687              20685
compression ratio       20.49%             16.9%

I deliberately created a test case with intense swap thrashing. On my
Intel i9 10-core, 20-thread PC, I imposed a 1GB memory limit on a memcg
to compile the Linux kernel, intending to amplify swap activity and
analyze its impact on system time.
Using the ZSTD algorithm, my test script, which builds the kernel for
five rounds, is as follows:

#!/bin/bash

# Enable only 16kB mTHP; disable every other size.
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

vmstat_path="/proc/vmstat"
thp_base_path="/sys/kernel/mm/transparent_hugepage"

# Print the current swap and paging counters on a single line.
read_values() {
    pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
    pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
    pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
    pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
    swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout 2>/dev/null || echo 0)
    swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout 2>/dev/null || echo 0)
    swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout 2>/dev/null || echo 0)
    swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin 2>/dev/null || echo 0)
    swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin 2>/dev/null || echo 0)
    swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin 2>/dev/null || echo 0)
    echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k $swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout"
}

for ((i=1; i<=5; i++))
do
    echo
    echo "*** Executing round $i ***"
    make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
    echo 3 > /proc/sys/vm/drop_caches

    #kernel build
    initial_values=($(read_values))
    time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
        CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j20 1>/dev/null 2>/dev/null
    final_values=($(read_values))

    # Report the per-round delta of each counter.
    echo "pswpin: $((final_values[0] - initial_values[0]))"
    echo "pswpout: $((final_values[1] - initial_values[1]))"
    echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
    echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
    echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
    echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
    echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
    echo "16kB-swpin: $((final_values[7] - initial_values[7]))"
    echo "pgpgin: $((final_values[8] - initial_values[8]))"
    echo "pgpgout: $((final_values[9] - initial_values[9]))"
done

******************
Test results

******* Without the patchset:

*** Executing round 1 ***

real    7m56.173s
user    81m29.401s
sys     42m57.470s
pswpin: 29815871
pswpout: 50548760
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11206086
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6596517
pgpgin: 146093656
pgpgout: 211024708

*** Executing round 2 ***

real    7m48.227s
user    81m20.558s
sys     43m0.940s
pswpin: 29798189
pswpout: 50882005
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11286587
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6596103
pgpgin: 146841468
pgpgout: 212374760

*** Executing round 3 ***

real    7m56.664s
user    81m10.936s
sys     43m5.991s
pswpin: 29760702
pswpout: 51230330
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11363346
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6586263
pgpgin: 145374744
pgpgout: 213355600

*** Executing round 4 ***

real    8m29.115s
user    81m18.955s
sys     42m49.050s
pswpin: 29651724
pswpout: 50631678
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11249036
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6583515
pgpgin: 145819060
pgpgout: 211373768

*** Executing round 5 ***

real    7m46.124s
user    80m29.780s
sys     41m37.005s
pswpin: 28805646
pswpout: 49570858
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11010873
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6391598
pgpgin: 142354376
pgpgout: 20713566

******* With the patchset:

*** Executing round 1 ***

real    7m43.760s
user    80m35.185s
sys     35m50.685s
pswpin: 29870407
pswpout: 50101263
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11140509
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6838090
pgpgin: 146500224
pgpgout: 209218896

*** Executing round 2 ***

real    7m31.820s
user    81m39.787s
sys     37m24.341s
pswpin: 31100304
pswpout: 51666202
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11471841
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 7106112
pgpgin: 151763112
pgpgout: 215526464

*** Executing round 3 ***

real    7m35.732s
user    79m36.028s
sys     34m4.190s
pswpin: 28357528
pswpout: 47716236
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 10619547
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6500899
pgpgin: 139903688
pgpgout: 199715908

*** Executing round 4 ***

real    7m38.242s
user    80m50.768s
sys     35m54.201s
pswpin: 29752937
pswpout: 49977585
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11117552
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6815571
pgpgin: 146293900
pgpgout: 208755500

*** Executing round 5 ***

real    8m2.692s
user    81m40.159s
sys     37m11.361s
pswpin: 30813683
pswpout: 51687672
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11481684
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 7044988
pgpgin: 150231840
pgpgout: 215616760

Although the real time fluctuated significantly on my PC, the sys time
has clearly decreased, from over 40 minutes (41m37s to 43m6s) without
the patchset to roughly 34 to 37 minutes with it, across all five
rounds.

-v3:
 * Added a patch to fall back to four smaller folios to avoid partial
   reads. Discussed this option with Usama, Ying, and Nhat in v2. Not
   entirely sure it will be well-received, but I've done my best to
   minimize the complexity added to do_swap_page().
 * Added a patch to adjust zstd backend estimated_src_size;
 * Addressed one VM_WARN_ON in patch 1 for PageMovable();

-v2:
 https://lore.kernel.org/linux-mm/20241107101005.69121-1-21cnbao@xxxxxxxxx/

 While it is not mature yet, I know some people are waiting for
 an update :-)
 * Fixed some stability issues.
 * Rebased against the latest mm-unstable.
 * Set default order to 2 which benefits all anon mTHP.
 * multipages ZsPageMovable is not supported yet.
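
For a single figure on the sys-time change, the per-round sys values
from the test results can be averaged. The following is only a sketch:
the helper names to_seconds and mean_seconds are illustrative and are
not part of the patchset or of the test script; the input values are
copied from the ten rounds reported above.

```shell
#!/bin/bash
# Illustrative helpers; names are not from the patchset or the test script.

# Convert a `time` value such as "42m57.470s" to seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, part, "m")          # part[1] = minutes, part[2] = "57.470s"
        sub(/s$/, "", part[2])       # drop the trailing "s"
        printf "%.3f\n", part[1] * 60 + part[2]
    }'
}

# Mean of several `time` values, in seconds with one decimal.
mean_seconds() {
    local t
    for t in "$@"; do to_seconds "$t"; done |
        awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}

# sys times copied from the five rounds of each configuration
without=$(mean_seconds 42m57.470s 43m0.940s 43m5.991s 42m49.050s 41m37.005s)
with=$(mean_seconds 35m50.685s 37m24.341s 34m4.190s 35m54.201s 37m11.361s)

echo "mean sys without: ${without}s"
echo "mean sys with:    ${with}s"
awk -v a="$without" -v b="$with" \
    'BEGIN { printf "reduction: %.1f%%\n", (a - b) / a * 100 }'
```

On these figures the mean sys time is about 2562s without the patchset
and about 2165s with it, a reduction of roughly 15.5%.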

Barry Song (2):
  zram: backend_zstd: Adjust estimated_src_size to accommodate
    multi-page compression
  mm: fall back to four small folios if mTHP allocation fails

Tangquan Zheng (2):
  mm: zsmalloc: support objects compressed based on multiple pages
  zram: support compression at the granularity of multi-pages

 drivers/block/zram/Kconfig        |   9 +
 drivers/block/zram/backend_zstd.c |   6 +-
 drivers/block/zram/zcomp.c        |  17 +-
 drivers/block/zram/zcomp.h        |  12 +-
 drivers/block/zram/zram_drv.c     | 450 ++++++++++++++++++++++++++++--
 drivers/block/zram/zram_drv.h     |  45 +++
 include/linux/zsmalloc.h          |  10 +-
 mm/Kconfig                        |  18 ++
 mm/memory.c                       | 203 +++++++++++++-
 mm/zsmalloc.c                     | 235 ++++++++++++----
 10 files changed, 896 insertions(+), 109 deletions(-)

-- 
2.39.3 (Apple Git-146)
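
The block division described in the cover letter (a 16KiB folio stored
as one 16KiB block, a 64KiB folio as four) can be reproduced with a
little arithmetic. This is a sketch under stated assumptions, not
kernel code: PAGE_KIB and blocks_for_folio are hypothetical names, a
4KiB base page is assumed, and the fallback to 4KiB blocks for folios
smaller than one multi-page block is an assumption rather than
something the cover letter states.

```shell
#!/bin/bash
# Sketch only: mirrors the block-division rule from the cover letter.
# PAGE_KIB and blocks_for_folio are illustrative, not kernel identifiers.
PAGE_KIB=4                      # assumed base page size (4KiB)
ZSMALLOC_MULTI_PAGES_ORDER=2    # default order named in the cover letter

# Print "<block count> <block size in KiB>" for a folio of the given KiB size.
blocks_for_folio() {
    local folio_kib=$1
    local block_kib=$(( PAGE_KIB << ZSMALLOC_MULTI_PAGES_ORDER ))
    if [ "$folio_kib" -ge "$block_kib" ] &&
       [ $(( folio_kib % block_kib )) -eq 0 ]; then
        # Folio is a whole number of multi-page blocks.
        echo "$(( folio_kib / block_kib )) $block_kib"
    else
        # Assumption: smaller folios are stored as individual 4KiB blocks.
        echo "$(( folio_kib / PAGE_KIB )) $PAGE_KIB"
    fi
}

blocks_for_folio 16    # prints "1 16"
blocks_for_folio 64    # prints "4 16"
```

With order 2 the block size is 4KiB << 2 = 16KiB, which matches the two
examples given in the cover letter.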