[PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages

Barry Song <21cnbao@xxxxxxxxx> · Fri, 22 Nov 2024 11:25:17 +1300

From: Barry Song <v-songbaohua@xxxxxxxx>

When large folios are compressed at a larger granularity, we observe
a notable reduction in CPU usage and a significant improvement in
compression ratios.

mTHP's ability to be swapped out without splitting and swapped back in
as a whole allows compression and decompression at larger granularities.

This patchset enhances zsmalloc and zram by adding support for dividing
large folios into multi-page blocks, typically configured with a
2-order granularity. Without this patchset, a large folio is always
divided into `nr_pages` 4KiB blocks.

The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER`
setting, where the default of 2 allows all anonymous THP to benefit.

Examples include:
* A 16KiB large folio will be compressed and stored as a single 16KiB
  block.
* A 64KiB large folio will be compressed and stored as four 16KiB
  blocks.

For example, swapping out and swapping in 100MiB of typical anonymous
data 100 times (with 16KB mTHP enabled) using zstd yields the following
results:

                        w/o patches        w/ patches
swap-out time(ms)       68711              49908
swap-in time(ms)        30687              20685
compression ratio       20.49%             16.9%

I deliberately created a test case with intense swap thrashing. On my
Intel i9 10-core, 20-thread PC, I imposed a 1GB memory limit on a memcg
to compile the Linux kernel, intending to amplify swap activity and
analyze its impact on system time. Using the ZSTD algorithm, my test
script, which builds the kernel for five rounds, is as follows:

#!/bin/bash

echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

vmstat_path="/proc/vmstat"
thp_base_path="/sys/kernel/mm/transparent_hugepage"

read_values() {
    pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
    pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
    pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
    pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
    swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout 2>/dev/null || echo 0)
    swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout 2>/dev/null || echo 0)
    swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout 2>/dev/null || echo 0)

    swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin 2>/dev/null || echo 0)
    swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin 2>/dev/null || echo 0)
    swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin 2>/dev/null || echo 0)

    echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k $swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout"
}

for ((i=1; i<=5; i++))
do
  echo
  echo "*** Executing round $i ***"
  make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
  echo 3 > /proc/sys/vm/drop_caches

  #kernel build
  initial_values=($(read_values))
  time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
        CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j20 1>/dev/null 2>/dev/null
  final_values=($(read_values))

  echo "pswpin: $((final_values[0] - initial_values[0]))"
  echo "pswpout: $((final_values[1] - initial_values[1]))"
  echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
  echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
  echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
  echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
  echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
  echo "pgpgin: $((final_values[8] - initial_values[8]))"
  echo "pgpgout: $((final_values[9] - initial_values[9]))"
done

******************  Test results

******* Without the patchset:

*** Executing round 1 ***

real	7m56.173s
user	81m29.401s
sys	42m57.470s
pswpin: 29815871
pswpout: 50548760
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11206086
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6596517
pgpgin: 146093656
pgpgout: 211024708

*** Executing round 2 ***

real	7m48.227s
user	81m20.558s
sys	43m0.940s
pswpin: 29798189
pswpout: 50882005
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11286587
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6596103
pgpgin: 146841468
pgpgout: 212374760

*** Executing round 3 ***

real	7m56.664s
user	81m10.936s
sys	43m5.991s
pswpin: 29760702
pswpout: 51230330
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11363346
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6586263
pgpgin: 145374744
pgpgout: 213355600

*** Executing round 4 ***

real	8m29.115s
user	81m18.955s
sys	42m49.050s
pswpin: 29651724
pswpout: 50631678
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11249036
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6583515
pgpgin: 145819060
pgpgout: 211373768

*** Executing round 5 ***

real	7m46.124s
user	80m29.780s
sys	41m37.005s
pswpin: 28805646
pswpout: 49570858
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11010873
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6391598
pgpgin: 142354376
pgpgout: 20713566

******* With the patchset:

*** Executing round 1 ***

real	7m43.760s
user	80m35.185s
sys	35m50.685s
pswpin: 29870407
pswpout: 50101263
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11140509
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6838090
pgpgin: 146500224
pgpgout: 209218896

*** Executing round 2 ***

real	7m31.820s
user	81m39.787s
sys	37m24.341s
pswpin: 31100304
pswpout: 51666202
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11471841
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 7106112
pgpgin: 151763112
pgpgout: 215526464

*** Executing round 3 ***

real	7m35.732s
user	79m36.028s
sys	34m4.190s
pswpin: 28357528
pswpout: 47716236
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 10619547
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6500899
pgpgin: 139903688
pgpgout: 199715908

*** Executing round 4 ***

real	7m38.242s
user	80m50.768s
sys	35m54.201s
pswpin: 29752937
pswpout: 49977585
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11117552
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6815571
pgpgin: 146293900
pgpgout: 208755500

*** Executing round 5 ***

real	8m2.692s
user	81m40.159s
sys	37m11.361s
pswpin: 30813683
pswpout: 51687672
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11481684
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 7044988
pgpgin: 150231840
pgpgout: 215616760

Although the real time fluctuated significantly on my PC, the
sys time has clearly decreased from over 40 minutes to just
over 30 minutes across all five rounds.

-v3:
 * Added a patch to fall back to four smaller folios to avoid partial reads.
   discussed this option with Usama, Ying, and Nhat in v2. Not entirely sure
   it will be well-received, but I've done my best to minimize the complexity
   added to do_swap_page().
 * Add a patch to adjust zstd backend estimated_src_size;
 * Addressed one VM_WARN_ON in patch 1 for PageMovable();

-v2:
 https://lore.kernel.org/linux-mm/20241107101005.69121-1-21cnbao@xxxxxxxxx/

 While it is not mature yet, I know some people are waiting for
 an update :-)
 * Fixed some stability issues.
 * rebase againest the latest mm-unstable.
 * Set default order to 2 which benefits all anon mTHP.
 * multipages ZsPageMovable is not supported yet.

Barry Song (2):
  zram: backend_zstd: Adjust estimated_src_size to accommodate
    multi-page compression
  mm: fall back to four small folios if mTHP allocation fails

Tangquan Zheng (2):
  mm: zsmalloc: support objects compressed based on multiple pages
  zram: support compression at the granularity of multi-pages

 drivers/block/zram/Kconfig        |   9 +
 drivers/block/zram/backend_zstd.c |   6 +-
 drivers/block/zram/zcomp.c        |  17 +-
 drivers/block/zram/zcomp.h        |  12 +-
 drivers/block/zram/zram_drv.c     | 450 ++++++++++++++++++++++++++++--
 drivers/block/zram/zram_drv.h     |  45 +++
 include/linux/zsmalloc.h          |  10 +-
 mm/Kconfig                        |  18 ++
 mm/memory.c                       | 203 +++++++++++++-
 mm/zsmalloc.c                     | 235 ++++++++++++----
 10 files changed, 896 insertions(+), 109 deletions(-)

-- 
2.39.3 (Apple Git-146)