On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx> wrote:
>
> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to v6.11-rc3 in patch 2/4 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@xxxxxxx/T/#u
>
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting any-order
> mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> delete all offsets corresponding to a higher order folio stored in zswap.
>
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon successful zswap_store of
> an mTHP folio:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> This patch-series is a precursor to ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> using Intel IAA hardware acceleration, which we would like to submit in
> subsequent RFC patch-series, with performance improvement data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Changes since v4:
> =================
> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
>    Nhat for the data reviews!).
> 2) Rebased to mm-unstable from 8/27/2024,
>    commit b659edec079c90012cf8d05624e312d1062b8b87.
> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get()
>    if CONFIG_MEMCG is not defined, to resolve build errors reported by
>    the kernel robot, as per Nhat's and Michal's suggestion to not require
>    a separate patch to fix the build errors (thanks both!).
> 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
>    suggested by Yosry (Thanks Yosry!).
> 5) Squashed the commits that define the new mTHP zswpout stat counters and
>    invoke count_mthp_stat() after successful zswap_store()s into a single
>    commit. Thanks Yosry for this suggestion!
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
>    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
>    changes to count_mthp_stat() so that it's always defined, even when THP
>    is disabled. Barry, I have also made one other change in page_io.c
>    where count_mthp_stat() is called by count_swpout_vm_event(). I would
>    appreciate it if you can review this. Thanks!
>    Hopefully this should resolve the kernel robot build errors.
>
> Changes since v2:
> =================
> 1) Gathered usemem data using SSD as the backing swap device for zswap,
>    as suggested by Ying Huang. Ying, I would appreciate it if you can
>    review the latest data. Thanks!
> 2) Generated the base commit info in the patches to attempt to address
>    the kernel test robot build errors.
> 3) No code changes to the individual patches themselves.
>
> Changes since RFC v1:
> =====================
>
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>    Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
>    Ryan's initial RFC [1]:
>    - Added a comment about the cgroup zswap limit checks occurring once per
>      folio at the beginning of zswap_store().
>      Nhat, Ryan, please do let me know if the comments convey the summary
>      from the RFC discussion. Thanks!
>    - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
> Performance Testing:
> ====================
> Testing of this patch-series was done with the v6.11-rc3 mainline, without
> and with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket, 56 cores per socket, 4 IAA devices per socket.
>
> The system has 503 GiB RAM, with 176 GiB ZRAM (35% of available RAM) as the
> backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
> Core frequency was fixed at 2500 MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 40G. There is no swap limit set for the cgroup. Following a
> methodology similar to Ryan Roberts' "Swap-out mTHP without splitting"
> series [2], 70 usemem processes were run, each allocating and writing 1G of
> memory:
>
> usemem --init-time -w -O -n 70 1g
>
> The vm/sysfs mTHP stats included with the performance data provide details
> on the swapout activity to ZSWAP/swap.
>
> Other kernel configuration parameters:
>
> ZSWAP Compressors : zstd, deflate-iaa
> ZSWAP Allocator   : zsmalloc
> SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the CRCs
> returned by the hardware will be compared, and errors reported in case of
> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
>
> Throughput is derived by averaging the individual 70 processes' throughputs
> reported by usemem. sys time is measured with perf. All data points are
> averaged across 3 runs.
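
As a concrete illustration of the modularization described earlier in the
cover letter (zswap_store_entry() storing one zswap_entry in the xarray,
zswap_delete_stored_offsets() dropping all offsets of a folio), a rough
sketch of the per-offset store/unwind flow could look like the code below.
This is only a sketch, not the actual patch: zswap_check_limits(),
zswap_store_page() and MTHP_STAT_ZSWPOUT are placeholder names invented
here for illustration.

/*
 * Illustrative sketch only. Store a large folio by compressing each
 * subpage and inserting one zswap_entry per swap offset, unwinding all
 * previously stored offsets on failure.
 */
bool zswap_store(struct folio *folio)
{
        long nr_pages = folio_nr_pages(folio);
        swp_entry_t swp = folio->swap;
        long i;

        /* Cgroup and zswap pool limit checks are done once per folio. */
        if (!zswap_check_limits(folio))
                return false;

        for (i = 0; i < nr_pages; i++) {
                /*
                 * Compress subpage i and have zswap_store_entry() insert
                 * its zswap_entry into the xarray at swp_offset(swp) + i.
                 */
                if (!zswap_store_page(folio, i))
                        goto unwind;
        }

        /* Bump the per-order zswpout counter on success. */
        count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
        return true;

unwind:
        /* Remove any offsets of this folio that were already stored. */
        zswap_delete_stored_offsets(swp, nr_pages);
        return false;
}

The point of this structure is that the cgroup/zswap limit checks happen
once per folio, while compression and xarray insertion happen per offset,
which keeps the unwind path on partial failure simple.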
>
> 64KB mTHP (cgroup memory.high set to 40G):
> ==========================================
>
> ------------------------------------------------------------------------------
>                     v6.11-rc3 mainline   zswap-mTHP            Change wrt
>                     Baseline                                   Baseline
> ------------------------------------------------------------------------------
> ZSWAP compressor    zstd      deflate-   zstd       deflate-   zstd   deflate-
>                               iaa                   iaa               iaa
> ------------------------------------------------------------------------------
> Throughput (KB/s)   161,496   156,343    140,363    151,938    -13%   -3%
> sys time (sec)      771.68    802.08     954.85     735.47     -24%   8%
> memcg_high          111,223   110,889    138,651    133,884
> memcg_swap_high     0         0          0          0
> memcg_swap_fail     0         0          0          0
> pswpin              16        16         0          0
> pswpout             7,471,472 7,527,963  0          0
> zswpin              635       605        624        639
> zswpout             1,509     1,478      9,453,761  9,385,910
> thp_swpout          0         0          0          0
> thp_swpout_fallback 0         0          0          0
> pgmajfault          3,616     3,430      4,633      3,611
> ZSWPOUT-64kB        n/a       n/a        590,768    586,521
> SWPOUT-64kB         466,967   470,498    0          0
> ------------------------------------------------------------------------------
>
> 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> =======================================================
>
> ------------------------------------------------------------------------------
>                     v6.11-rc3 mainline   zswap-mTHP            Change wrt
>                     Baseline                                   Baseline
> ------------------------------------------------------------------------------
> ZSWAP compressor    zstd      deflate-   zstd       deflate-   zstd   deflate-
>                               iaa                   iaa               iaa
> ------------------------------------------------------------------------------
> Throughput (KB/s)   192,164   194,643    165,005    174,536    -14%   -10%
> sys time (sec)      823.55    830.42     801.72     676.65     3%     19%
> memcg_high          16,054    15,936     14,951     16,096
> memcg_swap_high     0         0          0          0
> memcg_swap_fail     0         0          0          0
> pswpin              0         0          0          0
> pswpout             8,629,248 8,628,907  0          0
> zswpin              560       645        5,333      781
> zswpout             1,416     1,503      8,546,895  9,355,760
> thp_swpout          16,854    16,853     0          0
> thp_swpout_fallback 0         0          0          0
> pgmajfault          3,341     3,574      8,139      3,582
> ZSWPOUT-2048kB      n/a       n/a        16,684     18,270
> SWPOUT-2048kB       16,854    16,853     0          0
> ------------------------------------------------------------------------------
>
> In the "Before" scenario, when zswap does not store mTHP, only allocations
> count towards the cgroup memory limit. However, in the "After" scenario,
> with the introduction of zswap_store() for mTHP, both the allocations and
> the zswap compressed pool usage from all 70 processes are counted towards
> the memory limit. As a result, we see higher swapout activity in the
> "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> charge leads to more frequent memory.high breaches.
>
> This causes degradation in throughput and sys time with zswap mTHP, more so
> in the case of zstd than deflate-iaa. Compress latency could play a part in
> this: when there is more swapout activity happening, a slower compressor
> would cause allocations to stall for any/all of the 70 processes.
>
> In my opinion, even though the test setup does not provide an accurate
> way for a direct before/after comparison (because of zswap usage being
> counted in the cgroup, hence towards memory.high), it still seems
> reasonable for zswap_store to support (m)THP, so that further performance
> improvements can be implemented.

Are you saying that in the "Before" data we end up skipping zswap
completely because of using mTHPs? Does it make more sense to turn
CONFIG_THP_SWAP off in the "Before" data to force the mTHPs to be split
and for the data to be stored in zswap?
This would be a fairer Before/After comparison, where the memory goes to
zswap in both cases, but in "Before" it has to be split because of zswap's
lack of support for mTHP. I assume most setups relying on zswap will be
turning CONFIG_THP_SWAP off today anyway, but maybe not. Nhat, is this
something you can share?
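
For completeness, here is a rough sketch of the relevant mainline swap-out
path as of v6.11 (paraphrased from memory, not quoted from the tree, so
treat the details as approximate); it shows why, in the "Before" runs, an
mTHP bypasses zswap entirely and lands on the backing device:

/* mm/zswap.c -- mainline behavior before this series (paraphrased) */
bool zswap_store(struct folio *folio)
{
        /* Large folios aren't supported: the whole mTHP is rejected. */
        if (folio_test_large(folio))
                return false;

        /* ... compress and store the single 4K page ... */
        return true;
}

/* mm/page_io.c -- simplified; error handling and memcg checks omitted */
int swap_writepage(struct page *page, struct writeback_control *wbc)
{
        struct folio *folio = page_folio(page);

        if (zswap_store(folio)) {
                /* Stored in zswap: no I/O to the backing device. */
                folio_unlock(folio);
                return 0;
        }

        /*
         * zswap_store() failed (for an mTHP on the "Before" kernel this
         * is the common case), so the folio is written to the backing
         * swap device, which is what shows up as pswpout / SWPOUT-64kB
         * and SWPOUT-2048kB in the tables above.
         */
        __swap_writepage(folio, wbc);
        return 0;
}

As far as I understand, with CONFIG_THP_SWAP=n the large folio is instead
split during reclaim before it reaches swap_writepage(), so every 4K page
would go through zswap_store() and be charged to the cgroup, which is what
would make the "Before" and "After" numbers directly comparable.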