> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Sent: Wednesday, August 28, 2024 3:37 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; nphamcs@xxxxxxxxx; ryan.roberts@xxxxxxx;
> Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
> <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
>
> On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has
> > been migrated to v6.11-rc3 in patch 2/4 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@xxxxxxx/T/#u
> >
> > Additionally, there is an attempt to modularize some of the
> > functionality in zswap_store(), to make it more amenable to
> > supporting any-order mTHPs. For instance, the function
> > zswap_store_entry() stores a zswap_entry in the xarray. Likewise,
> > zswap_delete_stored_offsets() can be used to delete all offsets
> > corresponding to a higher-order folio stored in zswap. (A rough
> > sketch of this flow is included after the test-setup description
> > below.)
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon a successful
> > zswap_store() of an mTHP folio:
> >
> >   /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on
> > swapin_readahead(), using Intel IAA hardware acceleration, which we
> > would like to submit in subsequent RFC patch-series, with performance
> > improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Changes since v4:
> > =================
> > 1) Published before/after data with zstd, as suggested by Nhat
> >    (thanks Nhat for the data reviews!).
> > 2) Rebased to mm-unstable from 8/27/2024,
> >    commit b659edec079c90012cf8d05624e312d1062b8b87.
> > 3) Incorporated the change in memcontrol.h that defines
> >    obj_cgroup_get() if CONFIG_MEMCG is not defined, to resolve build
> >    errors reported by the kernel robot, as per Nhat's and Michal's
> >    suggestion not to require a separate patch to fix the build errors
> >    (thanks both!).
> > 4) Deleted all same-filled folio processing in zswap_store() of mTHP,
> >    as suggested by Yosry (thanks Yosry!).
> > 5) Squashed the commits that define the new mTHP zswpout stat
> >    counters, and that invoke count_mthp_stat() after successful
> >    zswap_store()s, into a single commit. Thanks Yosry for this
> >    suggestion!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit
> >    8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> >    changes to count_mthp_stat() so that it's always defined, even
> >    when THP is disabled. Barry, I have also made one other change in
> >    page_io.c, where count_mthp_stat() is called by
> >    count_swpout_vm_event(). I would appreciate it if you could
> >    review this. Thanks! Hopefully this should resolve the kernel
> >    robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for
> >    zswap, as suggested by Ying Huang. Ying, I would appreciate it if
> >    you could review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to
> >    address the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >    Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided
> >    in Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occurring
> >      once per folio at the beginning of zswap_store().
> >      Nhat, Ryan, please do let me know if the comments convey the
> >      summary from the RFC discussion. Thanks!
> >    - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased
> >    patch-series.
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline,
> > without and with this patch-series, on an Intel Sapphire Rapids
> > server: dual-socket, 56 cores per socket, 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, with 176 GiB of ZRAM (35% of available
> > RAM) as the backing swap device for ZSWAP. zstd is configured as the
> > ZRAM compressor. Core frequency was fixed at 2500 MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose
> > memory.high was fixed at 40G. There is no swap limit set for the
> > cgroup. Following a methodology similar to that in Ryan Roberts'
> > "Swap-out mTHP without splitting" series [2], 70 usemem processes
> > were run, each allocating and writing 1G of memory:
> >
> >     usemem --init-time -w -O -n 70 1g
> >
> > The vm/sysfs mTHP stats included with the performance data provide
> > details on the swapout activity to ZSWAP/swap.
> >
> > Other kernel configuration parameters:
> >
> >     ZSWAP Compressors : zstd, deflate-iaa
> >     ZSWAP Allocator   : zsmalloc
> >     SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP
> > compressor, IAA "compression verification" is enabled. Hence each
> > IAA compression will be decompressed internally by the "iaa_crypto"
> > driver, the CRCs returned by the hardware will be compared, and
> > errors reported in case of mismatches. Thus "deflate-iaa" helps
> > ensure better data integrity as compared to the software
> > compressors.
> >
> > Throughput is derived by averaging the individual throughputs
> > reported by the 70 usemem processes. sys time is measured with perf.
> > All data points are averaged across 3 runs.
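> >
> > To illustrate the modularization mentioned earlier, here is a rough
> > sketch of the store flow for a large folio. The helper names are the
> > ones described above, but the signatures, error handling and locking
> > are simplified assumptions for illustration, not the literal patch:
> >
> >     /*
> >      * Sketch only: store each subpage of a (possibly large) folio
> >      * in zswap, and unwind on failure.
> >      */
> >     static bool zswap_store_folio_sketch(struct folio *folio)
> >     {
> >             long nr_pages = folio_nr_pages(folio);
> >             long index;
> >
> >             for (index = 0; index < nr_pages; index++) {
> >                     /* compress this subpage and store its
> >                      * zswap_entry in the xarray */
> >                     if (!zswap_store_entry(folio, index))
> >                             goto cleanup;
> >             }
> >
> >             /* bump the new per-order zswpout counter; the
> >              * MTHP_STAT_ZSWPOUT item is what this series adds,
> >              * the exact call site may differ */
> >             count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
> >             return true;
> >
> >     cleanup:
> >             /* delete all offsets of this folio stored so far */
> >             zswap_delete_stored_offsets(folio, index);
> >             return false;
> >     }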
> >
> > 64KB mTHP (cgroup memory.high set to 40G):
> > ==========================================
> >
> > -----------------------------------------------------------------------------
> >                      v6.11-rc3 mainline      zswap-mTHP           Change wrt
> >                          (Baseline)                                Baseline
> > -----------------------------------------------------------------------------
> > ZSWAP compressor        zstd  deflate-      zstd  deflate-     zstd  deflate-
> >                                    iaa                 iaa                iaa
> > -----------------------------------------------------------------------------
> > Throughput (KB/s)    161,496   156,343   140,363   151,938     -13%      -3%
> > sys time (sec)        771.68    802.08    954.85    735.47     -24%       8%
> > memcg_high           111,223   110,889   138,651   133,884
> > memcg_swap_high            0         0         0         0
> > memcg_swap_fail            0         0         0         0
> > pswpin                    16        16         0         0
> > pswpout            7,471,472 7,527,963         0         0
> > zswpin                   635       605       624       639
> > zswpout                1,509     1,478 9,453,761 9,385,910
> > thp_swpout                 0         0         0         0
> > thp_swpout_fallback        0         0         0         0
> > pgmajfault             3,616     3,430     4,633     3,611
> > ZSWPOUT-64kB             n/a       n/a   590,768   586,521
> > SWPOUT-64kB          466,967   470,498         0         0
> > -----------------------------------------------------------------------------
> >
> > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> > =======================================================
> >
> > -----------------------------------------------------------------------------
> >                      v6.11-rc3 mainline      zswap-mTHP           Change wrt
> >                          (Baseline)                                Baseline
> > -----------------------------------------------------------------------------
> > ZSWAP compressor        zstd  deflate-      zstd  deflate-     zstd  deflate-
> >                                    iaa                 iaa                iaa
> > -----------------------------------------------------------------------------
> > Throughput (KB/s)    192,164   194,643   165,005   174,536     -14%     -10%
> > sys time (sec)        823.55    830.42    801.72    676.65       3%      19%
> > memcg_high            16,054    15,936    14,951    16,096
> > memcg_swap_high            0         0         0         0
> > memcg_swap_fail            0         0         0         0
> > pswpin                     0         0         0         0
> > pswpout            8,629,248 8,628,907         0         0
> > zswpin                   560       645     5,333       781
> > zswpout                1,416     1,503 8,546,895 9,355,760
> > thp_swpout            16,854    16,853         0         0
> > thp_swpout_fallback        0         0         0         0
> > pgmajfault             3,341     3,574     8,139     3,582
> > ZSWPOUT-2048kB           n/a       n/a    16,684    18,270
> > SWPOUT-2048kB         16,854    16,853         0         0
> > -----------------------------------------------------------------------------
> >
> > In the "Before" scenario, when zswap does not store mTHP, only
> > allocations count towards the cgroup memory limit. However, in the
> > "After" scenario, with the introduction of mTHP support in
> > zswap_store(), both the allocations and the zswap compressed pool
> > usage of all 70 processes are counted towards the memory limit. As a
> > result, we see higher swapout activity in the "After" data, and more
> > time is spent doing reclaim, because the zswap cgroup charge leads
> > to more frequent memory.high breaches.
> >
> > This causes degradation in throughput and sys time with zswap mTHP,
> > more so with zstd than with deflate-iaa. Compression latency could
> > play a part in this: when more swapout activity is happening, a
> > slower compressor would cause allocations to stall for any/all of
> > the 70 processes.
> >
> > In my opinion, even though the test setup does not provide an
> > accurate way to make a direct before/after comparison (because zswap
> > usage is charged to the cgroup, and hence counts towards
> > memory.high), it still seems reasonable for zswap_store to support
> > (m)THP, so that further performance improvements can be implemented.
>
> Are you saying that in the "Before" data we end up skipping zswap
> completely because of using mTHPs?

That's right, Yosry.
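
(For context, zswap charges the compressed size of each stored object
to the folio's memcg; in mm/zswap.c this is roughly the following,
which is why the compressed pool usage of all 70 processes counts
towards memory.high in the "After" runs:

	/* charge the compressed size to the memcg's zswap usage */
	obj_cgroup_charge_zswap(objcg, entry->length);

The "Before" runs never reach this point, because the mTHP folios
bypass zswap entirely and go to the backing swap device.)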

> Does it make more sense to turn CONFIG_THP_SWAP off in the "Before"
> data to force the mTHPs to be split and for the data to be stored in
> zswap? This would be a more fair Before/After comparison where the
> memory goes to zswap in both cases, but "Before" has to be split
> because of zswap's lack of support for mTHP. I assume most setups
> relying on zswap will be turning CONFIG_THP_SWAP off today anyway,
> but maybe not. Nhat, is this something you can share?

We could do this; however, I am not sure whether turning off
CONFIG_THP_SWAP would have other side-effects, in terms of disabling
mm code paths outside of zswap that are intended as mTHP
optimizations, which could again skew the before/after comparisons.
Will wait for Nhat's comments as well.

Thanks,
Kanchana
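
P.S.: For reference, this is my (simplified) understanding of what
CONFIG_THP_SWAP=n would change in the "Before" runs, as a conceptual
sketch rather than the actual mm/vmscan.c code:

	/*
	 * Conceptual sketch only: with CONFIG_THP_SWAP disabled, swap
	 * space cannot be allocated for a large folio as a whole, so
	 * reclaim splits the folio, and each 4K base page is then
	 * swapped out, i.e. zswap_store()'d, individually.
	 */
	if (folio_test_large(folio) && !IS_ENABLED(CONFIG_THP_SWAP))
		split_folio_to_list(folio, folio_list);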