"Sridhar, Kanchana P" <kanchana.p.sridhar@xxxxxxxxx> writes: > Hi Ying, > >> -----Original Message----- >> From: Huang, Ying <ying.huang@xxxxxxxxx> >> Sent: Sunday, August 18, 2024 8:17 PM >> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx> >> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; >> hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; nphamcs@xxxxxxxxx; >> ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; >> Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K >> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx> >> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios >> >> Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx> writes: >> >> [snip] >> >> > >> > Performance Testing: >> > ==================== >> > Testing of this patch-series was done with the v6.11-rc3 mainline, without >> > and with this patch-series, on an Intel Sapphire Rapids server, >> > dual-socket 56 cores per socket, 4 IAA devices per socket. >> > >> > The system has 503 GiB RAM, with a 4G SSD as the backing swap device for >> > ZSWAP. Core frequency was fixed at 2500MHz. >> > >> > The vm-scalability "usemem" test was run in a cgroup whose memory.high >> > was fixed. Following a similar methodology as in Ryan Roberts' >> > "Swap-out mTHP without splitting" series [2], 70 usemem processes were >> > run, each allocating and writing 1G of memory: >> > >> > usemem --init-time -w -O -n 70 1g >> > >> > Since I was constrained to get the 70 usemem processes to generate >> > swapout activity with the 4G SSD, I ended up using different cgroup >> > memory.high fixed limits for the experiments with 64K mTHP and 2M THP: >> > >> > 64K mTHP experiments: cgroup memory fixed at 60G >> > 2M THP experiments : cgroup memory fixed at 55G >> > >> > The vm/sysfs stats included after the performance data provide details >> > on the swapout activity to SSD/ZSWAP. >> > >> > Other kernel configuration parameters: >> > >> > ZSWAP Compressor : LZ4, DEFLATE-IAA >> > ZSWAP Allocator : ZSMALLOC >> > SWAP page-cluster : 2 >> > >> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor, >> > IAA "compression verification" is enabled. Hence each IAA compression >> > will be decompressed internally by the "iaa_crypto" driver, the crc-s >> > returned by the hardware will be compared and errors reported in case of >> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as >> > compared to the software compressors. >> > >> > Throughput reported by usemem and perf sys time for running the test >> > are as follows, averaged across 3 runs: >> > >> > 64KB mTHP (cgroup memory.high set to 60G): >> > ========================================== >> > ------------------------------------------------------------------ >> > | | | | | >> > |Kernel | mTHP SWAP-OUT | Throughput | Improvement| >> > | | | KB/s | | >> > |--------------------|-------------------|------------|------------| >> > |v6.11-rc3 mainline | SSD | 335,346 | Baseline | >> > |zswap-mTHP-Store | ZSWAP lz4 | 271,558 | -19% | >> >> zswap throughput is worse than ssd swap? This doesn't look right. > > I realize it might look that way, however, this is not an apples-to-apples comparison, > as explained in the latter part of my analysis (after the 2M THP data tables). > The primary reason for this is because of running the test under a fixed > cgroup memory limit. > > In the "Before" scenario, mTHP get swapped out to SSD. 
> However, the disk swap usage is not accounted towards checking if the
> cgroup's memory limit has been exceeded. Hence there are relatively fewer
> swap-outs, resulting mainly from the 1G allocations from each of the 70
> usemem processes working with a 60G memory limit on the parent cgroup.
>
> However, the picture changes in the "After" scenario. mTHPs now get stored
> in zswap, which is accounted for in the cgroup's memory.current and counts
> towards the fixed memory limit in effect for the parent cgroup. As a result,
> when mTHPs get stored in zswap, the compressed data in the zswap zpool now
> counts towards the cgroup's active memory and memory limit. This is in
> addition to the 1G allocations from each of the 70 processes.
>
> As you can see, this creates more memory pressure on the cgroup, resulting
> in more swap-outs. With lz4 as the zswap compressor, this results in lower
> throughput w.r.t. "Before".
>
> However, with IAA as the zswap compressor, the throughput with zswap mTHP is
> better than "Before" because of lower hardware compression latencies, which
> absorb the higher swap-out activity without compromising throughput.
>
>>
>> > |zswap-mTHP-Store    | ZSWAP deflate-iaa |  388,154   |        16% |
>> > |------------------------------------------------------------------|
>> > |                    |                   |            |            |
>> > |Kernel              | mTHP SWAP-OUT     | Sys time   | Improvement|
>> > |                    |                   | sec        |            |
>> > |--------------------|-------------------|------------|------------|
>> > |v6.11-rc3 mainline  | SSD               |   91.37    |   Baseline |
>> > |zswap-mTHP-Store    | ZSWAP lz4         |  265.43    |      -191% |
>> > |zswap-mTHP-Store    | ZSWAP deflate-iaa |  235.60    |      -158% |
>> >  ------------------------------------------------------------------
>> >
>> >  ----------------------------------------------------------------------
>> > | VMSTATS, mTHP ZSWAP/SSD stats | v6.11-rc3 | zswap-mTHP | zswap-mTHP  |
>> > |                               | mainline  | Store      | Store       |
>> > |                               |           | lz4        | deflate-iaa |
>> > |-------------------------------|-----------|------------|-------------|
>> > | pswpin                        |         0 |          0 |           0 |
>> > | pswpout                       |   174,432 |          0 |           0 |
>> > | zswpin                        |       703 |        534 |         721 |
>> > | zswpout                       |     1,501 |  1,491,654 |   1,398,805 |
>>
>> It appears that the number of swapped pages for zswap is much larger
>> than that of SSD swap.  Why?  I guess this is why zswap throughput is
>> worse.
>
> Your observation is correct. I hope the above explanation clarifies the
> reasoning behind this.

In terms of data swapped out (4KB pages):

Before: (174432 + 1501) * 4 / 1024 = 687.2 MB
After:  1491654 * 4.0 / 1024 = 5826.8 MB

That is, roughly 8.5x as much data is swapped out in the zswap case.
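FWIW, a quick way to sanity-check these numbers is to read the counters
directly. A minimal sketch, assuming 4KB pages, a cgroup-v2 hierarchy mounted
at /sys/fs/cgroup, and a hypothetical cgroup path for the usemem job (adjust
the path and field names to your kernel/setup):

    # Cumulative swap-out volume; /proc/vmstat values are page counts.
    awk '/^(pswpout|zswpout) / { printf "%-8s %10.1f MB\n", $1, $2 * 4 / 1024 }' /proc/vmstat

    # What zswap currently charges to the cgroup (this is what counts towards
    # memory.high in the "After" case); memory.stat values are in bytes.
    CG=/sys/fs/cgroup/usemem-test    # hypothetical cgroup path
    awk '/^zswap(ped)? / { printf "%-8s %10.1f MB\n", $1, $2 / 1048576 }' "$CG"/memory.stat

As I understand it, the "zswap" and "zswapped" entries in memory.stat are the
compressed and uncompressed sizes of what the cgroup currently holds in zswap;
the compressed footprint is what gets charged against memory.high, whereas SSD
swap shows up only in memory.swap.current and does not add to that charge.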