Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Sent: Wednesday, August 28, 2024 6:02 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; nphamcs@xxxxxxxxx; ryan.roberts@xxxxxxx;
> Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>; Chris
> Li <chrisl@xxxxxxxxxx>
> Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
>
> [..]
> > > > In the "Before" scenario, when zswap does not store mTHP, only
> > > > allocations count towards the cgroup memory limit. However, in the
> > > > "After" scenario, with the introduction of mTHP support in
> > > > zswap_store(), both the allocations and the zswap compressed pool
> > > > usage from all 70 processes are counted towards the memory limit. As
> > > > a result, we see higher swapout activity in the "After" data. Hence,
> > > > more time is spent doing reclaim, as the zswap cgroup charge leads
> > > > to more frequent memory.high breaches.
> > > >
> > > > This causes degradation in throughput and sys time with zswap mTHP,
> > > > more so in the case of zstd than deflate-iaa. Compression latency
> > > > could play a part in this - when there is more swapout activity
> > > > happening, a slower compressor would cause allocations to stall for
> > > > any/all of the 70 processes.
> > > >
> > > > In my opinion, even though the test setup does not provide an
> > > > accurate way to make a direct before/after comparison (because zswap
> > > > usage is counted in the cgroup, hence towards memory.high), it still
> > > > seems reasonable for zswap_store to support (m)THP, so that further
> > > > performance improvements can be implemented.
> > >
> > > Are you saying that in the "Before" data we end up skipping zswap
> > > completely because of using mTHPs?
> >
> > That's right, Yosry.
> >
> > >
> > > Does it make more sense to turn off CONFIG_THP_SWAP in the "Before"
> > > data
> >
> > We could do this; however, I am not sure if turning off CONFIG_THP_SWAP
> > will have other side effects in terms of disabling mm code paths outside
> > of zswap that are intended as mTHP optimizations, which could again skew
> > the before/after comparisons.
>
> Yeah that's possible, but right now we are testing mTHP swapout that
> does not go through zswap at all vs. mTHP swapout going through zswap.
>
> I think what we really want to measure is 4K swapout going through
> zswap vs. mTHP swapout going through zswap. This assumes that current
> zswap setups disable CONFIG_THP_SWAP, so we would be measuring the
> benefit of allowing them to enable CONFIG_THP_SWAP by supporting it
> properly in zswap.
>
> If some setups with zswap have CONFIG_THP_SWAP enabled then that's a
> different story, but we already have the data for this case as well
> right now in case this is a legitimate setup.
>
> Adding Chris Li here from Google. We have CONFIG_THP_SWAP disabled
> with zswap, so for us we would want to know the benefit of supporting
> CONFIG_THP_SWAP properly in zswap. At least I think so :)

Sure, this makes sense. Here's the data that I gathered with CONFIG_THP_SWAP
disabled in the "Before" baseline. We see improvements overall in throughput
and sys time for zstd and deflate-iaa, when comparing before (THP_SWAP=N)
vs. after (THP_SWAP=Y). A brief sketch of why THP_SWAP=N forces 4K pages
through zswap, and then the data, follow below.
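To make the mechanics behind the "Before" baseline concrete: with
THP_SWAP=N a large folio cannot get a swap entry, so reclaim splits it
and only the resulting 4K pages reach zswap. Here is a simplified sketch
of the three checks involved, paraphrased and abridged from the v6.11
sources as I read them (not verbatim code):

    /* mm/zswap.c before this series: zswap rejects large folios, so a
     * 64K/2M folio that *does* get a swap entry (THP_SWAP=Y on mainline)
     * bypasses zswap entirely and goes to the backing swap device.
     */
    bool zswap_store(struct folio *folio)
    {
            /* Large folios aren't supported */
            if (folio_test_large(folio))
                    return false;
            /* ... compress and store the single 4K page ... */
    }

    /* mm/swap_slots.c (abridged): a large folio only gets a swap entry
     * if CONFIG_THP_SWAP is enabled; with THP_SWAP=N, entry.val stays 0
     * and add_to_swap() fails for the folio.
     */
    swp_entry_t folio_alloc_swap(struct folio *folio)
    {
            swp_entry_t entry = { .val = 0 };

            if (folio_test_large(folio)) {
                    if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
                            get_swap_pages(1, &entry, folio_order(folio));
                    return entry;   /* THP_SWAP=N: no entry allocated */
            }
            /* ... 4K slot-cache allocation ... */
            return entry;
    }

    /* mm/vmscan.c, shrink_folio_list() (abridged): when add_to_swap()
     * fails for a large folio, the folio is split and the resulting 4K
     * pages are retried. As I read it, this fallback is what the
     * memcg_swap_fail and thp_swpout_fallback counts in the "Before"
     * columns reflect.
     */
    if (!add_to_swap(folio)) {
            if (!folio_test_large(folio))
                    goto activate_locked_split;
            /* Fallback to swap normal pages */
            if (split_folio_to_list(folio, folio_list))
                    goto activate_locked;
            /* ... count fallback events, retry add_to_swap() ... */
    }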
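One note on reading the tables: the "Change wrt Baseline" columns are the
percent improvement over the THP_SWAP=N baseline with the same compressor;
e.g., the 23% sys time improvement for 64K deflate-iaa is
(951.95 - 735.47) / 951.95, and the 8% throughput gain is
(151,938 - 140,044) / 140,044.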
 64K mTHP:
 =========

 -------------------------------------------------------------------------------
                     v6.11-rc3 mainline       zswap-mTHP             Change wrt
                     Baseline                                        Baseline
                     CONFIG_THP_SWAP=N        CONFIG_THP_SWAP=Y
 -------------------------------------------------------------------------------
 ZSWAP compressor         zstd    deflate-        zstd    deflate-   zstd deflate-
                                       iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)     136,113     140,044     140,363     151,938     3%      8%
 sys time (sec)         986.78      951.95      954.85      735.47     3%     23%
 memcg_high            124,183     127,513     138,651     133,884
 memcg_swap_high             0           0           0           0
 memcg_swap_fail       619,020     751,099           0           0
 pswpin                      0           0           0           0
 pswpout                     0           0           0           0
 zswpin                    656         569         624         639
 zswpout             9,413,603  11,284,812   9,453,761   9,385,910
 thp_swpout                  0           0           0           0
 thp_swpout_fallback         0           0           0           0
 pgmajfault              3,470       3,382       4,633       3,611
 ZSWPOUT-64kB              n/a         n/a     590,768     586,521
 SWPOUT-64kB                 0           0           0           0
 -------------------------------------------------------------------------------

 2M THP:
 =======

 -------------------------------------------------------------------------------
                     v6.11-rc3 mainline       zswap-mTHP             Change wrt
                     Baseline                                        Baseline
                     CONFIG_THP_SWAP=N        CONFIG_THP_SWAP=Y
 -------------------------------------------------------------------------------
 ZSWAP compressor         zstd    deflate-        zstd    deflate-   zstd deflate-
                                       iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)     164,220     172,523     165,005     174,536   0.5%      1%
 sys time (sec)         855.76      686.94      801.72      676.65     6%      1%
 memcg_high             14,628      16,247      14,951      16,096
 memcg_swap_high             0           0           0           0
 memcg_swap_fail        18,698      21,114           0           0
 pswpin                      0           0           0           0
 pswpout                     0           0           0           0
 zswpin                    663         665       5,333         781
 zswpout             8,419,458   8,992,065   8,546,895   9,355,760
 thp_swpout                  0           0           0           0
 thp_swpout_fallback    18,697      21,113           0           0
 pgmajfault              3,439       3,496       8,139       3,582
 ZSWPOUT-2048kB            n/a         n/a      16,684      18,270
 SWPOUT-2048kB               0           0           0           0
 -------------------------------------------------------------------------------

Thanks,
Kanchana

> >
> > Will wait for Nhat's comments as well.
> >
> > Thanks,
> > Kanchana
> >
> > > to force the mTHPs to be split and for the data to be stored in zswap?
> > > This would be a more fair Before/After comparison where the memory
> > > goes to zswap in both cases, but "Before" has to be split because of
> > > zswap's lack of support for mTHP. I assume most setups relying on
> > > zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe not.
> > > Nhat, is this something you can share?