Hi Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@xxxxxxxxx>
> Sent: Wednesday, August 21, 2024 7:43 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> yosryahmed@xxxxxxxxxx; ryan.roberts@xxxxxxx; Huang, Ying <ying.huang@xxxxxxxxx>;
> 21cnbao@xxxxxxxxx; akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> On Sun, Aug 18, 2024 at 10:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/4 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@xxxxxxx/T/#u
> >
> > Additionally, there is an attempt to modularize some of the
> > functionality in zswap_store(), to make it more amenable to supporting
> > any-order mTHPs.
> >
> > For instance, the determination of whether a folio is same-filled is
> > based on mapping an index into the folio to derive the page. Likewise,
> > there is a function "zswap_store_entry" added to store a zswap_entry in
> > the xarray.
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> >   /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on
> > swapin_readahead(), using Intel IAA hardware acceleration, which we
> > would like to submit in subsequent RFC patch-series, with performance
> > improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> >    changes to count_mthp_stat() so that it's always defined, even when
> >    THP is disabled. Barry, I have also made one other change in
> >    page_io.c where count_mthp_stat() is called by count_swpout_vm_event().
> >    I would appreciate it if you can review this. Thanks!
> >    Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> >    as suggested by Ying Huang. Ying, I would appreciate it if you can
> >    review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> >    the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >    Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided
> >    in Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occurring once
> >      per folio at the beginning of zswap_store().
> >      Nhat, Ryan, please do let me know if the comments convey the summary
> >      from the RFC discussion. Thanks!
> >    - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline,
> > without and with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket, 56 cores per socket, 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, with a 4G SSD as the backing swap device
> > for ZSWAP. Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed. Following a similar methodology as in Ryan Roberts'
> > "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> > run, each allocating and writing 1G of memory:
> >
> >   usemem --init-time -w -O -n 70 1g
> >
> > Since the swapout activity generated by the 70 usemem processes had to
> > fit within the 4G SSD, I ended up using different fixed cgroup
> > memory.high limits for the 64K mTHP and 2M THP experiments:
> >
> >   64K mTHP experiments: cgroup memory fixed at 60G
> >   2M THP experiments  : cgroup memory fixed at 55G
> >
> > The vm/sysfs stats included after the performance data provide details
> > on the swapout activity to SSD/ZSWAP.
> >
> > Other kernel configuration parameters:
> >
> >   ZSWAP Compressor  : LZ4, DEFLATE-IAA
> >   ZSWAP Allocator   : ZSMALLOC
> >   SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the CRCs
> > returned by the hardware will be compared, and errors reported in case
> > of mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput reported by usemem and perf sys time for running the test
> > are as follows, averaged across 3 runs:
> >
> > 64KB mTHP (cgroup memory.high set to 60G):
> > ==========================================
> >  ------------------------------------------------------------------
> > |                    |                   |            |            |
> > |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> > |                    |                   |    KB/s    |            |
> > |--------------------|-------------------|------------|------------|
> > |v6.11-rc3 mainline  | SSD               |  335,346   |  Baseline  |
> > |zswap-mTHP-Store    | ZSWAP lz4         |  271,558   |    -19%    |
> > |zswap-mTHP-Store    | ZSWAP deflate-iaa |  388,154   |     16%    |
> > |------------------------------------------------------------------|
> > |                    |                   |            |            |
> > |Kernel              | mTHP SWAP-OUT     |  Sys time  | Improvement|
> > |                    |                   |    sec     |            |
> > |--------------------|-------------------|------------|------------|
> > |v6.11-rc3 mainline  | SSD               |   91.37    |  Baseline  |
> > |zswap-mTHP-Store    | ZSWAP lz4         |  265.43    |   -191%    |
> > |zswap-mTHP-Store    | ZSWAP deflate-iaa |  235.60    |   -158%    |
> >  ------------------------------------------------------------------
>
> Yeah no, this is not good. That throughput regression is concerning...
>
> Is this tied to lz4 only, or do you observe similar trends in other
> compressors that are not deflate-iaa?

Let me gather data with other software compressors.
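
For reference, all of the runs above use the same configuration, with only
the zswap compressor changing between the lz4 and deflate-iaa rows; the
additional software-compressor runs (e.g. zstd) will be gathered the same
way. Roughly, the per-run knob setup looks like the following sketch (the
"test" cgroup name and the exact sequencing are illustrative rather than my
actual scripts; the per-order zswpout counter file is the one added by this
series):

  # Select the zswap compressor/allocator for a given run. Trying another
  # software compressor only changes the first knob.
  echo lz4      > /sys/module/zswap/parameters/compressor
  echo zsmalloc > /sys/module/zswap/parameters/zpool
  echo Y        > /sys/module/zswap/parameters/enabled

  # Swap readahead and the per-size mTHP policy for the 64K experiments
  # (the 2M experiments use hugepages-2048kB/enabled instead).
  echo 2      > /proc/sys/vm/page-cluster
  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled

  # Fixed cgroup-v2 memory.high limit (60G for 64K mTHP, 55G for 2M THP),
  # with usemem run from inside that cgroup.
  mkdir -p /sys/fs/cgroup/test
  echo 60G > /sys/fs/cgroup/test/memory.high
  echo $$  > /sys/fs/cgroup/test/cgroup.procs
  usemem --init-time -w -O -n 70 1g

  # Counters compared before/after each run (global and per-order).
  grep -E 'pswpin|pswpout|zswpin|zswpout' /proc/vmstat
  cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
  cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/zswpout
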
> >
> >
> >  -----------------------------------------------------------------------
> > | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> > |                              |  mainline  |    Store    |    Store    |
> > |                              |            |     lz4     | deflate-iaa |
> > |-----------------------------------------------------------------------|
> > | pswpin                       |      0     |      0      |      0      |
> > | pswpout                      |   174,432  |      0      |      0      |
> > | zswpin                       |       703  |     534     |     721     |
> > | zswpout                      |     1,501  |  1,491,654  |  1,398,805  |
> > |-----------------------------------------------------------------------|
> > | thp_swpout                   |      0     |      0      |      0      |
> > | thp_swpout_fallback          |      0     |      0      |      0      |
> > | pgmajfault                   |     3,364  |    3,650    |    3,431    |
> > |-----------------------------------------------------------------------|
> > | hugepages-64kB/stats/zswpout |            |   63,200    |   63,244    |
> > |-----------------------------------------------------------------------|
> > | hugepages-64kB/stats/swpout  |    10,902  |      0      |      0      |
> >  -----------------------------------------------------------------------
> >
>
> Yeah this is not good. Something fishy is going on, if we see this
> ginormous jump from 175000 (z)swpout pages to almost 1.5 million
> pages. That's a massive jump.
>
> Either it's:
>
> 1. Your theory - zswap store keeps banging on the limit (which suggests
> incompatibility between the way zswap currently behaves and our
> reclaim logic)
>
> 2. The data here is ridiculously incompressible. We're needing to
> zswpout roughly 8.5 times the number of pages, so the saving is 8.5x
> less => we only save 11.76% of memory for each page??? That's not
> right...
>
> 3. There's an outright bug somewhere.
>
> Very suspicious.
>
> >
> > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 55G):
> > =======================================================
> >  ------------------------------------------------------------------
> > |                    |                   |            |            |
> > |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> > |                    |                   |    KB/s    |            |
> > |--------------------|-------------------|------------|------------|
> > |v6.11-rc3 mainline  | SSD               |  190,827   |  Baseline  |
> > |zswap-mTHP-Store    | ZSWAP lz4         |   32,026   |    -83%    |
> > |zswap-mTHP-Store    | ZSWAP deflate-iaa |  203,772   |      7%    |
> > |------------------------------------------------------------------|
> > |                    |                   |            |            |
> > |Kernel              | mTHP SWAP-OUT     |  Sys time  | Improvement|
> > |                    |                   |    sec     |            |
> > |--------------------|-------------------|------------|------------|
> > |v6.11-rc3 mainline  | SSD               |   27.23    |  Baseline  |
> > |zswap-mTHP-Store    | ZSWAP lz4         |  156.52    |   -475%    |
> > |zswap-mTHP-Store    | ZSWAP deflate-iaa |  171.45    |   -530%    |
> >  ------------------------------------------------------------------
>
> I'm confused. This is a *regression* right? A massive one that is -
> sys time is *more* than 5 times the old value?
>
> >
> >  -------------------------------------------------------------------------
> > | VMSTATS, mTHP ZSWAP/SSD stats  |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> > |                                |  mainline  |    Store    |    Store    |
> > |                                |            |     lz4     | deflate-iaa |
> > |-------------------------------------------------------------------------|
> > | pswpin                         |      0     |      0      |      0      |
> > | pswpout                        |   797,184  |      0      |      0      |
> > | zswpin                         |       690  |     649     |     669     |
> > | zswpout                        |     1,465  |  1,596,382  |  1,540,766  |
> > |-------------------------------------------------------------------------|
> > | thp_swpout                     |     1,557  |      0      |      0      |
> > | thp_swpout_fallback            |      0     |    3,248    |    3,752    |
>
> This is also increased, but I suppose we're just doing more
> (z)swapping out in general...
> > | pgmajfault                     |     3,726  |    6,470    |    5,691    |
> > |-------------------------------------------------------------------------|
> > | hugepages-2048kB/stats/zswpout |            |    2,416    |    2,261    |
> > |-------------------------------------------------------------------------|
> > | hugepages-2048kB/stats/swpout  |     1,557  |      0      |      0      |
> >  -------------------------------------------------------------------------
> >
>
> I'm not trying to delay this patch - I fully believe in supporting
> zswap for larger pages (both mTHP and THP - whatever the memory
> reclaim subsystem throws at us).
>
> But we need to get to the bottom of this :) These are very suspicious
> and concerning data. If this is something urgent, I can live with a
> gate to enable/disable this, but I'd much prefer we understand what's
> going on here.

Thanks for this analysis. I will debug this some more, so we can better
understand these results.

Thanks,
Kanchana