Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios

Yosry Ahmed <yosryahmed@xxxxxxxxxx> writes:

> On Thu, Aug 29, 2024 at 2:27 PM Kanchana P Sridhar
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
>>
>> Hi All,
>>
>> This patch-series enables zswap_store() to accept and store mTHP
>> folios. The most significant contribution in this series is from the
>> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
>> migrated to v6.11-rc3 in patch 2/3 of this series.
>>
>> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@xxxxxxx/T/#u
>>
>> Additionally, there is an attempt to modularize some of the functionality
>> in zswap_store(), to make it more amenable to supporting any-order
>> mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
>> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
>> delete all offsets corresponding to a higher order folio stored in zswap.
>>
>> For accounting purposes, the patch-series adds per-order mTHP sysfs
>> "zswpout" counters that get incremented upon successful zswap_store of
>> an mTHP folio:
>>
>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
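As a rough illustration (the loop and guard are mine, only the sysfs path is from the series), the per-order counters could be dumped with:

```shell
#!/bin/sh
# Sketch: dump the per-order mTHP "zswpout" counters added by this series.
# On a kernel without the series the stats files are simply absent and
# nothing is printed.
base=/sys/kernel/mm/transparent_hugepage
for d in "$base"/hugepages-*kB; do
    [ -r "$d/stats/zswpout" ] || continue
    printf '%s: %s\n' "${d##*/}" "$(cat "$d/stats/zswpout")"
done
```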
>>
>> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
>> will enable/disable zswap storing of (m)THP. When disabled, zswap will
>> fall back to rejecting the mTHP folio, to be processed by the backing
>> swap device.
>>
>> This patch-series is a precursor to ZSWAP compress batching of mTHP
>> swap-out and decompress batching of swap-ins based on swapin_readahead(),
>> using Intel IAA hardware acceleration, which we would like to submit in
>> subsequent RFC patch-series, with performance improvement data.
>>
>> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>>
>> Thanks also to Nhat, Yosry and Barry for their helpful feedback, data
>> reviews and suggestions!
>>
>> Changes since v5:
>> =================
>> 1) Rebased to mm-unstable as of 8/29/2024,
>>    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
>> 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
>>    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
>>    suggestion to add a knob by which users can enable/disable this
>>    change. Nhat, I hope this is along the lines of what you were
>>    thinking.
>> 3) Added vm-scalability usemem data with 4K folios with
>>    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
>>    there is no regression with this change.
>> 4) Added data with usemem with 64K and 2M THP for an alternate view of
>>    before/after, as suggested by Yosry, so we can understand the impact
>>    of when mTHPs are split into 4K folios in shrink_folio_list()
>>    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
>>    in zswap. Thanks Yosry for this suggestion.
>>
>> Changes since v4:
>> =================
>> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
>>    Nhat for the data reviews!).
>> 2) Rebased to mm-unstable from 8/27/2024,
>>    commit b659edec079c90012cf8d05624e312d1062b8b87.
>> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
>>    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
>>    robot; as per Nhat's and Michal's suggestion to not require a separate
>>    patch to fix the build errors (thanks both!).
>> 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
>>    suggested by Yosry (Thanks Yosry!).
>> 5) Squashed the commits that define new mthp zswpout stat counters, and
>>    invoke count_mthp_stat() after successful zswap_store()s; into a single
>>    commit. Thanks Yosry for this suggestion!
>>
>> Changes since v3:
>> =================
>> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
>>    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
>>    changes to count_mthp_stat() so that it's always defined, even when THP
>>    is disabled. Barry, I have also made one other change in page_io.c
>>    where count_mthp_stat() is called by count_swpout_vm_event(). I would
>>    appreciate it if you can review this. Thanks!
>>    Hopefully this should resolve the kernel robot build errors.
>>
>> Changes since v2:
>> =================
>> 1) Gathered usemem data using SSD as the backing swap device for zswap,
>>    as suggested by Ying Huang. Ying, I would appreciate it if you can
>>    review the latest data. Thanks!
>> 2) Generated the base commit info in the patches to attempt to address
>>    the kernel test robot build errors.
>> 3) No code changes to the individual patches themselves.
>>
>> Changes since RFC v1:
>> =====================
>>
>> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>>    Thanks Barry!
>> 2) Addressed some of the code review comments that Nhat Pham provided in
>>    Ryan's initial RFC [1]:
>>    - Added a comment about the cgroup zswap limit checks occurring once per
>>      folio at the beginning of zswap_store().
>>      Nhat, Ryan, please do let me know if the comments convey the summary
>>      from the RFC discussion. Thanks!
>>    - Posted data on running the cgroup suite's zswap kselftest.
>> 3) Rebased to v6.11-rc3.
>> 4) Gathered performance data with usemem and the rebased patch-series.
>>
>>
>> Regression Testing:
>> ===================
>> I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K
>> folios with mm-unstable and with this patch-series. The main goal was
>> to make sure that there is no functional or performance regression
>> wrt the earlier zswap behavior for 4K folios when
>> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
>> pages goes through the newly added code path [zswap_store(),
>> zswap_store_page()].
>>
>> The data indicates there is no regression.
>>
>>  ------------------------------------------------------------------------------
>>                      mm-unstable 8-28-2024                        zswap-mTHP v6
>>                                               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
>>                                                                      is not set
>>  ------------------------------------------------------------------------------
>>  ZSWAP compressor        zstd     deflate-                     zstd    deflate-
>>                                        iaa                                  iaa
>>  ------------------------------------------------------------------------------
>>  Throughput (KB/s)    110,775      113,010               111,550        121,937
>>  sys time (sec)      1,141.72       954.87              1,131.95         828.47
>>  memcg_high           140,500      153,737               139,772        134,129
>>  memcg_swap_high            0            0                     0              0
>>  memcg_swap_fail            0            0                     0              0
>>  pswpin                     0            0                     0              0
>>  pswpout                    0            0                     0              0
>>  zswpin                   675          690                   682            684
>>  zswpout            9,552,298   10,603,271             9,566,392      9,267,213
>>  thp_swpout                 0            0                     0              0
>>  thp_swpout_                0            0                     0              0
>>   fallback
>>  pgmajfault             3,453        3,468                 3,841          3,487
>>  ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
>>  SWPOUT-64kB-mTHP           0            0                     0              0
>>  ------------------------------------------------------------------------------
>>
>>
>> Performance Testing:
>> ====================
>> Testing of this patch-series was done with the v6.11-rc3 mainline, without
>> and with this patch-series, on an Intel Sapphire Rapids server,
>> dual-socket 56 cores per socket, 4 IAA devices per socket.
>>
>> The system has 503 GiB RAM, with 176GiB ZRAM (35% of available RAM) as the
>> backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
>> Core frequency was fixed at 2500MHz.
>>
>> The vm-scalability "usemem" test was run in a cgroup whose memory.high
>> was fixed at 40G. There is no swap limit set for the cgroup. Following a
>> similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
>> series [2], 70 usemem processes were run, each allocating and writing 1G of
>> memory:
>>
>>     usemem --init-time -w -O -n 70 1g
>>
>> The vm/sysfs mTHP stats included with the performance data provide details
>> on the swapout activity to ZSWAP/swap.
>>
>> Other kernel configuration parameters:
>>
>>     ZSWAP Compressors : zstd, deflate-iaa
>>     ZSWAP Allocator   : zsmalloc
>>     SWAP page-cluster : 2
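For reference, these knobs map onto the standard zswap module parameters and vm sysctl; a configuration sketch (paths are the upstream ones, values taken from the setup above, needs root):

```shell
# Select the zswap compressor and allocator, and the swap readahead window.
echo zstd     > /sys/module/zswap/parameters/compressor   # or deflate-iaa
echo zsmalloc > /sys/module/zswap/parameters/zpool
echo 2        > /proc/sys/vm/page-cluster                 # 2^2 = 4-page clusters
```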
>>
>> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
>> IAA "compression verification" is enabled. Hence each IAA compression
>> is decompressed internally by the "iaa_crypto" driver, the CRCs
>> returned by the hardware are compared, and errors are reported in case
>> of mismatches. Thus "deflate-iaa" helps ensure better data integrity
>> compared to the software compressors.
>>
>> Throughput is derived by averaging the individual 70 processes' throughputs
>> reported by usemem. sys time is measured with perf. All data points are
>> averaged across 3 runs.
>>
>> Case 1: Baseline with CONFIG_THP_SWAP turned off, and mTHP is split in reclaim.
>> ===============================================================================
>>
>> In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
>> in 64K/2M (m)THP being split, and only 4K folios processed by zswap.
>>
>> The "after" is CONFIG_THP_SWAP set to on, with this patch-series, which
>> results in 64K/2M (m)THP not being split, and being processed by zswap.
>>
>>  64KB mTHP (cgroup memory.high set to 40G):
>>  ==========================================
>>
>>  -------------------------------------------------------------------------------
>>                        v6.11-rc3 mainline              zswap-mTHP     Change wrt
>>                                  Baseline                               Baseline
>>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
>>  -------------------------------------------------------------------------------
>>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>>                                       iaa                     iaa            iaa
>>  -------------------------------------------------------------------------------
>>  Throughput (KB/s)   136,113      140,044     140,363     151,938    3%       8%
>>  sys time (sec)       986.78       951.95      954.85      735.47    3%      23%
>>  memcg_high          124,183      127,513     138,651     133,884
>>  memcg_swap_high           0            0           0           0
>>  memcg_swap_fail     619,020      751,099           0           0
>>  pswpin                    0            0           0           0
>>  pswpout                   0            0           0           0
>>  zswpin                  656          569         624         639
>>  zswpout           9,413,603   11,284,812   9,453,761   9,385,910
>>  thp_swpout                0            0           0           0
>>  thp_swpout_               0            0           0           0
>>   fallback
>>  pgmajfault            3,470        3,382       4,633       3,611
>>  ZSWPOUT-64kB            n/a          n/a     590,768     586,521
>>  SWPOUT-64kB               0            0           0           0
>>  -------------------------------------------------------------------------------
>>
>>
>>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>>  =======================================================
>>
>>  ------------------------------------------------------------------------------
>>                        v6.11-rc3 mainline              zswap-mTHP    Change wrt
>>                                  Baseline                              Baseline
>>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y
>>  ------------------------------------------------------------------------------
>>  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
>>                                      iaa                     iaa            iaa
>>  ------------------------------------------------------------------------------
>>  Throughput (KB/s)    164,220    172,523      165,005     174,536  0.5%      1%
>>  sys time (sec)        855.76     686.94       801.72      676.65    6%      1%
>>  memcg_high            14,628     16,247       14,951      16,096
>>  memcg_swap_high            0          0            0           0
>>  memcg_swap_fail       18,698     21,114            0           0
>>  pswpin                     0          0            0           0
>>  pswpout                    0          0            0           0
>>  zswpin                   663        665        5,333         781
>>  zswpout            8,419,458  8,992,065    8,546,895   9,355,760
>>  thp_swpout                 0          0            0           0
>>  thp_swpout_           18,697     21,113            0           0
>>   fallback
>>  pgmajfault             3,439      3,496        8,139       3,582
>>  ZSWPOUT-2048kB           n/a        n/a       16,684      18,270
>>  SWPOUT-2048kB              0          0            0           0
>>  -----------------------------------------------------------------------------
>>
>> We see improvements overall in throughput and sys time for zstd and
>> deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
>>
>>
>> Case 2: Baseline with CONFIG_THP_SWAP enabled.
>> ==============================================
>>
>> In this scenario, the "before" represents zswap rejecting mTHP, and the mTHP
>> being stored by the backing swap device.
>>
>> The "after" represents data with this patch-series, that results in 64K/2M
>> (m)THP being processed by zswap.
>>
>>  64KB mTHP (cgroup memory.high set to 40G):
>>  ==========================================
>>
>>  ------------------------------------------------------------------------------
>>                      v6.11-rc3 mainline              zswap-mTHP      Change wrt
>>                                Baseline                                Baseline
>>  ------------------------------------------------------------------------------
>>  ZSWAP compressor       zstd   deflate-        zstd    deflate-   zstd deflate-
>>                                     iaa                     iaa             iaa
>>  ------------------------------------------------------------------------------
>>  Throughput (KB/s)   161,496    156,343     140,363     151,938   -13%      -3%
>>  sys time (sec)       771.68     802.08      954.85      735.47   -24%       8%
>>  memcg_high          111,223    110,889     138,651     133,884
>>  memcg_swap_high           0          0           0           0
>>  memcg_swap_fail           0          0           0           0
>>  pswpin                   16         16           0           0
>>  pswpout           7,471,472  7,527,963           0           0
>>  zswpin                  635        605         624         639
>>  zswpout               1,509      1,478   9,453,761   9,385,910
>>  thp_swpout                0          0           0           0
>>  thp_swpout_               0          0           0           0
>>   fallback
>>  pgmajfault            3,616      3,430       4,633       3,611
>>  ZSWPOUT-64kB            n/a        n/a     590,768     586,521
>>  SWPOUT-64kB         466,967    470,498           0           0
>>  ------------------------------------------------------------------------------
>>
>>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>>  =======================================================
>>
>>  ------------------------------------------------------------------------------
>>                       v6.11-rc3 mainline              zswap-mTHP     Change wrt
>>                                 Baseline                               Baseline
>>  ------------------------------------------------------------------------------
>>  ZSWAP compressor       zstd    deflate-        zstd    deflate-  zstd deflate-
>>                                      iaa                     iaa            iaa
>>  ------------------------------------------------------------------------------
>>  Throughput (KB/s)    192,164    194,643     165,005     174,536  -14%     -10%
>>  sys time (sec)        823.55     830.42      801.72      676.65    3%      19%
>>  memcg_high            16,054     15,936      14,951      16,096
>>  memcg_swap_high            0          0           0           0
>>  memcg_swap_fail            0          0           0           0
>>  pswpin                     0          0           0           0
>>  pswpout            8,629,248  8,628,907           0           0
>>  zswpin                   560        645       5,333         781
>>  zswpout                1,416      1,503   8,546,895   9,355,760
>>  thp_swpout            16,854     16,853           0           0
>>  thp_swpout_                0          0           0           0
>>   fallback
>>  pgmajfault             3,341      3,574       8,139       3,582
>>  ZSWPOUT-2048kB           n/a        n/a      16,684      18,270
>>  SWPOUT-2048kB         16,854     16,853           0           0
>>  ------------------------------------------------------------------------------
>>
>> In the "Before" scenario, when zswap does not store mTHP, only allocations
>> count towards the cgroup memory limit. However, in the "After" scenario,
>> with the introduction of zswap_store() of mTHP, both the allocations and
>> the zswap compressed pool usage from all 70 processes are counted towards
>> the memory limit. As a result, we see higher swapout activity in the
>> "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
>> charge leads to more frequent memory.high breaches.
>>
>> This causes degradation in throughput and sys time with zswap mTHP, more so
>> in case of zstd than deflate-iaa. Compress latency could play a part in
>> this - when there is more swapout activity happening, a slower compressor
>> would cause allocations to stall for any/all of the 70 processes.
>
> We are basically comparing zram with zswap in this case, and it's not
> fair because, as you mentioned, the zswap compressed data is being
> accounted for while the zram compressed data isn't. I am not really
> sure how valuable these test results are. Even if we remove the cgroup
> accounting from zswap, we won't see an improvement, we should expect a
> similar performance to zram.
>
> I think the test results that are really valuable are case 1, where
> zswap users are currently disabling CONFIG_THP_SWAP, and get to enable
> it after this series.
>
> If we really want to compare CONFIG_THP_SWAP on before and after, it
> should be with SSD because that's a more conventional setup. In this
> case the users that have CONFIG_THP_SWAP=y only experience the
> benefits of zswap with this series.

Yes.  I think so too.

> You mentioned experimenting with
> usemem to keep the memory allocated longer so that you're able to have
> a fair test with the small SSD swap setup. Did that work?

Looking forward to the results of this test too.

> I am hoping Nhat or Johannes would shed some light on whether they
> usually have CONFIG_THP_SWAP enabled or not with zswap. I am trying to
> figure out if any reasonable setups enable CONFIG_THP_SWAP with zswap.
> Otherwise the testing results from case 1 should be sufficient.

I guess that even if 2MB THP swapping may not be popular, 64KB mTHP
swapping to SSD or zswap looks much more appealing.

>>
>> In my opinion, even though the test set up does not provide an accurate
>> way for a direct before/after comparison (because of zswap usage being
>> counted in cgroup, hence towards the memory.high), it still seems
>> reasonable for zswap_store to support (m)THP, so that further performance
>> improvements can be implemented.
>
> This is only referring to the results of case 2, right?
>
> Honestly, I wouldn't want to merge mTHP swapout support on its own
> just because it enables further performance improvements without
> having actual patches for them. But I don't think this captures the
> results accurately as it dismisses case 1 results (which I think are
> more reasonable).

--
Best Regards,
Huang, Ying
