Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@xxxxxxxxx>
> Sent: Sunday, August 18, 2024 10:52 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; nphamcs@xxxxxxxxx;
> ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> "Sridhar, Kanchana P" <kanchana.p.sridhar@xxxxxxxxx> writes:
>
> > Hi Ying,
> >
> >> -----Original Message-----
> >> From: Huang, Ying <ying.huang@xxxxxxxxx>
> >> Sent: Sunday, August 18, 2024 8:17 PM
> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> >> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> >> hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; nphamcs@xxxxxxxxx;
> >> ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx; akpm@linux-foundation.org;
> >> Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K
> >> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> >> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >>
> >> Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx> writes:
> >>
> >> [snip]
> >>
> >> >
> >> > Performance Testing:
> >> > ====================
> >> > Testing of this patch-series was done with the v6.11-rc3 mainline,
> >> > without and with this patch-series, on an Intel Sapphire Rapids
> >> > server, dual-socket, 56 cores per socket, 4 IAA devices per socket.
> >> >
> >> > The system has 503 GiB RAM, with a 4G SSD as the backing swap device
> >> > for ZSWAP. Core frequency was fixed at 2500MHz.
> >> >
> >> > The vm-scalability "usemem" test was run in a cgroup whose
> >> > memory.high was fixed.
> >> > Following a similar methodology as in Ryan Roberts'
> >> > "Swap-out mTHP without splitting" series [2], 70 usemem processes
> >> > were run, each allocating and writing 1G of memory:
> >> >
> >> >     usemem --init-time -w -O -n 70 1g
> >> >
> >> > Since I was constrained to get the 70 usemem processes to generate
> >> > swapout activity with the 4G SSD, I ended up using different cgroup
> >> > memory.high fixed limits for the experiments with 64K mTHP and 2M THP:
> >> >
> >> > 64K mTHP experiments: cgroup memory fixed at 60G
> >> > 2M THP experiments  : cgroup memory fixed at 55G
> >> >
> >> > The vm/sysfs stats included after the performance data provide
> >> > details on the swapout activity to SSD/ZSWAP.
> >> >
> >> > Other kernel configuration parameters:
> >> >
> >> > ZSWAP Compressor  : LZ4, DEFLATE-IAA
> >> > ZSWAP Allocator   : ZSMALLOC
> >> > SWAP page-cluster : 2
> >> >
> >> > In the experiments where "deflate-iaa" is used as the ZSWAP
> >> > compressor, IAA "compression verification" is enabled. Hence each IAA
> >> > compression will be decompressed internally by the "iaa_crypto"
> >> > driver, the CRCs returned by the hardware will be compared, and errors
> >> > reported in case of mismatches. Thus "deflate-iaa" helps ensure better
> >> > data integrity as compared to the software compressors.
> >> >
> >> > Throughput reported by usemem and perf sys time for running the test
> >> > are as follows, averaged across 3 runs:
> >> >
> >> > 64KB mTHP (cgroup memory.high set to 60G):
> >> > ==========================================
> >> > ------------------------------------------------------------------
> >> > |                    |                   |            |            |
> >> > |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> >> > |                    |                   |    KB/s    |            |
> >> > |--------------------|-------------------|------------|------------|
> >> > |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> >> > |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
> >>
> >> zswap throughput is worse than ssd swap?
> >> This doesn't look right.
> >
> > I realize it might look that way; however, this is not an
> > apples-to-apples comparison, as explained in the latter part of my
> > analysis (after the 2M THP data tables). The primary reason for this is
> > running the test under a fixed cgroup memory limit.
> >
> > In the "Before" scenario, mTHP get swapped out to SSD. However, the
> > disk swap usage is not accounted towards checking if the cgroup's
> > memory limit has been exceeded. Hence there are relatively fewer
> > swap-outs, resulting mainly from the 1G allocations from each of the
> > 70 usemem processes working with a 60G memory limit on the parent
> > cgroup.
> >
> > However, the picture changes in the "After" scenario. mTHPs will now
> > get stored in zswap, which is accounted for in the cgroup's
> > memory.current and counts towards the fixed memory limit in effect for
> > the parent cgroup. As a result, when mTHP get stored in zswap, the
> > compressed data in the zswap zpool now counts towards the cgroup's
> > active memory and memory limit, in addition to the 1G allocations from
> > each of the 70 processes.
> >
> > As you can see, this creates more memory pressure on the cgroup,
> > resulting in more swap-outs. With lz4 as the zswap compressor, this
> > results in lower throughput wrt "Before".
> >
> > However, with IAA as the zswap compressor, the throughput with zswap
> > mTHP is better than "Before" because of better hardware compress
> > latencies, which handle the higher swap-out activity without
> > compromising on throughput.
> >
> >>
> >> > |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |        16% |
> >> > |------------------------------------------------------------------|
> >> > |                    |                   |            |            |
> >> > |Kernel              | mTHP SWAP-OUT     |  Sys time  | Improvement|
> >> > |                    |                   |    sec     |            |
> >> > |--------------------|-------------------|------------|------------|
> >> > |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
> >> > |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
> >> > |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |      -158% |
> >> > ------------------------------------------------------------------
> >> >
> >> > -----------------------------------------------------------------------
> >> > | VMSTATS, mTHP ZSWAP/SSD stats| v6.11-rc3 | zswap-mTHP | zswap-mTHP  |
> >> > |                              | mainline  | Store      | Store       |
> >> > |                              |           | lz4        | deflate-iaa |
> >> > |-----------------------------------------------------------------------|
> >> > | pswpin                       | 0         | 0          | 0           |
> >> > | pswpout                      | 174,432   | 0          | 0           |
> >> > | zswpin                       | 703       | 534        | 721         |
> >> > | zswpout                      | 1,501     | 1,491,654  | 1,398,805   |
> >>
> >> It appears that the number of swapped pages for zswap is much larger
> >> than that of SSD swap. Why? I guess this is why zswap throughput is
> >> worse.
> >
> > Your observation is correct. I hope the above explanation helps as to
> > the reasoning behind this.
>
> Before:
> (174432 + 1501) * 4 / 1024 = 687.2 MB
>
> After:
> 1491654 * 4.0 / 1024 = 5826.8 MB
>
> From your previous words, 10GB memory should be swapped out.
>
> Even if the average compression ratio is 0, the swap-out count of zswap
> should be about 100% more than that of SSD. However, the ratio here
> appears unreasonable.

Excellent point!
In order to understand this better myself, I ran usemem with 1 process
that tries to allocate 58G:

  cgroup memory.high = 60,000,000,000

  usemem --init-time -w -O -n 1 58g

 usemem -n 1 58g     Before        After
 ----------------------------------------------
 pswpout            586,352            0
 zswpout              1,005    1,042,963
 ----------------------------------------------
 Total swapout      587,357    1,042,963
 ----------------------------------------------

In the case where the cgroup has only 1 process, your rationale above
applies (more or less).

The following shows the stats collected every 100 microseconds from the
critical section of the workload, right before the memory limit is
reached (Before and After):

===========================================================================
BEFORE zswap_store mTHP:
===========================================================================
 cgroup_memory   cgroup_memory   zswap_pool   zram_compr
                 w/o zswap       _total_size  _data_size
---------------------------------------------------------------------------
 59,999,600,640  59,999,600,640  0            74
 59,999,911,936  59,999,911,936  0            14,139,441
 60,000,083,968  59,997,634,560  2,449,408    53,448,205
 59,999,952,896  59,997,503,488  2,449,408    93,477,490
 60,000,083,968  59,997,634,560  2,449,408    133,152,754
 60,000,083,968  59,997,634,560  2,449,408    172,628,328
 59,999,952,896  59,997,503,488  2,449,408    212,760,840
 60,000,083,968  59,997,634,560  2,449,408    251,999,675
 60,000,083,968  59,997,634,560  2,449,408    291,058,130
 60,000,083,968  59,997,634,560  2,449,408    329,655,206
 59,999,793,152  59,997,343,744  2,449,408    368,938,904
 59,999,924,224  59,997,474,816  2,449,408    408,652,723
 59,999,924,224  59,997,474,816  2,449,408    447,830,071
 60,000,055,296  59,997,605,888  2,449,408    487,776,082
 59,999,924,224  59,997,474,816  2,449,408    526,826,360
 60,000,055,296  59,997,605,888  2,449,408    566,193,520
 60,000,055,296  59,997,605,888  2,449,408    604,625,879
 60,000,055,296  59,997,605,888  2,449,408    642,545,706
 59,999,924,224  59,997,474,816  2,449,408    681,958,173
 59,999,924,224  59,997,474,816  2,449,408    721,908,162
 59,999,924,224  59,997,474,816  2,449,408    761,935,307
 59,999,924,224  59,997,474,816  2,449,408    802,014,594
 59,999,924,224  59,997,474,816  2,449,408    842,087,656
 59,999,924,224  59,997,474,816  2,449,408    883,889,588
 59,999,924,224  59,997,474,816  2,449,408    804,458,184
 59,999,793,152  59,997,343,744  2,449,408    94,150,548
 54,938,513,408  54,936,064,000  2,449,408    172,644
 29,492,523,008  29,490,073,600  2,449,408    172,644
 3,465,621,504   3,463,172,096   2,449,408    131,457
---------------------------------------------------------------------------

===========================================================================
AFTER zswap_store mTHP:
===========================================================================
 cgroup_memory   cgroup_memory   zswap_pool
                 w/o zswap       _total_size
---------------------------------------------------------------------------
 55,578,234,880  55,578,234,880  0
 56,104,095,744  56,104,095,744  0
 56,644,898,816  56,644,898,816  0
 57,184,653,312  57,184,653,312  0
 57,706,057,728  57,706,057,728  0
 58,226,937,856  58,226,937,856  0
 58,747,293,696  58,747,293,696  0
 59,275,776,000  59,275,776,000  0
 59,793,772,544  59,793,772,544  0
 60,000,141,312  60,000,141,312  0
 59,999,956,992  59,999,956,992  0
 60,000,169,984  60,000,169,984  0
 59,999,907,840  59,951,226,880  48,680,960
 60,000,169,984  59,900,010,496  100,159,488
 60,000,169,984  59,848,007,680  152,162,304
 60,000,169,984  59,795,513,344  204,656,640
 59,999,907,840  59,743,477,760  256,430,080
 60,000,038,912  59,692,097,536  307,941,376
 60,000,169,984  59,641,208,832  358,961,152
 60,000,038,912  59,589,992,448  410,046,464
 60,000,169,984  59,539,005,440  461,164,544
 60,000,169,984  59,487,657,984  512,512,000
 60,000,038,912  59,434,868,736  565,170,176
 60,000,038,912  59,383,259,136  616,779,776
 60,000,169,984  59,331,518,464  668,651,520
 60,000,169,984  59,279,843,328  720,326,656
 60,000,169,984  59,228,626,944  771,543,040
 59,999,907,840  59,176,984,576  822,923,264
 60,000,038,912  59,124,326,400  875,712,512
 60,000,169,984  59,072,454,656  927,715,328
 60,000,169,984  59,020,156,928  980,013,056
 60,000,038,912  58,966,974,464  1,033,064,448
 60,000,038,912  58,913,628,160  1,086,410,752
 60,000,038,912  58,858,840,064  1,141,198,848
 60,000,169,984  58,804,314,112  1,195,855,872
 59,999,907,840  58,748,936,192  1,250,971,648
 60,000,169,984  58,695,131,136  1,305,038,848
 60,000,169,984  58,642,800,640  1,357,369,344
 60,000,169,984  58,589,782,016  1,410,387,968
 60,000,038,912  58,535,124,992  1,464,913,920
 60,000,169,984  58,482,925,568  1,517,244,416
 60,000,169,984  58,429,775,872  1,570,394,112
 60,000,038,912  58,376,658,944  1,623,379,968
 60,000,169,984  58,323,247,104  1,676,922,880
 60,000,038,912  58,271,113,216  1,728,925,696
 60,000,038,912  58,216,292,352  1,783,746,560
 60,000,038,912  58,164,289,536  1,835,749,376
 60,000,038,912  58,112,090,112  1,887,948,800
 60,000,038,912  58,058,350,592  1,941,688,320
 59,999,907,840  58,004,971,520  1,994,936,320
 60,000,169,984  57,953,165,312  2,047,004,672
 59,999,907,840  57,900,277,760  2,099,630,080
 60,000,038,912  57,847,586,816  2,152,452,096
 60,000,169,984  57,793,421,312  2,206,748,672
 59,999,907,840  57,741,582,336  2,258,325,504
 60,012,826,624  57,734,840,320  2,277,986,304
 60,098,793,472  57,820,348,416  2,278,445,056
 60,176,334,848  57,897,889,792  2,278,445,056
 60,269,826,048  57,991,380,992  2,278,445,056
 59,687,481,344  57,851,977,728  1,835,503,616
 59,049,836,544  57,888,108,544  1,161,728,000
 58,406,068,224  57,929,551,872  476,516,352
 43,837,923,328  43,837,919,232  4,096
 18,124,546,048  18,124,541,952  4,096
 2,846,720       2,842,624       4,096
---------------------------------------------------------------------------

I have also attached plots of the memory pressure reported by PSI. Both
these sets of data should give a sense of the added memory pressure on the
cgroup because of zswap mTHP stores. The data shows that the cgroup is
over the limit much more frequently in the "After" case than in "Before".
However, the rationale that you suggested seems more reasonable and
apparent in the 1-process case.
However, with 70 processes trying to allocate 1G each, things get more
complicated. These are the functions that should provide more clarity:

[1] mm/memcontrol.c: mem_cgroup_handle_over_high().
[2] mm/memcontrol.c: try_charge_memcg().
[3] include/linux/resume_user_mode.h: resume_user_mode_work().

At a high level, when the zswap mTHP compressed pool usage starts counting
towards the cgroup's memory.current, there are two inter-related effects
that ultimately cause more reclaim to happen:

1) When a process reclaims a folio and zswap_store() writes out each page
   in the folio, it charges the compressed size to the memcg via
   "obj_cgroup_charge_zswap(objcg, entry->length);". This calls [2],
   which sets current->memcg_nr_pages_over_high if the limit is exceeded.
   The comments towards the end of [2] are relevant.

2) When each of the processes returns from a page-fault, it checks in [3]
   whether the cgroup memory usage is over the limit, and if so, triggers
   reclaim via [1]. I confirmed that in the case of usemem, all calls to
   [1] occur from the code path in [3].

My takeaway is that the more reclaim goes through zswap_store(), e.g.,
from mTHP folios, the higher the likelihood that an overage is recorded
per-process in current->memcg_nr_pages_over_high. This could cause every
process to reclaim memory, even though the swapout from just a few of the
70 processes might already have brought the parent cgroup under the limit.

Please do let me know if you have any other questions. Appreciate your
feedback and comments.
Thanks,
Kanchana

>
> --
> Best Regards,
> Huang, Ying
>
> > Thanks,
> > Kanchana
> >
> >>
> >> > |-----------------------------------------------------------------------|
> >> > | thp_swpout                   | 0         | 0          | 0           |
> >> > | thp_swpout_fallback          | 0         | 0          | 0           |
> >> > | pgmajfault                   | 3,364     | 3,650      | 3,431       |
> >> > |-----------------------------------------------------------------------|
> >> > | hugepages-64kB/stats/zswpout |           | 63,200     | 63,244      |
> >> > |-----------------------------------------------------------------------|
> >> > | hugepages-64kB/stats/swpout  | 10,902    | 0          | 0           |
> >> > -----------------------------------------------------------------------
> >> >
> >>
> >> [snip]
> >>
> >> --
> >> Best Regards,
> >> Huang, Ying
Attachment:
usemem_n1_Before.jpg
Description: usemem_n1_Before.jpg
Attachment:
usemem_n1_After.jpg
Description: usemem_n1_After.jpg