Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios

"Huang, Ying" <ying.huang@xxxxxxxxx> · Thu, 26 Sep 2024 08:44:36 +0800

"Sridhar, Kanchana P" <kanchana.p.sridhar@xxxxxxxxx> writes:

>> -----Original Message-----
>> From: Huang, Ying <ying.huang@xxxxxxxxx>
>> Sent: Tuesday, September 24, 2024 11:35 PM
>> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
>> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
>> hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; nphamcs@xxxxxxxxx;
>> chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
>> shakeel.butt@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
>> akpm@xxxxxxxxxxxxxxxxxxxx; Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali,
>> Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
>> <vinodh.gopal@xxxxxxxxx>
>> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>> 
>> Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx> writes:
>> 
>> [snip]
>> 
>> >
>> > Case 1: Comparing zswap 4K vs. zswap mTHP
>> > =========================================
>> >
>> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
>> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
>> >
>> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that
>> results
>> > in 64K/2M (m)THP to not be split, and processed by zswap.
>> >
>> >  64KB mTHP (cgroup memory.high set to 40G):
>> >  ==========================================
>> >
>> >  -------------------------------------------------------------------------------
>> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
>> >                                  Baseline
>> >  -------------------------------------------------------------------------------
>> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>> >                                       iaa                     iaa            iaa
>> >  -------------------------------------------------------------------------------
>> >  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
>> >  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
>> >  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
>> >  memcg_high          132,743      169,825     148,075     192,744
>> >  memcg_swap_fail     639,067      841,553       2,204       2,215
>> >  pswpin                    0            0           0           0
>> >  pswpout                   0            0           0           0
>> >  zswpin                  795          873         760         902
>> >  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
>> >  thp_swpout                0            0           0           0
>> >  thp_swpout_               0            0           0           0
>> >   fallback
>> >  64kB-mthp_          639,065      841,553       2,204       2,215
>> >   swpout_fallback
>> >  pgmajfault            2,861        2,924       3,054       3,259
>> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
>> >  SWPOUT-64kB               0            0           0           0
>> >  -------------------------------------------------------------------------------
>> >
>> 
>> IIUC, the throughput is the sum of throughput of all usemem processes?
>> 
>> One possible issue of usemem test case is the "imbalance" issue.  That
>> is, some usemem processes may swap-out/swap-in less, so the score is
>> very high; while some other processes may swap-out/swap-in more, so the
>> score is very low.  Sometimes, the total score decreases, but the scores
>> of usemem processes are more balanced, so that the performance should be
>> considered better.  And, in general, we should make usemem score
>> balanced among processes via say longer test time.  Can you check this
>> in your test results?
>
> Actually, the throughput data listed in the cover-letter is the average of
> all the usemem processes. Your observation about the "imbalance" issue is
> right. Some processes see a higher throughput than others. I have noticed
> that the throughputs progressively reduce as the individual processes exit
> and print their stats.
>
> Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
> Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are
> enabled, zswap uses zstd. 
>
>
> -----------------------------------------------
>                sleep 10           sleep 30
>       Throughput (KB/s)  Throughput (KB/s)
>  -----------------------------------------------
>                 181,540            191,686
>                 179,651            191,459
>                 179,068            188,834
>                 177,244            187,568
>                 177,215            186,703
>                 176,565            185,584
>                 176,546            185,370
>                 176,470            185,021
>                 176,214            184,303
>                 176,128            184,040
>                 175,279            183,932
>                 174,745            180,831
>                 173,935            179,418
>                 161,546            168,014
>                 160,332            167,540
>                 160,122            167,364
>                 159,613            167,020
>                 159,546            166,590
>                 159,021            166,483
>                 158,845            166,418
>                 158,426            166,264
>                 158,396            166,066
>                 158,371            165,944
>                 158,298            165,866
>                 158,250            165,884
>                 158,057            165,533
>                 158,011            165,532
>                 157,899            165,457
>                 157,894            165,424
>                 157,839            165,410
>                 157,731            165,407
>                 157,629            165,273
>                 157,626            164,867
>                 157,581            164,636
>                 157,471            164,266
>                 157,430            164,225
>                 157,287            163,290
>                 156,289            153,597
>                 153,970            147,494
>                 148,244            147,102
>                 142,907            146,111
>                 142,811            145,789
>                 139,171            141,168
>                 136,314            140,714
>                 133,616            140,111
>                 132,881            139,636
>                 132,729            136,943
>                 132,680            136,844
>                 132,248            135,726
>                 132,027            135,384
>                 131,929            135,270
>                 131,766            134,748
>                 131,667            134,733
>                 131,576            134,582
>                 131,396            134,302
>                 131,351            134,160
>                 131,135            134,102
>                 130,885            134,097
>                 130,854            134,058
>                 130,767            134,006
>                 130,666            133,960
>                 130,647            133,894
>                 130,152            133,837
>                 130,006            133,747
>                 129,921            133,679
>                 129,856            133,666
>                 129,377            133,564
>                 128,366            133,331
>                 127,988            132,938
>                 126,903            132,746
>  -----------------------------------------------
>       sum    10,526,916         10,919,561
>   average       150,385            155,994
>    stddev        17,551             19,633
>  -----------------------------------------------
>     elapsed       24.40              43.66
>  time (sec)
>    sys time      806.25             766.05
>       (sec)
>     zswpout  10,008,713         10,008,407
>   64K folio     623,463            623,629
>      swpout
>  -----------------------------------------------

Although there are some imbalance, I don't find it's too much.  So, I
think the test result is reasonable.  Please pay attention to the
imbalance issue in the future tests.

> As we increase the time for which allocations are maintained,
> there seems to be a slight improvement in throughput, but the
> variance increases as well. The processes with lower throughput
> could be the ones that handle the memcg being over limit by
> doing reclaim, possibly before they can allocate.
>
> Interestingly, the longer test time does seem to reduce the amount
> of reclaim (hence lower sys time), but more 64K large folios seem to
> be reclaimed. Could this mean that with longer test time (sleep 30),
> more cold memory residing in large folios is getting reclaimed, as
> against memory just relinquished by the exiting processes?

I don't think longer sleep time in test helps much to balance.  Can you
try with less process, and larger memory size per process?  I guess that
this will improve balance.

--
Best Regards,
Huang, Ying