Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@xxxxxxxxx>
> Sent: Wednesday, September 25, 2024 5:45 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; nphamcs@xxxxxxxxx;
> chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> shakeel.butt@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> akpm@xxxxxxxxxxxxxxxxxxxx; Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali,
> Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>
> "Sridhar, Kanchana P" <kanchana.p.sridhar@xxxxxxxxx> writes:
>
> >> -----Original Message-----
> >> From: Huang, Ying <ying.huang@xxxxxxxxx>
> >> Sent: Tuesday, September 24, 2024 11:35 PM
> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> >> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> >> hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; nphamcs@xxxxxxxxx;
> >> chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> >> shakeel.butt@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> >> akpm@xxxxxxxxxxxxxxxxxxxx; Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali,
> >> Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> >>
> >> Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx> writes:
> >>
> >> [snip]
> >>
> >> >
> >> > Case 1: Comparing zswap 4K vs. zswap mTHP
> >> > =========================================
> >> >
> >> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
> >> > in 64K/2M (m)THP being split into 4K folios that get processed by zswap.
> >> >
> >> > The "after" is CONFIG_THP_SWAP set to on, plus this patch-series, which
> >> > results in 64K/2M (m)THP not being split, and being processed by zswap.
> >> >
> >> > 64KB mTHP (cgroup memory.high set to 40G):
> >> > ==========================================
> >> >
> >> > -------------------------------------------------------------------------------
> >> >                      mm-unstable 9-23-2024   zswap-mTHP             Change wrt
> >> >                      CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y      Baseline
> >> > -------------------------------------------------------------------------------
> >> > ZSWAP compressor     zstd      deflate-      zstd       deflate-    zstd  deflate-
> >> >                                iaa                      iaa               iaa
> >> > -------------------------------------------------------------------------------
> >> > Throughput (KB/s)     143,323    125,485      153,550    129,609      7%    3%
> >> > elapsed time (sec)      24.97      25.42        23.90      25.19      4%    1%
> >> > sys time (sec)         822.72     750.96       757.70     731.13      8%    3%
> >> > memcg_high            132,743    169,825      148,075    192,744
> >> > memcg_swap_fail       639,067    841,553        2,204      2,215
> >> > pswpin                      0          0            0          0
> >> > pswpout                     0          0            0          0
> >> > zswpin                    795        873          760        902
> >> > zswpout            10,011,266 13,195,137   10,010,017 13,193,554
> >> > thp_swpout                  0          0            0          0
> >> > thp_swpout_                 0          0            0          0
> >> >  fallback
> >> > 64kB-mthp_            639,065    841,553        2,204      2,215
> >> >  swpout_fallback
> >> > pgmajfault              2,861      2,924        3,054      3,259
> >> > ZSWPOUT-64kB              n/a        n/a      623,451    822,268
> >> > SWPOUT-64kB                 0          0            0          0
> >> > -------------------------------------------------------------------------------
> >> >
> >>
> >> IIUC, the throughput is the sum of the throughput of all usemem processes?
> >>
> >> One possible issue of the usemem test case is the "imbalance" issue.
> >> That is, some usemem processes may swap out/swap in less, so the score is
> >> very high; while some other processes may swap out/swap in more, so the
> >> score is very low. Sometimes the total score decreases, but the scores of
> >> the usemem processes are more balanced, so that the performance should be
> >> considered better. And, in general, we should make the usemem scores
> >> balanced among processes via, say, a longer test time. Can you check this
> >> in your test results?
> >
> > Actually, the throughput data listed in the cover-letter is the average of
> > all the usemem processes. Your observation about the "imbalance" issue is
> > right. Some processes see a higher throughput than others. I have noticed
> > that the throughputs progressively reduce as the individual processes exit
> > and print their stats.
> >
> > Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
> > Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are
> > enabled, and zswap uses zstd.
> >
> >
> >   -----------------------------------------------
> >        sleep 10             sleep 30
> >   Throughput (KB/s)    Throughput (KB/s)
> >   -----------------------------------------------
> >        181,540              191,686
> >        179,651              191,459
> >        179,068              188,834
> >        177,244              187,568
> >        177,215              186,703
> >        176,565              185,584
> >        176,546              185,370
> >        176,470              185,021
> >        176,214              184,303
> >        176,128              184,040
> >        175,279              183,932
> >        174,745              180,831
> >        173,935              179,418
> >        161,546              168,014
> >        160,332              167,540
> >        160,122              167,364
> >        159,613              167,020
> >        159,546              166,590
> >        159,021              166,483
> >        158,845              166,418
> >        158,426              166,264
> >        158,396              166,066
> >        158,371              165,944
> >        158,298              165,866
> >        158,250              165,884
> >        158,057              165,533
> >        158,011              165,532
> >        157,899              165,457
> >        157,894              165,424
> >        157,839              165,410
> >        157,731              165,407
> >        157,629              165,273
> >        157,626              164,867
> >        157,581              164,636
> >        157,471              164,266
> >        157,430              164,225
> >        157,287              163,290
> >        156,289              153,597
> >        153,970              147,494
> >        148,244              147,102
> >        142,907              146,111
> >        142,811              145,789
> >        139,171              141,168
> >        136,314              140,714
> >        133,616              140,111
> >        132,881              139,636
> >        132,729              136,943
> >        132,680              136,844
> >        132,248              135,726
> >        132,027              135,384
> >        131,929              135,270
> >        131,766              134,748
> >        131,667              134,733
> >        131,576              134,582
> >        131,396              134,302
> >        131,351              134,160
> >        131,135              134,102
> >        130,885              134,097
> >        130,854              134,058
> >        130,767              134,006
> >        130,666              133,960
> >        130,647              133,894
> >        130,152              133,837
> >        130,006              133,747
> >        129,921              133,679
> >        129,856              133,666
> >        129,377              133,564
> >        128,366              133,331
> >        127,988              132,938
> >        126,903              132,746
> >   -----------------------------------------------
> >   sum                10,526,916    10,919,561
> >   average               150,385       155,994
> >   stddev                 17,551        19,633
> >   -----------------------------------------------
> >   elapsed time (sec)      24.40         43.66
> >   sys time (sec)         806.25        766.05
> >   zswpout            10,008,713    10,008,407
> >   64K folio swpout      623,463       623,629
> >   -----------------------------------------------
>
> Although there is some imbalance, I don't find it to be too much. So, I
> think the test result is reasonable. Please pay attention to the
> imbalance issue in future tests.

Sure, will do so.
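To make that easier to track, I plan to post-process the per-process usemem
scores and report the spread along with the average in future runs. Below is
a minimal helper sketch (not part of the patch-series) of what I have in
mind; it assumes I first extract the per-process throughputs into a plain
text file, one KB/s value per line:

  #!/usr/bin/env python3
  # Summarize per-process usemem throughput to quantify run imbalance.
  # Input: a text file with one per-process throughput (KB/s) per line;
  # commas in the numbers are tolerated.
  import statistics
  import sys

  def summarize(path):
      with open(path) as f:
          vals = [float(line.replace(",", "")) for line in f if line.strip()]
      mean = statistics.mean(vals)
      stdev = statistics.stdev(vals)  # sample standard deviation
      print(f"processes: {len(vals)}")
      print(f"sum:       {sum(vals):,.0f} KB/s")
      print(f"average:   {mean:,.0f} KB/s")
      print(f"stddev:    {stdev:,.0f} KB/s ({100.0 * stdev / mean:.1f}% of mean)")
      print(f"min/max:   {min(vals):,.0f} / {max(vals):,.0f} KB/s")

  if __name__ == "__main__":
      summarize(sys.argv[1])

A stddev that is a large fraction of the mean (or a large min/max gap) would
flag an imbalanced run; for example, the sleep-10 data above works out to a
stddev of roughly 12% of the mean.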
> > As we increase the time for which allocations are maintained,
> > there seems to be a slight improvement in throughput, but the
> > variance increases as well. The processes with lower throughput
> > could be the ones that handle the memcg being over its limit by
> > doing reclaim, possibly before they can allocate.
> >
> > Interestingly, the longer test time does seem to reduce the amount
> > of reclaim (hence the lower sys time), but more 64K large folios seem
> > to be reclaimed. Could this mean that with the longer test time
> > (sleep 30), more cold memory residing in large folios is getting
> > reclaimed, as against memory just relinquished by the exiting processes?
>
> I don't think a longer sleep time in the test helps much with balance. Can
> you try with fewer processes and a larger memory size per process? I guess
> that this will improve the balance.

I tried this, and the data is listed below:

usemem options:
---------------
30 processes allocate 10G each
cgroup memory limit = 150G
sleep 10
525Gi SSD disk swap partition
64K large folios enabled

Throughput (KB/s) of each of the 30 processes:

 ---------------------------------------------------------------
                       mm-unstable     zswap_store of large folios
                       9-25-2024                   v7
 zswap compressor:     zstd             zstd        deflate-iaa
 ---------------------------------------------------------------
                        38,393         234,485        374,427
                        37,283         215,528        314,225
                        37,156         214,942        304,413
                        37,143         213,073        304,146
                        36,814         212,904        290,186
                        36,277         212,304        288,212
                        36,104         212,207        285,682
                        36,000         210,173        270,661
                        35,994         208,487        256,960
                        35,979         207,788        248,313
                        35,967         207,714        235,338
                        35,966         207,703        229,335
                        35,835         207,690        221,697
                        35,793         207,418        221,600
                        35,692         206,160        219,346
                        35,682         206,128        219,162
                        35,681         205,817        219,155
                        35,678         205,546        214,862
                        35,678         205,523        214,710
                        35,677         204,951        214,282
                        35,677         204,283        213,441
                        35,677         203,348        213,011
                        35,675         203,028        212,923
                        35,673         201,922        212,492
                        35,672         201,660        212,225
                        35,672         200,724        211,808
                        35,672         200,324        211,420
                        35,671         199,686        211,413
                        35,667         198,858        211,346
                        35,667         197,590        211,209
 ---------------------------------------------------------------
 sum                  1,081,515       6,217,964      7,268,000
 average                 36,051         207,265        242,267
 stddev                     655           7,010         42,234

 elapsed time (sec)      343.70          107.40          84.34
 sys time (sec)          269.30        2,520.13       1,696.20

 memcg.high breaches    443,672         475,074        623,333
 zswpout                 22,605      48,931,249     54,777,100
 pswpout             40,004,528               0              0
 hugepages-64K zswpout        0       3,057,090      3,421,855
 hugepages-64K swpout 2,500,283               0              0
 ---------------------------------------------------------------

As you can see, this is quite a memory-constrained scenario: the cgroup in
which the 30 processes run is limited to 50% of the total memory they
require. This causes significantly more reclaim activity than the setup I
was using thus far (70 processes, 1G each, 40G limit).

The variance or "imbalance" reduces somewhat for zstd, but not for IAA.
With zswap_store of large folios, IAA shows good improvements wrt zstd:
17% in throughput, 21% in elapsed time and 33% in sys time. These are the
memory-constrained scenarios in which IAA typically does really well. IAA
verify_compress is enabled, so we also get an added data-integrity check
with IAA.

I would like to get your and the maintainers' feedback on whether I should
switch to this "usemem30-10G" setup for v8.

Thanks,
Kanchana
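P.S.: In case it helps with reviewing or reproducing the "usemem30-10G" data,
below is a rough sketch of the per-run setup, not the actual scripts used. It
assumes cgroup v2 mounted at /sys/fs/cgroup, the standard zswap and mTHP sysfs
knobs, a kernel built with CONFIG_THP_SWAP=y, and that the 150G limit is
applied via memory.high (consistent with the memcg.high breach counts above);
the cgroup name is arbitrary. The usemem invocation itself and the IAA
device/wq configuration needed for the deflate-iaa runs are omitted.

  #!/usr/bin/env python3
  # Sketch: configure zswap, 64K mTHP and the test cgroup for one run.
  import os

  def write(path, value):
      with open(path, "w") as f:
          f.write(value)

  def setup(compressor="zstd", cgroup="/sys/fs/cgroup/usemem30"):
      # Select the zswap backend: "zstd" or "deflate-iaa".
      write("/sys/module/zswap/parameters/enabled", "Y")
      write("/sys/module/zswap/parameters/compressor", compressor)

      # Enable 64K mTHP so anonymous faults can allocate large folios
      # that zswap_store() will see at reclaim time.
      write("/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled",
            "always")

      # Create the test cgroup and apply the 150G memory limit.
      os.makedirs(cgroup, exist_ok=True)
      write(os.path.join(cgroup, "memory.high"), str(150 * 1024 ** 3))

      # Move the caller into the cgroup; usemem (30 processes x 10G each,
      # sleep 10, as listed above) is then launched from within it.
      write(os.path.join(cgroup, "cgroup.procs"), str(os.getpid()))

  if __name__ == "__main__":
      setup()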