> -----Original Message-----
> From: Huang, Ying <ying.huang@xxxxxxxxx>
> Sent: Tuesday, September 24, 2024 11:35 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; nphamcs@xxxxxxxxx;
> chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> shakeel.butt@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> akpm@xxxxxxxxxxxxxxxxxxxx; Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali,
> Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>
> Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx> writes:
>
> [snip]
>
> >
> > Case 1: Comparing zswap 4K vs. zswap mTHP
> > =========================================
> >
> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results in
> > 64K/2M (m)THP being split into 4K folios that get processed by zswap.
> >
> > The "after" is CONFIG_THP_SWAP set to on, plus this patch-series, which results
> > in 64K/2M (m)THP not being split, and being processed by zswap.
> >
> > 64KB mTHP (cgroup memory.high set to 40G):
> > ==========================================
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-23-2024       zswap-mTHP            Change wrt
> >                     CONFIG_THP_SWAP=N           CONFIG_THP_SWAP=Y       Baseline
> >                     Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor        zstd    deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)    143,323     125,485     153,550     129,609    7%      3%
> >  elapsed time (sec)     24.97       25.42       23.90       25.19    4%      1%
> >  sys time (sec)        822.72      750.96      757.70      731.13    8%      3%
> >  memcg_high           132,743     169,825     148,075     192,744
> >  memcg_swap_fail      639,067     841,553       2,204       2,215
> >  pswpin                     0           0           0           0
> >  pswpout                    0           0           0           0
> >  zswpin                   795         873         760         902
> >  zswpout           10,011,266  13,195,137  10,010,017  13,193,554
> >  thp_swpout                 0           0           0           0
> >  thp_swpout_                0           0           0           0
> >   fallback
> >  64kB-mthp_           639,065     841,553       2,204       2,215
> >   swpout_fallback
> >  pgmajfault             2,861       2,924       3,054       3,259
> >  ZSWPOUT-64kB             n/a         n/a     623,451     822,268
> >  SWPOUT-64kB                0           0           0           0
> >  -------------------------------------------------------------------------------
> >
>
> IIUC, the throughput is the sum of throughput of all usemem processes?
>
> One possible issue of usemem test case is the "imbalance" issue.  That
> is, some usemem processes may swap-out/swap-in less, so the score is
> very high; while some other processes may swap-out/swap-in more, so the
> score is very low.  Sometimes, the total score decreases, but the scores
> of usemem processes are more balanced, so that the performance should be
> considered better.  And, in general, we should make usemem score
> balanced among processes via say longer test time.  Can you check this
> in your test results?

Actually, the throughput data listed in the cover letter is the average
over all the usemem processes. Your observation about the "imbalance"
issue is right: some processes see a higher throughput than others, and
the throughputs progressively decrease as the individual processes exit
and print their stats.

Listed below are the stats from two runs of usemem70, one with "sleep 10"
and one with "sleep 30". Both are run with a cgroup memory limit of 40G.
The data is with v7, with 64K folios enabled and zstd as the zswap
compressor.
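The sum/average/stddev rows at the bottom of the table are simply
aggregates over the 70 per-process throughputs. As a rough illustration
(this is not the actual post-processing, just a minimal sketch; it assumes
the per-process KB/s values have already been parsed out of the usemem
output into a list, and the stddev convention, sample vs. population, is
easy to swap):

#!/usr/bin/env python3
# Illustrative only: aggregate per-process usemem throughputs (KB/s)
# into the sum/average/stddev rows shown in the table below.
import statistics

def summarize(throughputs_kbps):
    total = sum(throughputs_kbps)
    average = total / len(throughputs_kbps)
    # Sample standard deviation; use statistics.pstdev() instead if the
    # population convention is preferred.
    stddev = statistics.stdev(throughputs_kbps)
    return total, average, stddev

if __name__ == "__main__":
    # Hypothetical values for illustration, not the real 70-process data:
    example = [181540, 179651, 161546, 142907, 126903]
    total, average, stddev = summarize(example)
    print(f"sum     {total:>12,}")
    print(f"average {average:>12,.0f}")
    print(f"stddev  {stddev:>12,.0f}")

The per-process numbers and their aggregates are: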
-----------------------------------------------
          sleep 10          sleep 30
 Throughput (KB/s) Throughput (KB/s)
-----------------------------------------------
           181,540           191,686
           179,651           191,459
           179,068           188,834
           177,244           187,568
           177,215           186,703
           176,565           185,584
           176,546           185,370
           176,470           185,021
           176,214           184,303
           176,128           184,040
           175,279           183,932
           174,745           180,831
           173,935           179,418
           161,546           168,014
           160,332           167,540
           160,122           167,364
           159,613           167,020
           159,546           166,590
           159,021           166,483
           158,845           166,418
           158,426           166,264
           158,396           166,066
           158,371           165,944
           158,298           165,866
           158,250           165,884
           158,057           165,533
           158,011           165,532
           157,899           165,457
           157,894           165,424
           157,839           165,410
           157,731           165,407
           157,629           165,273
           157,626           164,867
           157,581           164,636
           157,471           164,266
           157,430           164,225
           157,287           163,290
           156,289           153,597
           153,970           147,494
           148,244           147,102
           142,907           146,111
           142,811           145,789
           139,171           141,168
           136,314           140,714
           133,616           140,111
           132,881           139,636
           132,729           136,943
           132,680           136,844
           132,248           135,726
           132,027           135,384
           131,929           135,270
           131,766           134,748
           131,667           134,733
           131,576           134,582
           131,396           134,302
           131,351           134,160
           131,135           134,102
           130,885           134,097
           130,854           134,058
           130,767           134,006
           130,666           133,960
           130,647           133,894
           130,152           133,837
           130,006           133,747
           129,921           133,679
           129,856           133,666
           129,377           133,564
           128,366           133,331
           127,988           132,938
           126,903           132,746
-----------------------------------------------
 sum            10,526,916        10,919,561
 average           150,385           155,994
 stddev             17,551            19,633
-----------------------------------------------
 elapsed time (sec)  24.40             43.66
 sys time (sec)     806.25            766.05
 zswpout        10,008,713        10,008,407
 64K folio
  swpout           623,463           623,629
-----------------------------------------------

As we increase the time for which the allocations are maintained, there
seems to be a slight improvement in throughput, but the variance increases
as well. The processes with lower throughput could be the ones that handle
the memcg being over its limit by doing reclaim, possibly before they can
allocate. Interestingly, the longer test time does seem to reduce the
amount of reclaim (hence the lower sys time), but more 64K large folios
seem to be reclaimed. Could this mean that with the longer test time
(sleep 30), more cold memory residing in large folios is getting reclaimed,
as against memory just relinquished by the exiting processes?

Thanks,
Kanchana

>
> [snip]
>
> --
> Best Regards,
> Huang, Ying