> -----Original Message-----
> From: Huang, Ying <ying.huang@xxxxxxxxx>
> Sent: Friday, September 20, 2024 2:29 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx; nphamcs@xxxxxxxxx;
> chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v6 0/3] mm: ZSWAP swap-out of mTHP folios
>
> "Sridhar, Kanchana P" <kanchana.p.sridhar@xxxxxxxxx> writes:
>
> [snip]
>
> >
> > Thanks, these are good points. I ran this experiment with mm-unstable
> > 9-17-2024, commit 248ba8004e76eb335d7e6079724c3ee89a011389.
> >
> > Data is based on average of 3 runs of the vm-scalability "usemem" test.
> >
> > 4G SSD backing zswap, each process sleeps before exiting
> > ========================================================
> >
> > 64KB mTHP (cgroup memory.high set to 60G, no swap limit):
> > =========================================================
> > CONFIG_THP_SWAP=Y
> > Sapphire Rapids server with 503 GiB RAM and 4G SSD swap backing device
> > for zswap.
> >
> > Experiment 1: Each process sleeps for 0 sec after allocating memory
> > (usemem --init-time -w -O --sleep 0 -n 70 1g):
> >
> > -------------------------------------------------------------------------------
> >                      mm-unstable 9-17-2024    zswap-mTHP v6         Change wrt
> >                      Baseline                                       Baseline
> >                      "before"                 "after"               (sleep 0)
> > -------------------------------------------------------------------------------
> > ZSWAP compressor        zstd   deflate-          zstd   deflate-    zstd deflate-
> >                                iaa                      iaa              iaa
> > -------------------------------------------------------------------------------
> > Throughput (KB/s)    296,684    274,207       359,722    390,162     21%     42%
> > sys time (sec)         92.67      93.33        251.06     237.56   -171%   -155%
> > memcg_high             3,503      3,769        44,425     27,154
> > memcg_swap_fail            0          0       115,814    141,936
> > pswpin                    17          0             0          0
> > pswpout              370,853    393,232             0          0
> > zswpin                   693        123           666        667
> > zswpout                1,484        123     1,366,680  1,199,645
> > thp_swpout                 0          0             0          0
> > thp_swpout_fallback        0          0             0          0
> > pgmajfault             3,384      2,951         3,656      3,468
> > ZSWPOUT-64kB             n/a        n/a        82,940     73,121
> > SWPOUT-64kB           23,178     24,577             0          0
> > -------------------------------------------------------------------------------
> >
> >
> > Experiment 2: Each process sleeps for 10 sec after allocating memory
> > (usemem --init-time -w -O --sleep 10 -n 70 1g):
> >
> > -------------------------------------------------------------------------------
> >                      mm-unstable 9-17-2024    zswap-mTHP v6         Change wrt
> >                      Baseline                                       Baseline
> >                      "before"                 "after"               (sleep 10)
> > -------------------------------------------------------------------------------
> > ZSWAP compressor        zstd   deflate-          zstd   deflate-    zstd deflate-
> >                                iaa                      iaa              iaa
> > -------------------------------------------------------------------------------
> > Throughput (KB/s)     86,744     93,730       157,528    113,110     82%     21%
> > sys time (sec)        308.87     315.29        477.55     629.98    -55%   -100%
>
> What is the elapsed time for all cases?
Sure, listed below is the data for both experiments, with elapsed time added
as the second data row of each table:

4G SSD backing zswap, each process sleeps before exiting
========================================================

64KB mTHP (cgroup memory.high set to 60G, no swap limit):
=========================================================
CONFIG_THP_SWAP=Y
Sapphire Rapids server with 503 GiB RAM and 4G SSD swap backing device
for zswap.

(For reference, a rough sketch of the corresponding setup commands is
appended at the end of this mail.)

Experiment 1: Each process sleeps for 0 sec after allocating memory
(usemem --init-time -w -O --sleep 0 -n 70 1g):

-------------------------------------------------------------------------------
                     mm-unstable 9-17-2024    zswap-mTHP v6         Change wrt
                     Baseline                                       Baseline
                     "before"                 "after"               (sleep 0)
-------------------------------------------------------------------------------
ZSWAP compressor        zstd   deflate-          zstd   deflate-    zstd deflate-
                               iaa                      iaa              iaa
-------------------------------------------------------------------------------
Throughput (KB/s)    296,684    274,207       359,722    390,162     21%     42%
elapsed time (sec)      4.91       4.80          4.42       5.08     10%     -6%
sys time (sec)         92.67      93.33        251.06     237.56   -171%   -155%
memcg_high             3,503      3,769        44,425     27,154
memcg_swap_fail            0          0       115,814    141,936
pswpin                    17          0             0          0
pswpout              370,853    393,232             0          0
zswpin                   693        123           666        667
zswpout                1,484        123     1,366,680  1,199,645
thp_swpout                 0          0             0          0
thp_swpout_fallback        0          0             0          0
pgmajfault             3,384      2,951         3,656      3,468
ZSWPOUT-64kB             n/a        n/a        82,940     73,121
SWPOUT-64kB           23,178     24,577             0          0
-------------------------------------------------------------------------------


Experiment 2: Each process sleeps for 10 sec after allocating memory
(usemem --init-time -w -O --sleep 10 -n 70 1g):

-------------------------------------------------------------------------------
                     mm-unstable 9-17-2024    zswap-mTHP v6         Change wrt
                     Baseline                                       Baseline
                     "before"                 "after"               (sleep 10)
-------------------------------------------------------------------------------
ZSWAP compressor        zstd   deflate-          zstd   deflate-    zstd deflate-
                               iaa                      iaa              iaa
-------------------------------------------------------------------------------
Throughput (KB/s)     86,744     93,730       157,528    113,110     82%     21%
elapsed time (sec)     30.24      31.73         33.39      32.50    -10%     -2%
sys time (sec)        308.87     315.29        477.55     629.98    -55%   -100%
memcg_high           169,450    188,700       143,691    177,887
memcg_swap_fail   10,131,859  9,740,646    18,738,715 19,528,110
pswpin                    17         16             0          0
pswpout            1,154,779  1,210,485             0          0
zswpin                   711        659         1,016        736
zswpout               70,212     50,128     1,235,560  1,275,917
thp_swpout                 0          0             0          0
thp_swpout_fallback        0          0             0          0
pgmajfault             6,120      6,291         8,789      6,474
ZSWPOUT-64kB             n/a        n/a        67,587     68,912
SWPOUT-64kB           72,174     75,655             0          0
-------------------------------------------------------------------------------

>
> > memcg_high           169,450    188,700       143,691    177,887
> > memcg_swap_fail   10,131,859  9,740,646    18,738,715 19,528,110
> > pswpin                    17         16             0          0
> > pswpout            1,154,779  1,210,485             0          0
> > zswpin                   711        659         1,016        736
> > zswpout               70,212     50,128     1,235,560  1,275,917
> > thp_swpout                 0          0             0          0
> > thp_swpout_fallback        0          0             0          0
> > pgmajfault             6,120      6,291         8,789      6,474
> > ZSWPOUT-64kB             n/a        n/a        67,587     68,912
> > SWPOUT-64kB           72,174     75,655             0          0
> > -------------------------------------------------------------------------------
> >
> >
> > Conclusions from the experiments:
> > =================================
> > 1) zswap-mTHP improves throughput as compared to the baseline, for zstd
> > and deflate-iaa.
> >
> > 2) Yosry's theory is proved correct in the 4G constrained swap setup.
> > When the processes are constrained to sleep 10 sec after allocating
> > memory, thereby keeping the memory allocated longer, the "Baseline" or
> > "before" with mTHP getting stored in SSD shows a degradation of 71% in
> > throughput and 238% in sys time, as compared to the "Baseline" with
>
> Higher sys time may come from compression with CPU vs. disk writing?
>

Here, I was comparing the "before" sys times between the "sleep 10" and
"sleep 0" experiments, where mTHPs get stored to SSD. I was trying to
understand the increase in "before" sys time with "sleep 10", and my analysis
was that it could be due to the following cycle of events: memory remains
allocated longer, so any reclaimed memory per process is mostly cold and is
not paged back in (17 pswpin for zstd); swap slots are therefore not released,
leading to swap slot allocation failures; folios on the reclaim list are
returned to the active list; this results in more swapout activity in
"before"/"sleep 10" (1,224,991 swapouts with zstd) than in "before"/"sleep 0"
(372,337 with zstd), and hence more sys time in "before"/"sleep 10".

IOW, my takeaway from just the "before" experiments ("sleep 10" vs. "sleep 0")
was that the higher swapout activity results in increased sys time.

The zswap-mTHP "after" experiments don't show significantly higher successful
swapout activity going from "sleep 0" to "sleep 10". This is not to say that
the above cycle of events does not occur here as well, as indicated by the
higher memcg_swap_fail counts, which signify attempted swapouts. However, the
zswap-mTHP "after" sys time increase going from "sleep 0" to "sleep 10" is not
as bad as that for "before":

"before" = 4G SSD mTHP
"after"  = zswap-mTHP
-------------------------------------------------------------------------
                            mm-unstable 9-17-2024     zswap-mTHP v6
                            Baseline "before"         "after"
-------------------------------------------------------------------------
ZSWAP compressor              zstd   deflate-iaa      zstd   deflate-iaa
-------------------------------------------------------------------------
"sleep 0" sys time (sec)     92.67      93.33        251.06     237.56
"sleep 10" sys time (sec)   308.87     315.29        477.55     629.98
-------------------------------------------------------------------------
"sleep 10" sys time          -233%      -238%          -90%      -165%
 vs. "sleep 0"
-------------------------------------------------------------------------

(A short sketch of how these percentages are computed follows the quoted
conclusions below.)

> > sleep 0 that benefits from serialization of disk IO not allowing all
> > processes to allocate memory at the same time.
> >
> > 3) In the 4G SSD "sleep 0" case, zswap-mTHP shows an increase in sys time
> > due to the cgroup charging and consequently higher memcg.high breaches
> > and swapout activity.
> >
> > However, the "sleep 10" case's sys time seems to degrade less, and the
> > memcg.high breaches and swapout activity are almost similar between the
> > before/after (confirming Yosry's hypothesis). Further, the
> > memcg_swap_fail activity in the "after" scenario is almost 2X that of
> > the "before". This indicates failure to obtain swap offsets, resulting
> > in the folio remaining active in memory.
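
To make the "sleep 10" vs. "sleep 0" row in my sys time table above explicit:
those percentages are plain relative increases in sys time, reported with a
negative sign (consistent with the earlier tables, where a sys time increase
shows up as a negative change). A quick awk sketch that just re-does the
arithmetic from the table:

  awk 'BEGIN {
          split("92.67 93.33 251.06 237.56", s0);     # "sleep 0" sys times
          split("308.87 315.29 477.55 629.98", s10);  # "sleep 10" sys times
          for (i = 1; i <= 4; i++)
                  printf "-%.0f%%\n", (s10[i] - s0[i]) / s0[i] * 100;
  }'

prints -233%, -238%, -90% and -165%, matching the table. The "Change wrt
Baseline" columns in the experiment tables follow the same convention:
improvements are positive and regressions negative, relative to the "before"
numbers.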
> >
> > I tried to better understand this through the 64k mTHP swpout_fallback
> > stats in the "sleep 10" zstd experiments:
> >
> > --------------------------------------------------------------
> >                                         "before"      "after"
> > --------------------------------------------------------------
> > 64k mTHP swpout_fallback                 627,308      897,407
> > 64k folio swapouts                        72,174       67,587
> > [p|z]swpout events due to 64k mTHP     1,154,779    1,081,397
> > 4k folio swapouts                         70,212      154,163
> > --------------------------------------------------------------
> >
> > The data indicates a higher # of 64k folio swpout_fallback with
> > zswap-mTHP, which correlates with the higher memcg_swap_fail counts and
> > 4k folio swapouts with zswap-mTHP. Could the root cause be fragmentation
> > of the swap space due to zswap swapout being faster than SSD swapout?
> >
> > [snip]
>
> --
> Best Regards,
> Huang, Ying
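
P.S.: In case it helps with recreating a similar configuration, here is a
minimal sketch of the knobs involved. It is only a sketch: it assumes cgroup
v2 mounted at /sys/fs/cgroup, a 4G swap partition at /dev/nvme0n1p2 (a
placeholder device name), and a kernel built with CONFIG_ZSWAP=y and
CONFIG_THP_SWAP=y; paths and values may need adjusting for a given setup.

  # zswap enabled, with the compressor under test (zstd or deflate-iaa)
  echo 1 > /sys/module/zswap/parameters/enabled
  echo zstd > /sys/module/zswap/parameters/compressor

  # enable 64kB anonymous mTHP
  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled

  # 4G SSD swap partition backing zswap (placeholder device name)
  mkswap /dev/nvme0n1p2
  swapon /dev/nvme0n1p2

  # memory-constrained cgroup for the workload
  mkdir /sys/fs/cgroup/usemem-test
  echo 60G > /sys/fs/cgroup/usemem-test/memory.high

  # run the vm-scalability workload in that cgroup:
  # 70 processes x 1g each, sleeping 0 (or 10) sec after allocating
  echo $$ > /sys/fs/cgroup/usemem-test/cgroup.procs
  usemem --init-time -w -O --sleep 0 -n 70 1g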