> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Sent: Tuesday, August 27, 2024 11:42 AM
> To: Nhat Pham <nphamcs@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; ryan.roberts@xxxxxxx;
> Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>;
> Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@xxxxxxxxx>
> > Sent: Tuesday, August 27, 2024 8:24 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> > hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; ryan.roberts@xxxxxxx;
> > Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx;
> > akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>;
> > Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> > Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >
> > On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@xxxxxxxxx> wrote:
> > >
> > > > Internally, we often see 1-3 or 1-4 saving ratio (or even more).
> > >
> > > Agree with this as well. In our experiments with other workloads, we
> > > typically see much higher ratios.
> > >
> > > > Probably does not explain everything, but worth double checking -
> > > > could you check with zstd to see if the ratio improves.
> > >
> > > Sure. I gathered ratio and compressed memory footprint data today with
> > > 64K mTHP, the 4G SSD swapfile and different zswap compressors.
> > >
> > > This patch-series and no zswap charging, 64K mTHP:
> > > ---------------------------------------------------------------------------
> > >                  Total          Total        Average     Average      Comp
> > >                  compressed     compression  compressed  compression  ratio
> > >                  length         latency      length      latency
> > >                  bytes          milliseconds bytes       nanoseconds
> > > ---------------------------------------------------------------------------
> > > SSD (no zswap)   1,362,296,832      887,861
> > > lz4              2,610,657,430       55,984       2,055       44,065  1.99
> > > zstd               729,129,528       50,986         565       39,510  7.25
> > > deflate-iaa      1,286,533,438       44,785       1,415       49,252  2.89
> > > ---------------------------------------------------------------------------
> > >
> > > zstd does very well on ratio, as expected.
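> > >
> > > (The "Comp ratio" column above works out to the 4K page size divided
> > > by the average compressed length: 4096/2,055 = 1.99 for lz4,
> > > 4096/565 = 7.25 for zstd, and 4096/1,415 = 2.89 for deflate-iaa.)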
> >
> > Wait. So zstd is displaying a 7-to-1 compression ratio? And has *lower*
> > average latency?
> >
> > Why are we running the benchmark on lz4 again? Sure, there is no free
> > lunch and no compressor that works well on all kinds of data, but lz4's
> > performance here is so bad that it's borderline justifiable to
> > disable/bypass zswap with this kind of compression ratio...
> >
> > Can I ask you to run benchmarking on zstd from now on?
>
> Sure, will do.
>
> > > > >
> > > > > Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> > > > > ------------------------------------------------------------
> > > > >
> > > > > I wanted to take a step back and understand how the mainline
> > > > > v6.11-rc3 handles 4K folios when swapped out to SSD (CONFIG_ZSWAP
> > > > > is off) and when swapped out to ZSWAP. Interestingly, higher
> > > > > swapout activity is observed with 4K folios and v6.11-rc3 (with
> > > > > the debug change to not charge zswap to the cgroup).
> > > > >
> > > > > v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
> > > > >
> > > > > -------------------------------------------------------------
> > > > > SSD (CONFIG_ZSWAP is OFF)     ZSWAP               lz4  lzo-rle
> > > > > -------------------------------------------------------------
> > > > > cgroup memory.events:         cgroup memory.events:
> > > > >
> > > > > low                 0         low               0          0
> > > > > high            5,068         high        321,923    375,116
> > > > > max                 0         max               0          0
> > > > > oom                 0         oom               0          0
> > > > > oom_kill            0         oom_kill          0          0
> > > > > oom_group_kill      0         oom_group_kill    0          0
> > > > > -------------------------------------------------------------
> > > > >
> > > > > SSD (CONFIG_ZSWAP is OFF):
> > > > > --------------------------
> > > > > pswpout            415,709
> > > > > sys time (sec)      301.02
> > > > > Throughput KB/s    155,970
> > > > > memcg_high events    5,068
> > > > > --------------------------
> > > > >
> > > > > ZSWAP                    lz4        lz4        lz4    lzo-rle
> > > > > --------------------------------------------------------------
> > > > > zswpout            1,598,550  1,515,151  1,449,432  1,493,917
> > > > > sys time (sec)        889.36     481.21     581.22     635.75
> > > > > Throughput KB/s       35,176     14,765     20,253     21,407
> > > > > memcg_high events    321,923    412,733    369,976    375,116
> > > > > --------------------------------------------------------------
> > > > >
> > > > > This shows a performance regression of 60% to 195% (in sys time)
> > > > > with zswap as compared to SSD with 4K folios. The higher swapout
> > > > > activity with zswap is seen here too (i.e., this doesn't appear
> > > > > to be mTHP-specific).
> > > > >
> > > > > I verified this to be the case even with the v6.7 kernel, which
> > > > > also showed a 2.3X throughput improvement when we don't charge
> > > > > zswap:
> > > > >
> > > > > ZSWAP lz4                v6.7    v6.7 with no cgroup zswap charge
> > > > > --------------------------------------------------------------------
> > > > > zswpout             1,419,802    1,398,620
> > > > > sys time (sec)          535.4       613.41
> > > >
> > > > systime increases without zswap cgroup charging? That's strange...
> > >
> > > The additional data gathered with v6.11-rc3 (listed below), based on
> > > your suggestion to investigate potential swap.high breaches, should
> > > hopefully provide some explanation.
> > >
> > > > > Throughput KB/s         8,671       20,045
> > > > > memcg_high events     574,046      451,859
> > > >
> > > > So, on the 4K folio setup, even without the cgroup charge, we are
> > > > still seeing:
> > > >
> > > > 1. More zswpout (than observed in SSD)
> > > > 2. 40-50% worse latency - in fact it is worse without zswap cgroup
> > > > charging.
> > > > 3. 100 times the amount of memcg_high events? This is perhaps the
> > > > *strangest* to me. You're already removing zswap cgroup charging,
> > > > then where does this come from? How can we have a memory.high
> > > > violation when zswap does *not* contribute to memory usage?
> > > >
> > > > Is this due to swap limit charging? Do you have a cgroup swap limit?
> > > >
> > > >         mem_high = page_counter_read(&memcg->memory) >
> > > >                    READ_ONCE(memcg->memory.high);
> > > >         swap_high = page_counter_read(&memcg->swap) >
> > > >                     READ_ONCE(memcg->swap.high);
> > > > [...]
> > > >
> > > >         if (mem_high || swap_high) {
> > > >                 /*
> > > >                  * The allocating tasks in this cgroup will need to do
> > > >                  * reclaim or be throttled to prevent further growth
> > > >                  * of the memory or swap footprints.
> > > >                  *
> > > >                  * Target some best-effort fairness between the tasks,
> > > >                  * and distribute reclaim work and delay penalties
> > > >                  * based on how much each task is actually allocating.
> > > >                  */
> > > >                 current->memcg_nr_pages_over_high += batch;
> > > >                 set_notify_resume(current);
> > > >                 break;
> > > >         }
> > >
> > > I don't have a swap.high limit set on the cgroup; it is set to "max".
> > >
> > > I ran experiments with v6.11-rc3, no zswap charging, 4K folios and
> > > different zswap compressors to verify if swap.high is breached with
> > > the 4G SSD swapfile.
> > >
> > > SSD (CONFIG_ZSWAP is OFF):
> > >
> > >                                  SSD        SSD        SSD
> > > ------------------------------------------------------------
> > > pswpout                      415,709  1,032,170    636,582
> > > sys time (sec)                301.02     328.15     306.98
> > > Throughput KB/s              155,970     89,621    122,219
> > > memcg_high events              5,068     15,072      8,344
> > > memcg_swap_high events             0          0          0
> > > memcg_swap_fail events             0          0          0
> > > ------------------------------------------------------------
> > >
> > > ZSWAP                           zstd       zstd       zstd
> > > ----------------------------------------------------------------
> > > zswpout                    1,391,524  1,382,965  1,417,307
> > > sys time (sec)                474.68     568.24     489.80
> > > Throughput KB/s               26,099     23,404    111,115
> > > memcg_high events            335,112    340,335    162,260
> > > memcg_swap_high events             0          0          0
> > > memcg_swap_fail events     1,226,899  5,742,153
> > >   (mem_cgroup_try_charge_swap)
> > > memcg_memory_stat_pgactivate          1,259,547
> > >   (shrink_folio_list)
> > > ----------------------------------------------------------------
> > >
> > > ZSWAP                        lzo-rle    lzo-rle    lzo-rle
> > > -----------------------------------------------------------
> > > zswpout                    1,493,917  1,363,040  1,428,133
> > > sys time (sec)                635.75     498.63     484.65
> > > Throughput KB/s               21,407     23,827     20,237
> > > memcg_high events            375,116    352,814    373,667
> > > memcg_swap_high events             0          0          0
> > > memcg_swap_fail events       715,211
> > > -----------------------------------------------------------
> > >
> > > ZSWAP                            lz4        lz4        lz4        lz4
> > > ---------------------------------------------------------------------
> > > zswpout                    1,378,781  1,598,550  1,515,151  1,449,432
> > > sys time (sec)                495.45     889.36     481.21     581.22
> > > Throughput KB/s               26,248     35,176     14,765     20,253
> > > memcg_high events            347,209    321,923    412,733    369,976
> > > memcg_swap_high events             0          0          0          0
> > > memcg_swap_fail events       580,103          0
> > > ---------------------------------------------------------------------
> > >
> > > ZSWAP                    deflate-iaa  deflate-iaa  deflate-iaa
> > > ----------------------------------------------------------------
> > > zswpout                      380,471    1,440,902    1,397,965
> > > sys time (sec)                329.06       570.77       467.41
> > > Throughput KB/s              283,867       28,403      190,600
> > > memcg_high events              5,551      422,831       28,154
> > > memcg_swap_high events             0            0            0
> > > memcg_swap_fail events             0    2,686,758      438,562
> > > ----------------------------------------------------------------
> >
> > Why are there 3 columns for each of the compressors? Is this different
> > runs of the same workload?
> >
> > And why do some columns have missing cells?
>
> Yes, these are different runs of the same workload. Since there is some
> amount of variance seen in the data, I figured it is best to publish the
> metrics from the individual runs rather than averaging.
>
> Some of these runs were gathered earlier with the same code base;
> however, I wasn't monitoring/logging the memcg_swap_high/memcg_swap_fail
> events at that time.
For those runs, just these two counters have missing column entries; the
rest of the data is still valid.

> > > There are no swap.high memcg events recorded in any of the SSD/zswap
> > > experiments. However, I do see a significant number of memcg_swap_fail
> > > events in some of the zswap runs, for all 3 compressors. This is not
> > > consistent, because there are some runs with 0 memcg_swap_fail for all
> > > compressors.
> > >
> > > There is a possible correlation between memcg_swap_fail events
> > > (/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
> > > events. The root cause appears to be that there are no available swap
> > > slots, memcg_swap_fail is incremented, add_to_swap() fails in
> > > shrink_folio_list(), followed by "activate_locked:" for the folio.
> > > The folio re-activation is recorded in cgroup memory.stat pgactivate
> > > events. The failure to swap out folios due to lack of swap slots could
> > > contribute towards memory.high breaches.
> >
> > Yeah FWIW, that was gonna be my first suggestion. This swapfile size
> > is wayyyy too small...
> >
> > But that said, the link is not clear to me at all. The only thing I
> > can think of is that lz4's performance sucks so badly that it's not
> > saving enough memory, leading to regression. And since it's still
> > taking up swap slots, we cannot use swap either?
>
> The occurrence of memcg_swap_fail events establishes that swap slots
> are not available with 4G of swap space. This causes those 4K folios to
> remain in memory, which can worsen an existing problem with memory.high
> breaches.
>
> However, it is worth noting that this is not the only contributor to
> memcg_high events that still occur without zswap charging. The data shows
> 321,923 occurrences of memcg_high in Col 2 of the lz4 table, which also
> has 0 occurrences of memcg_swap_fail reported in the cgroup stats.
>
> > > However, this is probably not the only cause for either the high # of
> > > memory.high breaches or the over-reclaim with zswap, as seen in the
> > > lz4 data, where the # of memory.high breaches is significant even in
> > > cases where there are no memcg_swap_fails.
> > >
> > > Some observations/questions based on the above 4K folios swapout data:
> > >
> > > 1) There are more memcg_high events as the swapout latency reduces
> > > (i.e. faster swap-write path). This is even without charging zswap
> > > utilization to the cgroup.
> >
> > This is still inexplicable to me. If we are not charging zswap usage,
> > we shouldn't even be triggering the reclaim_high() path, no?
> >
> > I'm curious - can you use bpftrace to track where/when reclaim_high
> > is being called?

Hi Nhat,

Since reclaim_high() is called only in a handful of places, I figured I
would just use debugfs u64 counters to record where it gets called from.
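(For reference, a bpftrace probe along the lines you suggested should show
the same call sites, assuming reclaim_high() is not inlined and is visible
to kprobes, e.g.:

  bpftrace -e 'kprobe:reclaim_high { @callers[kstack] = count(); }'

I haven't verified that exact one-liner on this setup, so please treat it
as a sketch; the debugfs counters below capture the same information over
a full run.)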
These are the places where I increment the debugfs counters:

include/linux/resume_user_mode.h:
---------------------------------

diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index e0135e0adae0..382f5469e9a2 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -24,6 +24,7 @@ static inline void set_notify_resume(struct task_struct *task)
 		kick_process(task);
 }
 
+extern u64 hoh_userland;
 
 /**
  * resume_user_mode_work - Perform work before returning to user mode
@@ -56,6 +57,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
 	}
 #endif
 
+	++hoh_userland;
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();

mm/memcontrol.c:
----------------

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f29157288b7d..6738bb670a78 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1910,9 +1910,12 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 	return nr_reclaimed;
 }
 
+extern u64 rec_high_hwf;
+
 static void high_work_func(struct work_struct *work)
 {
 	struct mem_cgroup *memcg;
 
+	++rec_high_hwf;
 	memcg = container_of(work, struct mem_cgroup, high_work);
 	reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
@@ -2055,6 +2058,8 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 	return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
 }
 
+extern u64 rec_high_hoh;
+
 /*
  * Reclaims memory over the high limit. Called directly from
  * try_charge() (context permitting), as well as from the userland
@@ -2097,6 +2102,7 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	 * memory.high is currently batched, whereas memory.max and the page
 	 * allocator run every time an allocation is made.
 	 */
+	++rec_high_hoh;
 	nr_reclaimed = reclaim_high(memcg,
 				    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
 				    gfp_mask);
@@ -2153,6 +2159,8 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	css_put(&memcg->css);
 }
 
+extern u64 hoh_trycharge;
+
 int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		     unsigned int nr_pages)
 {
@@ -2344,8 +2352,10 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
 	    !(current->flags & PF_MEMALLOC) &&
-	    gfpflags_allow_blocking(gfp_mask))
+	    gfpflags_allow_blocking(gfp_mask)) {
+		++hoh_trycharge;
 		mem_cgroup_handle_over_high(gfp_mask);
+	}
 
 	return 0;
 }
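The counter definitions and the debugfs plumbing are not shown in the diff
above; they are just plain u64s exposed read-only through debugfs, roughly
along these lines (a minimal sketch only -- the directory and file names
here are illustrative, not necessarily the exact ones I used):

  #include <linux/debugfs.h>
  #include <linux/init.h>
  #include <linux/types.h>

  /* Call-site counters incremented by the debug patch above. */
  u64 hoh_userland;
  u64 hoh_trycharge;
  u64 rec_high_hoh;
  u64 rec_high_hwf;

  static int __init reclaim_high_counters_init(void)
  {
  	/* Read back with e.g. cat /sys/kernel/debug/reclaim_high_dbg/hoh_userland */
  	struct dentry *dir = debugfs_create_dir("reclaim_high_dbg", NULL);

  	debugfs_create_u64("hoh_userland", 0444, dir, &hoh_userland);
  	debugfs_create_u64("hoh_trycharge", 0444, dir, &hoh_trycharge);
  	debugfs_create_u64("rec_high_hoh", 0444, dir, &rec_high_hoh);
  	debugfs_create_u64("rec_high_hwf", 0444, dir, &rec_high_hwf);
  	return 0;
  }
  late_initcall(reclaim_high_counters_init);

The increments in the debug patch are plain (non-atomic) ++, which is fine
for the rough per-call-site counts used here.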
I reverted my debug changes for "zswap to not charge the cgroup" when I ran
this next set of experiments, which record the # of times and the locations
where reclaim_high() is called. zstd is the compressor I have configured
for both ZSWAP and ZRAM.

6.11-rc3 mainline, 176Gi ZRAM backing for ZSWAP, zstd, 64K mTHP:
----------------------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
  high            112,910

  hoh_userland    128,835
  hoh_trycharge         0
  rec_high_hoh    113,079
  rec_high_hwf          0

6.11-rc3 mainline, 4G SSD backing for ZSWAP, zstd, 64K mTHP:
------------------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
  high              4,693

  hoh_userland     14,069
  hoh_trycharge         0
  rec_high_hoh      4,694
  rec_high_hwf          0

ZSWAP-mTHP, 176Gi ZRAM backing for ZSWAP, zstd, 64K mTHP:
---------------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
  high            139,495

  hoh_userland    156,628
  hoh_trycharge         0
  rec_high_hoh    140,039
  rec_high_hwf          0

ZSWAP-mTHP, 4G SSD backing for ZSWAP, zstd, 64K mTHP:
-----------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
  high             20,427
/sys/fs/cgroup/iax/memory.swap.events:
  fail             20,856

  hoh_userland     31,346
  hoh_trycharge         0
  rec_high_hoh     20,513
  rec_high_hwf          0

This shows that in all cases, reclaim_high() is called only from the
return path to user mode after handling a page-fault.

Thanks,
Kanchana

> I had confirmed earlier with counters that all calls to reclaim_high()
> were from include/linux/resume_user_mode.h::resume_user_mode_work().
> I will confirm this with zstd and bpftrace and share.
>
> Thanks,
> Kanchana
>
> > > 2) There appears to be a direct correlation between a higher # of
> > > memcg_swap_fail events, an increase in memcg_high breaches, and a
> > > reduction in usemem throughput. This, combined with the observation
> > > in (1), suggests that with a faster compressor we need more swap
> > > slots, which increases the probability of running out of swap slots
> > > with the 4G SSD backing device.
> > >
> > > 3) Could the data shared earlier on the reduction in memcg_high
> > > breaches with 64K mTHP swapout provide some more clues, if we agree
> > > with (1) and (2):
> > >
> > > "Interestingly, the # of memcg_high events reduces significantly with
> > > 64K mTHP as compared to the above 4K memcg_high events data, when
> > > tested with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656
> > > (ZSWAP-mTHP)."
> > >
> > > 4) In the case of each zswap compressor, there are some runs that go
> > > through with 0 memcg_swap_fail events. These runs generally have
> > > fewer memcg_high breaches and better sys time/throughput.
> > >
> > > 5) For a given swap setup, there is some amount of variance in
> > > sys time for this workload.
> > >
> > > 6) All this suggests that the primary root cause is the concurrency
> > > setup, where there could be randomness between runs as to the # of
> > > processes that observe the memory.high breach, due to other factors
> > > such as the availability of swap slots for allocation.
> > >
> > > To summarize, I believe the root cause is the 4G SSD swapfile running
> > > out of swap slots, plus anomalous over-reclaim behavior when 70
> > > concurrent processes are working with the 60G memory limit while
> > > trying to allocate 1G each, with randomness in which processes react
> > > to the breach.
> > >
> > > The cgroup zswap charging exacerbates this situation, but is not a
> > > problem in and of itself.
> > >
> > > Nhat, as you pointed out, this is somewhat of an unrealistic scenario
> > > that doesn't seem to indicate any specific problems to be solved,
> > > other than the temporary cgroup zswap double-charging.
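> > >
> > > (To put rough numbers on that: 70 processes allocating 1G each is
> > > ~70G of anonymous memory against a 60G memory.high, so on the order
> > > of 10G needs to be swapped out at any given time, while the 4G
> > > swapfile provides well under half the required swap slots - which is
> > > consistent with the add_to_swap() failures above.)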
> > >
> > > Would it be fair to evaluate this patch-series based on a more
> > > realistic swapfile configuration backed by 176G of ZRAM, for which I
> > > had shared the data in v2? There weren't any problems with swap slot
> > > availability or any anomalies that I can think of with this setup,
> > > other than the fact that the "Before" and "After" sys times could not
> > > be directly compared for 2 key reasons:
> > >
> > > - ZRAM compressed data is not charged to the cgroup, similar to SSD.
> > > - ZSWAP compressed data is charged to the cgroup.
> >
> > Yeah that's a bit unfair still. Wild idea, but what if we compare
> > SSD without zswap (or SSD with zswap, but without this patch series so
> > that mTHP are not zswapped) vs. zswap-on-zram (i.e., with a backing
> > swapfile on a zram block device)?
> >
> > It is stupid, I know. But let's take advantage of the fact that zram
> > is not charged to the cgroup, pretending that its memory footprint is
> > empty?
> >
> > I don't know how zram works though, so my apologies if it's a stupid
> > suggestion :)