RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

"Sridhar, Kanchana P" <kanchana.p.sridhar@xxxxxxxxx> · Wed, 28 Aug 2024 07:24:18 +0000

> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Sent: Tuesday, August 27, 2024 11:42 AM
> To: Nhat Pham <nphamcs@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; ryan.roberts@xxxxxxx;
> Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>;
> Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> 
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@xxxxxxxxx>
> > Sent: Tuesday, August 27, 2024 8:24 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> > hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; ryan.roberts@xxxxxxx;
> > Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@linux-
> > foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K
> > <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >
> > On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@xxxxxxxxx> wrote:
> > >
> > > > Internally, we often see 1-3 or 1-4 saving ratio (or even more).
> > >
> > > Agree with this as well. In our experiments with other workloads, we
> > > typically see much higher ratios.
> > >
> > > >
> > > > Probably does not explain everything, but worth double checking -
> > > > could you check with zstd to see if the ratio improves.
> > >
> > > Sure. I gathered ratio and compressed memory footprint data today with
> > > 64K mTHP, the 4G SSD swapfile and different zswap compressors.
> > >
> > >  This patch-series and no zswap charging, 64K mTHP:
> > > ---------------------------------------------------------------------------
> > >                        Total         Total     Average      Average   Comp
> > >                   compressed   compression  compressed  compression  ratio
> > >                       length       latency      length      latency
> > >                        bytes  milliseconds       bytes  nanoseconds
> > > ---------------------------------------------------------------------------
> > > SSD (no zswap) 1,362,296,832       887,861
> > > lz4            2,610,657,430        55,984       2,055      44,065    1.99
> > > zstd             729,129,528        50,986         565      39,510    7.25
> > > deflate-iaa    1,286,533,438        44,785       1,415      49,252    2.89
> > > ---------------------------------------------------------------------------
> > >
> > > zstd does very well on ratio, as expected.
> >
> > Wait. So zstd is displaying 7-to-1 compression ratio? And has *lower*
> > average latency?
> >
> > Why are we running benchmark on lz4 again? Sure there is no free lunch
> > and no compressor that works well on all kind of data, but lz4's
> > performance here is so bad that it's borderline justifiable to
> > disable/bypass zswap with this kind of compresison ratio...
> >
> > Can I ask you to run benchmarking on zstd from now on?
> 
> Sure, will do.
> 
> >
> > >
> > > >
> > > > >
> > > > >
> > > > >  Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> > > > >  ------------------------------------------------------------
> > > > >
> > > > >  I wanted to take a step back and understand how the mainline v6.11-
> > rc3
> > > > >  handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off)
> and
> > > > when
> > > > >  swapped out to ZSWAP. Interestingly, higher swapout activity is
> > observed
> > > > >  with 4K folios and v6.11-rc3 (with the debug change to not charge
> > zswap to
> > > > >  cgroup).
> > > > >
> > > > >  v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
> > > > >
> > > > >  -------------------------------------------------------------
> > > > >  SSD (CONFIG_ZSWAP is OFF)       ZSWAP          lz4    lzo-rle
> > > > >  -------------------------------------------------------------
> > > > >  cgroup memory.events:           cgroup memory.events:
> > > > >
> > > > >  low                 0           low              0          0
> > > > >  high            5,068           high       321,923    375,116
> > > > >  max                 0           max              0          0
> > > > >  oom                 0           oom              0          0
> > > > >  oom_kill            0           oom_kill         0          0
> > > > >  oom_group_kill      0           oom_group_kill   0          0
> > > > >  -------------------------------------------------------------
> > > > >
> > > > >  SSD (CONFIG_ZSWAP is OFF):
> > > > >  --------------------------
> > > > >  pswpout            415,709
> > > > >  sys time (sec)      301.02
> > > > >  Throughput KB/s    155,970
> > > > >  memcg_high events    5,068
> > > > >  --------------------------
> > > > >
> > > > >
> > > > >  ZSWAP                  lz4         lz4         lz4     lzo-rle
> > > > >  --------------------------------------------------------------
> > > > >  zswpout          1,598,550   1,515,151   1,449,432   1,493,917
> > > > >  sys time (sec)      889.36      481.21      581.22      635.75
> > > > >  Throughput KB/s     35,176      14,765      20,253      21,407
> > > > >  memcg_high events  321,923     412,733     369,976     375,116
> > > > >  --------------------------------------------------------------
> > > > >
> > > > >  This shows that there is a performance regression of -60% to -195%
> > with
> > > > >  zswap as compared to SSD with 4K folios. The higher swapout activity
> > with
> > > > >  zswap is seen here too (i.e., this doesn't appear to be mTHP-specific).
> > > > >
> > > > >  I verified this to be the case even with the v6.7 kernel, which also
> > > > >  showed a 2.3X throughput improvement when we don't charge
> zswap:
> > > > >
> > > > >  ZSWAP lz4                 v6.7      v6.7 with no cgroup zswap charge
> > > > >  --------------------------------------------------------------------
> > > > >  zswpout              1,419,802       1,398,620
> > > > >  sys time (sec)           535.4          613.41
> > > >
> > > > systime increases without zswap cgroup charging? That's strange...
> > >
> > > Additional data gathered with v6.11-rc3 (listed below) based on your
> > suggestion
> > > to investigate potential swap.high breaches should hopefully provide
> some
> > > explanation.
> > >
> > > >
> > > > >  Throughput KB/s          8,671          20,045
> > > > >  memcg_high events      574,046         451,859
> > > >
> > > > So, on 4k folio setup, even without cgroup charge, we are still seeing:
> > > >
> > > > 1. More zswpout (than observed in SSD)
> > > > 2. 40-50% worse latency - in fact it is worse without zswap cgroup
> > charging.
> > > > 3. 100 times the amount of memcg_high events? This is perhaps the
> > > > *strangest* to me. You're already removing zswap cgroup charging,
> then
> > > > where does this comes from? How can we have memory.high violation
> > when
> > > > zswap does *not* contribute to memory usage?
> > > >
> > > > Is this due to swap limit charging? Do you have a cgroup swap limit?
> > > >
> > > > mem_high = page_counter_read(&memcg->memory) >
> > > >            READ_ONCE(memcg->memory.high);
> > > > swap_high = page_counter_read(&memcg->swap) >
> > > >            READ_ONCE(memcg->swap.high);
> > > > [...]
> > > >
> > > > if (mem_high || swap_high) {
> > > >     /*
> > > >     * The allocating tasks in this cgroup will need to do
> > > >     * reclaim or be throttled to prevent further growth
> > > >     * of the memory or swap footprints.
> > > >     *
> > > >     * Target some best-effort fairness between the tasks,
> > > >     * and distribute reclaim work and delay penalties
> > > >     * based on how much each task is actually allocating.
> > > >     */
> > > >     current->memcg_nr_pages_over_high += batch;
> > > >     set_notify_resume(current);
> > > >     break;
> > > > }
> > > >
> > >
> > > I don't have a swap.high limit set on the cgroup; it is set to "max".
> > >
> > > I ran experiments with v6.11-rc3, no zswap charging, 4K folios and
> different
> > > zswap compressors to verify if swap.high is breached with the 4G SSD
> > swapfile.
> > >
> > >  SSD (CONFIG_ZSWAP is OFF):
> > >
> > >                                 SSD          SSD          SSD
> > >  ------------------------------------------------------------
> > >  pswpout                    415,709    1,032,170      636,582
> > >  sys time (sec)              301.02       328.15       306.98
> > >  Throughput KB/s            155,970       89,621      122,219
> > >  memcg_high events            5,068       15,072        8,344
> > >  memcg_swap_high events           0            0            0
> > >  memcg_swap_fail events           0            0            0
> > >  ------------------------------------------------------------
> > >
> > >  ZSWAP                               zstd         zstd       zstd
> > >  ----------------------------------------------------------------
> > >  zswpout                        1,391,524    1,382,965  1,417,307
> > >  sys time (sec)                    474.68       568.24     489.80
> > >  Throughput KB/s                   26,099       23,404    111,115
> > >  memcg_high events                335,112      340,335    162,260
> > >  memcg_swap_high events                 0            0          0
> > >  memcg_swap_fail events         1,226,899    5,742,153
> > >   (mem_cgroup_try_charge_swap)
> > >  memcg_memory_stat_pgactivate   1,259,547
> > >   (shrink_folio_list)
> > >  ----------------------------------------------------------------
> > >
> > >  ZSWAP                      lzo-rle      lzo-rle     lzo-rle
> > >  -----------------------------------------------------------
> > >  zswpout                  1,493,917    1,363,040   1,428,133
> > >  sys time (sec)              635.75       498.63      484.65
> > >  Throughput KB/s             21,407       23,827      20,237
> > >  memcg_high events          375,116      352,814     373,667
> > >  memcg_swap_high events           0            0           0
> > >  memcg_swap_fail events     715,211
> > >  -----------------------------------------------------------
> > >
> > >  ZSWAP                         lz4         lz4        lz4          lz4
> > >  ---------------------------------------------------------------------
> > >  zswpout                 1,378,781   1,598,550   1,515,151   1,449,432
> > >  sys time (sec)             495.45      889.36      481.21      581.22
> > >  Throughput KB/s            26,248      35,176      14,765      20,253
> > >  memcg_high events         347,209     321,923     412,733     369,976
> > >  memcg_swap_high events          0           0           0           0
> > >  memcg_swap_fail events    580,103           0
> > >  ---------------------------------------------------------------------
> > >
> > >  ZSWAP                  deflate-iaa   deflate-iaa    deflate-iaa
> > >  ----------------------------------------------------------------
> > >  zswpout                    380,471     1,440,902      1,397,965
> > >  sys time (sec)              329.06        570.77         467.41
> > >  Throughput KB/s            283,867        28,403        190,600
> > >  memcg_high events            5,551       422,831         28,154
> > >  memcg_swap_high events           0             0              0
> > >  memcg_swap_fail events           0     2,686,758        438,562
> > >  ----------------------------------------------------------------
> >
> > Why are there 3 columns for each of the compressors? Is this different
> > runs of the same workload?
> >
> > And why do some columns have missing cells?
> 
> Yes, these are different runs of the same workload. Since there is some
> amount of variance seen in the data, I figured it is best to publish the
> metrics from the individual runs rather than averaging.
> 
> Some of these runs were gathered earlier with the same code base,
> however, I wasn't monitoring/logging the
> memcg_swap_high/memcg_swap_fail
> events at that time. For those runs, just these two counters have missing
> column entries; the rest of the data is still valid.
> 
> >
> > >
> > > There are no swap.high memcg events recorded in any of the SSD/zswap
> > >  experiments. However, I do see significant number of memcg_swap_fail
> > >  events in some of the zswap runs, for all 3 compressors. This is not
> > >  consistent, because there are some runs with 0 memcg_swap_fail for all
> > >  compressors.
> > >
> > >  There is a possible co-relation between memcg_swap_fail events
> > >  (/sys/fs/cgroup/test/memory.swap.events) and the high # of
> memcg_high
> > >  events. The root-cause appears to be that there are no available swap
> > >  slots, memcg_swap_fail is incremented, add_to_swap() fails in
> > >  shrink_folio_list(), followed by "activate_locked:" for the folio.
> > >  The folio re-activation is recorded in cgroup memory.stat pgactivate
> > >  events. The failure to swap out folios due to lack of swap slots could
> > >  contribute towards memory.high breaches.
> >
> > Yeah FWIW, that was gonna be my first suggestion. This swapfile size
> > is wayyyy too small...
> >
> > But that said, the link is not clear to me at all. The only thing I
> > can think of is lz4's performance sucks so bad that it's not saving
> > enough memory, leading to regression. And since it's still taking up
> > swap slot, we cannot use swap either?
> 
> The occurrence of memcg_swap_fail events establishes that swap slots
> are not available with 4G of swap space. This causes those 4K folios to
> remain in memory, which can worsen an existing problem with memory.high
> breaches.
> 
> However, it is worth noting that this is not the only contributor to
> memcg_high events that still occur without zswap charging. The data shows
> 321,923 occurrences of memcg_high in Col 2 of the lz4 table, that also has
> 0 occurrences of memcg_swap_fail reported in the cgroup stats.
> 
> >
> > >
> > >  However, this is probably not the only cause for either the high # of
> > >  memory.high breaches or the over-reclaim with zswap, as seen in the lz4
> > >  data where the memory.high is significant even in cases where there are
> no
> > >  memcg_swap_fails.
> > >
> > > Some observations/questions based on the above 4K folios swapout data:
> > >
> > > 1) There are more memcg_high events as the swapout latency reduces
> > >    (i.e. faster swap-write path). This is even without charging zswap
> > >    utilization to the cgroup.
> >
> > This is still inexplicable to me. If we are not charging zswap usage,
> > we shouldn't even be triggering the reclaim_high() path, no?
> >
> > I'm curious - can you use bpftrace to tracks where/when reclaim_high
> > is being called?

Hi Nhat,

Since reclaim_high() is called only in a handful of places, I figured I
would just use debugfs u64 counters to record where it gets called from.

These are the places where I increment the debugfs counters:

include/linux/resume_user_mode.h:
---------------------------------

diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index e0135e0adae0..382f5469e9a2 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -24,6 +24,7 @@ static inline void set_notify_resume(struct task_struct *task)
 		kick_process(task);
 }
 
+extern u64 hoh_userland;
 
 /**
  * resume_user_mode_work - Perform work before returning to user mode
@@ -56,6 +57,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
 	}
 #endif
 
+	++hoh_userland;
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();
 

mm/memcontrol.c:
----------------
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f29157288b7d..6738bb670a78 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1910,9 +1910,12 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 	return nr_reclaimed;
 }
 
+extern u64 rec_high_hwf;
+
 static void high_work_func(struct work_struct *work)
 {
 	struct mem_cgroup *memcg;
+	++rec_high_hwf;
 
 	memcg = container_of(work, struct mem_cgroup, high_work);
 	reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
@@ -2055,6 +2058,8 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 	return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
 }
 
+extern u64 rec_high_hoh;
+
 /*
  * Reclaims memory over the high limit. Called directly from
  * try_charge() (context permitting), as well as from the userland
@@ -2097,6 +2102,7 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	 * memory.high is currently batched, whereas memory.max and the page
 	 * allocator run every time an allocation is made.
 	 */
+	++rec_high_hoh;
 	nr_reclaimed = reclaim_high(memcg,
 				    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
 				    gfp_mask);
@@ -2153,6 +2159,8 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	css_put(&memcg->css);
 }
 
+extern u64 hoh_trycharge;
+
 int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		     unsigned int nr_pages)
 {
@@ -2344,8 +2352,10 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
 	    !(current->flags & PF_MEMALLOC) &&
-	    gfpflags_allow_blocking(gfp_mask))
+	    gfpflags_allow_blocking(gfp_mask)) {
+		++hoh_trycharge;
 		mem_cgroup_handle_over_high(gfp_mask);
+	}
 	return 0;
 }
 

I reverted my debug changes for "zswap to not charge cgroup" when I ran
these next set of experiments that record the # of times and locations
where reclaim_high() is called.

zstd is the compressor I have configured for both ZSWAP and ZRAM.

 6.11-rc3 mainline, 176Gi ZRAM backing for ZSWAP, zstd, 64K mTHP:
 ----------------------------------------------------------------
 /sys/fs/cgroup/iax/memory.events:
 high 112,910

 hoh_userland 128,835
 hoh_trycharge 0
 rec_high_hoh 113,079
 rec_high_hwf 0

 6.11-rc3 mainline, 4G SSD backing for ZSWAP, zstd, 64K mTHP:
 ------------------------------------------------------------
 /sys/fs/cgroup/iax/memory.events:
 high 4,693
 
 hoh_userland 14,069
 hoh_trycharge 0
 rec_high_hoh 4,694
 rec_high_hwf 0


 ZSWAP-mTHP, 176Gi ZRAM backing for ZSWAP, zstd, 64K mTHP:
 ---------------------------------------------------------
 /sys/fs/cgroup/iax/memory.events:
 high 139,495
 
 hoh_userland 156,628
 hoh_trycharge 0
 rec_high_hoh 140,039
 rec_high_hwf 0

 ZSWAP-mTHP, 4G SSD backing for ZSWAP, zstd, 64K mTHP:
 -----------------------------------------------------
 /sys/fs/cgroup/iax/memory.events:
 high 20,427
 
 /sys/fs/cgroup/iax/memory.swap.events:
 fail 20,856
 
 hoh_userland 31,346
 hoh_trycharge 0
 rec_high_hoh 20,513
 rec_high_hwf 0

This shows that in all cases, reclaim_high() is called only from the return
path to user mode after handling a page-fault.

Thanks,
Kanchana

> 
> I had confirmed earlier with counters that all calls to reclaim_high()
> were from include/linux/resume_user_mode.h::resume_user_mode_work().
> I will confirm this with zstd and bpftrace and share.
> 
> Thanks,
> Kanchana
> 
> >
> > >
> > > 2) There appears to be a direct co-relation between higher # of
> > >    memcg_swap_fail events, and an increase in memcg_high breaches and
> > >    reduction in usemem throughput. This combined with the observation in
> > >    (1) suggests that with a faster compressor, we need more swap slots,
> > >    that increases the probability of running out of swap slots with the 4G
> > >    SSD backing device.
> > >
> > > 3) Could the data shared earlier on reduction in memcg_high breaches
> with
> > >    64K mTHP swapout provide some more clues, if we agree with (1) and
> (2):
> > >
> > >    "Interestingly, the # of memcg_high events reduces significantly with
> 64K
> > >    mTHP as compared to the above 4K memcg_high events data, when
> > tested
> > >    with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-
> > mTHP)."
> > >
> > > 4) In the case of each zswap compressor, there are some runs that go
> > >    through with 0 memcg_swap_fail events. These runs generally have
> better
> > >    fewer memcg_high breaches and better sys time/throughput.
> > >
> > > 5) For a given swap setup, there is some amount of variance in
> > >    sys time for this workload.
> > >
> > > 6) All this suggests that the primary root cause is the concurrency setup,
> > >    where there could be randomness between runs as to the # of processes
> > >    that observe the memory.high breach due to other factors such as
> > >    availability of swap slots for alloc.
> > >
> > > To summarize, I believe the root-cause is the 4G SSD swapfile resulting in
> > > running out of swap slots, and anomalous behavior with over-reclaim
> when
> > 70
> > > concurrent processes are working with the 60G memory limit while trying
> to
> > > allocate 1G each; with randomness in processes reacting to the breach.
> > >
> > > The cgroup zswap charging exacerbates this situation, but is not a problem
> > > in and of itself.
> > >
> > > Nhat, as you pointed out, this is somewhat of an unrealistic scenario that
> > > doesn't seem to indicate any specific problems to be solved, other than
> the
> > > temporary cgroup zswap double-charging.
> > >
> > > Would it be fair to evaluate this patch-series based on a more realistic
> > > swapfile configuration based on 176G ZRAM, for which I had shared the
> > data
> > > in v2? There weren't any problems with swap slots availability or any
> > > anomalies that I can think of with this setup, other than the fact that the
> > > "Before" and "After" sys times could not be directly compared for 2 key
> > > reasons:
> > >
> > >  - ZRAM compressed data is not charged to the cgroup, similar to SSD.
> > >  - ZSWAP compressed data is charged to the cgroup.
> >
> > Yeah that's a bit unfair still. Wild idea, but what about we compare
> > SSD without zswap (or SSD with zswap, but without this patch series so
> > that mTHP are not zswapped) v.s zswap-on-zram (i.e with a backing
> > swapfile on zram block device).
> >
> > It is stupid, I know. But let's take advantage of the fact that zram
> > is not charged to cgroup, pretending that its memory foot print is
> > empty?
> >
> > I don't know how zram works though, so my apologies if it's a stupid
> > suggestion :)