RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

"Sridhar, Kanchana P" <kanchana.p.sridhar@xxxxxxxxx> · Tue, 27 Aug 2024 06:08:23 +0000

Hi Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@xxxxxxxxx>
> Sent: Monday, August 26, 2024 7:12 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx; ryan.roberts@xxxxxxx;
> Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> On Fri, Aug 23, 2024 at 11:21 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
> >
> > Hi Nhat,
> >
> >
> > I started out with 2 main hypotheses to explain why zswap incurs more
> > reclaim wrt SSD:
> >
> > 1) The cgroup zswap charge, that hastens the memory.high limit to be
> >    breached, and adds to the reclaim being triggered in
> >    mem_cgroup_handle_over_high().
> >
> > 2) Does a faster reclaim path somehow cause fewer allocation stalls, thereby
> >    causing more breaches of memory.high, hence more reclaim -- and does
> >    this cycle repeat, potentially leading to higher swapout activity with
> >    zswap?
> 
> By faster reclaim path, do you mean zswap has a lower reclaim latency?

Thanks for your follow-up comments/suggestions. Yes, I was characterizing
lower zswap reclaim latency as a faster reclaim path.

> 
> >
> > I focused on gathering data with lz4 for this debug, under the reasonable
> > assumption that results with deflate-iaa will be better. Once we figure out
> > an overall direction on next steps, I will publish results with zswap lz4,
> > deflate-iaa, etc.
> >
> > All experiments except "Exp 1.A" are run with
> > usemem --init-time -w -O -n 70 1g.
> >
> > General settings for all data presented in this patch-series:
> >
> > vm.swappiness = 100
> > zswap shrinker_enabled = N
> >
> >  Experiment 1 - impact of not doing cgroup zswap charge:
> >  -------------------------------------------------------
> >
> > I wanted to first understand by how much we improve without the cgroup
> > zswap charge. I commented out the calls to both obj_cgroup_charge_zswap()
> > and obj_cgroup_uncharge_zswap() in zswap.c in my patch-set.
> > We improve throughput by quite a bit with this change, and are now better
> > than mTHP getting swapped out to SSD. We have also slightly improved on the
> > sys time, though this is still a regression as compared to SSD. If you
> > recall, we were worse on throughput and sys time with v4.
> 
> I'm not 100% sure about the validity of this pair of experiments.
> 
> The thing is, you cannot ignore zswap's memory footprint altogether.
> That's the whole point of the trade-off. It's probably gigabytes worth
> of unaccounted memory usage - I see that your SSD size is 4G, and
> since compression ratio is less than 2, that's potentially 2G worth of
> memory give or take you are not charging to the cgroup, which can
> altogether alter the memory pressure and reclaim dynamics.

I agree, the zswap memory utilization charging to the cgroup is the right
thing to do (assuming we solve the temporary double-charging, as Yosry
and you have pointed out). I have summarized the zswap memory footprint
with different compressors in the results towards the end of this email.
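
For reference, the back-of-the-envelope estimate above (a 4G swapfile at a
compression ratio a little under 2 leaves roughly 2G of compressed data
uncharged) can be sketched as follows. This is only an illustration, not code
from the patch-series; the function name and the "ratio scaled by 100"
convention are my own assumptions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate the compressed bytes held by zswap when `swapped_bytes` of
 * uncompressed data is stored at a compression ratio of ratio_x100 / 100
 * (e.g. 199 means 1.99). Hypothetical helper, for illustration only.
 */
uint64_t zswap_pool_estimate(uint64_t swapped_bytes, unsigned int ratio_x100)
{
	assert(ratio_x100 > 0);
	return swapped_bytes * 100u / ratio_x100;
}
```

Calling zswap_pool_estimate(4ULL << 30, 199) gives a little over 2 * 10^9
bytes, in line with the "potentially 2G" figure, assuming the whole 4G
swapfile is in use.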

> 
> The zswap charging itself is not the problem - that's fair and
> healthy. It might be the overreaction by the memory reclaim subsystem
> that seems anomalous?

I think so too, about the anomalous behavior.

> 
> >
> > Averages over 3 runs are summarized in each case.
> >
> >  Exp 1.A: usemem -n 1 58g:
> >  -------------------------
> >
> >  64KB mTHP (cgroup memory.high set to 60G):
> >  ==========================================
> >
> >                 SSD mTHP    zswap mTHP v4   zswap mTHP no_charge
> >  ----------------------------------------------------------------
> >  pswpout          586,352                0                      0
> >  zswpout            1,005        1,042,963                587,181
> >  ----------------------------------------------------------------
> >  Total swapout    587,357        1,042,963                587,181
> >  ----------------------------------------------------------------
> >
> > Without the zswap charge to cgroup, the total swapout activity for
> > zswap-mTHP is on par with that of SSD-mTHP for the single process case.
> >
> >
> >  Exp 1.B: usemem -n 70 1g:
> >  -------------------------
> >  v4 results with cgroup zswap charge:
> >  ------------------------------------
> >
> >  64KB mTHP (cgroup memory.high set to 60G):
> >  ==========================================
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
> >  |------------------------------------------------------------------|
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
> >  |                    |                   |        sec |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
> >   ------------------------------------------------------------------
> >
> >   -----------------------------------------------------------------------
> >  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> >  |                              |   mainline |       Store |       Store |
> >  |                              |            |         lz4 | deflate-iaa |
> >  |-----------------------------------------------------------------------|
> >  | pswpout                      |    174,432 |           0 |           0 |
> >  | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
> >   -----------------------------------------------------------------------
> >
> >  Debug results without cgroup zswap charge in both, "Before" and "After":
> >  ------------------------------------------------------------------------
> >
> >  64KB mTHP (cgroup memory.high set to 60G):
> >  ==========================================
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |    300,565 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |    420,125 |        40% |
> >  |------------------------------------------------------------------|
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
> >  |                    |                   |        sec |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |      90.76 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     213.09 |      -135% |
> >   ------------------------------------------------------------------
> >
> >   ---------------------------------------------------------
> >  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |
> >  |                              |   mainline |       Store |
> >  |                              |            |         lz4 |
> >  |----------------------------------------------------------
> >  | pswpout                      |    330,640 |           0 |
> >  | zswpout                      |      1,527 |   1,384,725 |
> >  |----------------------------------------------------------
> >  | hugepages-64kB/stats/zswpout |            |      63,335 |
> >  |----------------------------------------------------------
> >  | hugepages-64kB/stats/swpout  |     18,242 |           0 |
> >   ---------------------------------------------------------
> >
> 
> Hmm, in the 70 processes case, it looks like we're still seeing
> latency regression, and that same pattern of overreclaiming, even
> without zswap cgroup charging?
> 
> That seems like a hint - concurrency exacerbates the problem?

Agreed, that was my conclusion as well.

> 
> >
> > Based on these results, I kept the cgroup zswap charging commented out in
> > subsequent debug steps, so as to not place zswap at a disadvantage when
> > trying to determine further causes for hypothesis (1).
> >
> >
> > Experiment 2 - swap latency/reclamation with 64K mTHP:
> > ------------------------------------------------------
> >
> > Number of swap_writepage    Total swap_writepage  Average swap_writepage
> >     calls from all cores      Latency (millisec)      Latency (microsec)
> > ---------------------------------------------------------------------------
> > SSD               21,373               165,434.9                   7,740
> > zswap            344,109                55,446.8                     161
> > ---------------------------------------------------------------------------
> >
> >
> > Reclamation analysis: 64k mTHP swapout:
> > ---------------------------------------
> > "Before":
> >   Total SSD compressed data size   =  1,362,296,832  bytes
> >   Total SSD write IO latency       =        887,861  milliseconds
> >
> >   Average SSD compressed data size =      1,089,837  bytes
> >   Average SSD write IO latency     =        710,289  microseconds
> >
> > "After":
> >   Total ZSWAP compressed pool size =  2,610,657,430  bytes
> >   Total ZSWAP compress latency     =         55,984  milliseconds
> >
> >   Average ZSWAP compress length    =          2,055  bytes
> >   Average ZSWAP compress latency   =             44  microseconds
> >
> >   zswap-LZ4 mTHP compression ratio =  1.99
> >   All moderately compressible pages. 0 zswap_store errors.
> >   84% of pages compress to 2056 bytes.
> 
> Hmm this ratio isn't very good indeed - it is less than 2-to-1 memory saving...
> 
> Internally, we often see 1-3 or 1-4 saving ratio (or even more).

Agree with this as well. In our experiments with other workloads, we
typically see much higher ratios.

> 
> Probably does not explain everything, but worth double checking -
> could you check with zstd to see if the ratio improves.

Sure. I gathered ratio and compressed memory footprint data today with
64K mTHP, the 4G SSD swapfile and different zswap compressors.

 This patch-series and no zswap charging, 64K mTHP:
---------------------------------------------------------------------------
                       Total         Total     Average      Average   Comp
                  compressed   compression  compressed  compression  ratio
                      length       latency      length      latency
                       bytes  milliseconds       bytes  nanoseconds
---------------------------------------------------------------------------
SSD (no zswap) 1,362,296,832       887,861  
lz4            2,610,657,430        55,984       2,055      44,065    1.99
zstd             729,129,528        50,986         565      39,510    7.25
deflate-iaa    1,286,533,438        44,785       1,415      49,252    2.89
---------------------------------------------------------------------------

zstd does very well on ratio, as expected.
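
The "Comp ratio" column is consistent with dividing the 4K page size by the
average compressed length. A minimal sketch of that arithmetic, assuming 4 KiB
base pages and rounding to the nearest hundredth (the helper name is
hypothetical):

```c
#include <assert.h>

#define PAGE_SIZE_BYTES 4096u	/* assumption: 4 KiB base pages */

/*
 * Compression ratio scaled by 100 (199 means 1.99), rounded to nearest,
 * computed from the average compressed length of one page in bytes.
 */
unsigned int comp_ratio_x100(unsigned int avg_compressed_len)
{
	assert(avg_compressed_len > 0);
	return (PAGE_SIZE_BYTES * 100u + avg_compressed_len / 2) /
	       avg_compressed_len;
}
```

With the average lengths in the table, comp_ratio_x100(2055), (565) and (1415)
give 199, 725 and 289, i.e. the 1.99, 7.25 and 2.89 listed for lz4, zstd and
deflate-iaa.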

> 
> >
> >
> >  Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> >  ------------------------------------------------------------
> >
> >  I wanted to take a step back and understand how the mainline v6.11-rc3
> >  handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and when
> >  swapped out to ZSWAP. Interestingly, higher swapout activity is observed
> >  with 4K folios and v6.11-rc3 (with the debug change to not charge zswap to
> >  cgroup).
> >
> >  v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
> >
> >  -------------------------------------------------------------
> >  SSD (CONFIG_ZSWAP is OFF)       ZSWAP          lz4    lzo-rle
> >  -------------------------------------------------------------
> >  cgroup memory.events:           cgroup memory.events:
> >
> >  low                 0           low              0          0
> >  high            5,068           high       321,923    375,116
> >  max                 0           max              0          0
> >  oom                 0           oom              0          0
> >  oom_kill            0           oom_kill         0          0
> >  oom_group_kill      0           oom_group_kill   0          0
> >  -------------------------------------------------------------
> >
> >  SSD (CONFIG_ZSWAP is OFF):
> >  --------------------------
> >  pswpout            415,709
> >  sys time (sec)      301.02
> >  Throughput KB/s    155,970
> >  memcg_high events    5,068
> >  --------------------------
> >
> >
> >  ZSWAP                  lz4         lz4         lz4     lzo-rle
> >  --------------------------------------------------------------
> >  zswpout          1,598,550   1,515,151   1,449,432   1,493,917
> >  sys time (sec)      889.36      481.21      581.22      635.75
> >  Throughput KB/s     35,176      14,765      20,253      21,407
> >  memcg_high events  321,923     412,733     369,976     375,116
> >  --------------------------------------------------------------
> >
> >  This shows that there is a performance regression of -60% to -195% with
> >  zswap as compared to SSD with 4K folios. The higher swapout activity with
> >  zswap is seen here too (i.e., this doesn't appear to be mTHP-specific).
> >
> >  I verified this to be the case even with the v6.7 kernel, which also
> >  showed a 2.3X throughput improvement when we don't charge zswap:
> >
> >  ZSWAP lz4                 v6.7      v6.7 with no cgroup zswap charge
> >  --------------------------------------------------------------------
> >  zswpout              1,419,802       1,398,620
> >  sys time (sec)           535.4          613.41
> 
> systime increases without zswap cgroup charging? That's strange...

Additional data gathered with v6.11-rc3 (listed below), based on your
suggestion to investigate potential swap.high breaches, should hopefully
provide some explanation.

> 
> >  Throughput KB/s          8,671          20,045
> >  memcg_high events      574,046         451,859
> 
> So, on 4k folio setup, even without cgroup charge, we are still seeing:
> 
> 1. More zswpout (than observed in SSD)
> 2. 40-50% worse latency - in fact it is worse without zswap cgroup charging.
> 3. 100 times the amount of memcg_high events? This is perhaps the
> *strangest* to me. You're already removing zswap cgroup charging, then
> where does this come from? How can we have a memory.high violation when
> zswap does *not* contribute to memory usage?
> 
> Is this due to swap limit charging? Do you have a cgroup swap limit?
> 
> mem_high = page_counter_read(&memcg->memory) >
>            READ_ONCE(memcg->memory.high);
> swap_high = page_counter_read(&memcg->swap) >
>            READ_ONCE(memcg->swap.high);
> [...]
> 
> if (mem_high || swap_high) {
>     /*
>     * The allocating tasks in this cgroup will need to do
>     * reclaim or be throttled to prevent further growth
>     * of the memory or swap footprints.
>     *
>     * Target some best-effort fairness between the tasks,
>     * and distribute reclaim work and delay penalties
>     * based on how much each task is actually allocating.
>     */
>     current->memcg_nr_pages_over_high += batch;
>     set_notify_resume(current);
>     break;
> }
> 

I don't have a swap.high limit set on the cgroup; it is set to "max".

I ran experiments with v6.11-rc3, no zswap charging, 4K folios and different
zswap compressors to verify if swap.high is breached with the 4G SSD swapfile.

 SSD (CONFIG_ZSWAP is OFF):

                                SSD          SSD          SSD
 ------------------------------------------------------------ 
 pswpout                    415,709    1,032,170      636,582
 sys time (sec)              301.02       328.15       306.98
 Throughput KB/s            155,970       89,621      122,219
 memcg_high events            5,068       15,072        8,344
 memcg_swap_high events           0            0            0
 memcg_swap_fail events           0            0            0
 ------------------------------------------------------------

 ZSWAP                               zstd         zstd       zstd
 ----------------------------------------------------------------
 zswpout                        1,391,524    1,382,965  1,417,307
 sys time (sec)                    474.68       568.24     489.80
 Throughput KB/s                   26,099       23,404    111,115
 memcg_high events                335,112      340,335    162,260
 memcg_swap_high events                 0            0          0
 memcg_swap_fail events         1,226,899    5,742,153
  (mem_cgroup_try_charge_swap)
 memcg_memory_stat_pgactivate   1,259,547
  (shrink_folio_list)
 ----------------------------------------------------------------

 ZSWAP                      lzo-rle      lzo-rle     lzo-rle
 -----------------------------------------------------------
 zswpout                  1,493,917    1,363,040   1,428,133
 sys time (sec)              635.75       498.63      484.65
 Throughput KB/s             21,407       23,827      20,237
 memcg_high events          375,116      352,814     373,667
 memcg_swap_high events           0            0           0
 memcg_swap_fail events     715,211      
 -----------------------------------------------------------

 ZSWAP                         lz4         lz4        lz4          lz4
 ---------------------------------------------------------------------
 zswpout                 1,378,781   1,598,550   1,515,151   1,449,432
 sys time (sec)             495.45      889.36      481.21      581.22
 Throughput KB/s            26,248      35,176      14,765      20,253
 memcg_high events         347,209     321,923     412,733     369,976
 memcg_swap_high events          0           0           0           0
 memcg_swap_fail events    580,103           0 
 ---------------------------------------------------------------------

 ZSWAP                  deflate-iaa   deflate-iaa    deflate-iaa
 ----------------------------------------------------------------
 zswpout                    380,471     1,440,902      1,397,965
 sys time (sec)              329.06        570.77         467.41
 Throughput KB/s            283,867        28,403        190,600
 memcg_high events            5,551       422,831         28,154
 memcg_swap_high events           0             0              0
 memcg_swap_fail events           0     2,686,758        438,562
 ----------------------------------------------------------------

There are no swap.high memcg events recorded in any of the SSD/zswap
 experiments. However, I do see a significant number of memcg_swap_fail
 events in some of the zswap runs, for all 3 compressors. This is not
 consistent, because there are some runs with 0 memcg_swap_fail for all
 compressors.

 There is a possible correlation between memcg_swap_fail events
 (/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
 events. The root cause appears to be the following: when no swap slots are
 available, memcg_swap_fail is incremented and add_to_swap() fails in
 shrink_folio_list(), which then takes the "activate_locked:" path for the
 folio. The folio re-activation is recorded in cgroup memory.stat pgactivate
 events. The failure to swap out folios due to lack of swap slots could
 contribute towards memory.high breaches.
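
As an aside, the per-run memcg_swap_fail numbers can be collected by reading
the "fail" counter from memory.swap.events before and after a run. The
following is a small user-space sketch of the parsing step only; the function
name is hypothetical, and it assumes the usual cgroup v2 "key value" line
format of that file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the counter for `key` in cgroup v2 event text made of
 * "key value\n" lines, or -1 if the key is not present.
 */
long long cgroup_event_count(const char *text, const char *key)
{
	const char *line = text;
	size_t klen;

	assert(text && key);
	klen = strlen(key);

	while (*line) {
		long long val;

		/* Require "key " at the start of a line, not a substring match. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ' &&
		    sscanf(line + klen + 1, "%lld", &val) == 1)
			return val;

		line = strchr(line, '\n');
		if (!line)
			break;
		line++;	/* step past the newline to the next line */
	}
	return -1;
}
```

For example, with the file contents "high 0\nmax 0\nfail 580103\n" (the lz4
run above), cgroup_event_count(text, "fail") returns 580103 and
cgroup_event_count(text, "high") returns 0.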

swp_entry_t folio_alloc_swap(struct folio *folio)
{
...
	get_swap_pages(1, &entry, 0);
out:
	if (mem_cgroup_try_charge_swap(folio, entry)) {
		put_swap_folio(folio, entry);
		entry.val = 0;
	}
	return entry;
}

int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
{
...
	if (!entry.val) {
		WARN_ONCE(1, "__mem_cgroup_try_charge_swap: MEMCG_SWAP_FAIL entry.val is 0");
		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
		return 0;
	}

...
}

This is the call stack (v6.11-rc3 mainline) as reference for the above
analysis:

[  109.130504] __mem_cgroup_try_charge_swap: MEMCG_SWAP_FAIL entry.val is 0
[  109.130515] WARNING: CPU: 143 PID: 5200 at mm/memcontrol.c:5011 __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130652] RIP: 0010:__mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130682] Call Trace:
[  109.130686]  <TASK>
[  109.130689] ? __warn (kernel/panic.c:735) 
[  109.130695] ? __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130698] ? report_bug (lib/bug.c:201 lib/bug.c:219) 
[  109.130705] ? prb_read_valid (kernel/printk/printk_ringbuffer.c:2183) 
[  109.130710] ? handle_bug (arch/x86/kernel/traps.c:239) 
[  109.130715] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1)) 
[  109.130718] ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621) 
[  109.130722] ? __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130725] ? __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130728] folio_alloc_swap (mm/swap_slots.c:348) 
[  109.130734] add_to_swap (mm/swap_state.c:189) 
[  109.130737] shrink_folio_list (mm/vmscan.c:1235) 
[  109.130744] ? __mod_zone_page_state (mm/vmstat.c:367) 
[  109.130748] ? isolate_lru_folios (mm/vmscan.c:1598 mm/vmscan.c:1736) 
[  109.130753] shrink_inactive_list (./include/linux/spinlock.h:376 mm/vmscan.c:1961) 
[  109.130758] shrink_lruvec (mm/vmscan.c:2194 mm/vmscan.c:5706) 
[  109.130763] shrink_node (mm/vmscan.c:5910 mm/vmscan.c:5948) 
[  109.130768] do_try_to_free_pages (mm/vmscan.c:6134 mm/vmscan.c:6254) 
[  109.130772] try_to_free_mem_cgroup_pages (./include/linux/sched/mm.h:355 ./include/linux/sched/mm.h:456 mm/vmscan.c:6588) 
[  109.130778] reclaim_high (mm/memcontrol.c:1906) 
[  109.130783] mem_cgroup_handle_over_high (./include/linux/memcontrol.h:556 mm/memcontrol.c:2001 mm/memcontrol.c:2108) 
[  109.130787] irqentry_exit_to_user_mode (./include/linux/resume_user_mode.h:60 kernel/entry/common.c:114 ./include/linux/entry-common.h:328 kernel/entry/common.c:231) 
[  109.130792] asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623) 

 However, this is probably not the only cause for either the high # of
 memory.high breaches or the over-reclaim with zswap, as seen in the lz4
 data, where the # of memory.high breaches is significant even in runs with
 no memcg_swap_fail events.

Some observations/questions based on the above 4K folios swapout data:

1) There are more memcg_high events as the swapout latency reduces
   (i.e. faster swap-write path). This is even without charging zswap
   utilization to the cgroup.

2) There appears to be a direct correlation between a higher # of
   memcg_swap_fail events, an increase in memcg_high breaches, and a
   reduction in usemem throughput. This, combined with the observation in
   (1), suggests that with a faster compressor we need more swap slots,
   which increases the probability of running out of swap slots with the 4G
   SSD backing device.

3) Could the data shared earlier on reduction in memcg_high breaches with
   64K mTHP swapout provide some more clues, if we agree with (1) and (2):

   "Interestingly, the # of memcg_high events reduces significantly with 64K
   mTHP as compared to the above 4K memcg_high events data, when tested
   with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP)."

4) In the case of each zswap compressor, there are some runs that go
   through with 0 memcg_swap_fail events. These runs generally have
   fewer memcg_high breaches and better sys time/throughput.

5) For a given swap setup, there is some amount of variance in
   sys time for this workload.

6) All this suggests that the primary root cause is the concurrency setup,
   where there could be randomness between runs as to the # of processes
   that observe the memory.high breach due to other factors such as
   availability of swap slots for alloc.
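
To put a rough number on observation (2), one can compute a Pearson
correlation between memcg_swap_fail and memcg_high for a compressor where all
runs reported both counters. The sketch below uses the three deflate-iaa runs
from the 4K folios table. It is illustrative only: three samples are far too
few for a statistical claim, and the function returns a signed r^2 (my own
convention) so that no square root is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Values copied from the three deflate-iaa runs in the 4K folios table. */
const double iaa_swap_fail[]  = { 0.0, 2686758.0, 438562.0 };
const double iaa_memcg_high[] = { 5551.0, 422831.0, 28154.0 };

/*
 * Squared Pearson correlation of x[] and y[], carrying the sign of the
 * covariance: close to +1 means strongly positively correlated.
 */
double signed_r2(const double *x, const double *y, size_t n)
{
	double mx = 0.0, my = 0.0, sxy = 0.0, sxx = 0.0, syy = 0.0, r2;
	size_t i;

	assert(x && y && n >= 2);

	for (i = 0; i < n; i++) {
		mx += x[i];
		my += y[i];
	}
	mx /= (double)n;
	my /= (double)n;

	for (i = 0; i < n; i++) {
		double dx = x[i] - mx, dy = y[i] - my;

		sxy += dx * dy;
		sxx += dx * dx;
		syy += dy * dy;
	}
	assert(sxx > 0.0 && syy > 0.0);	/* both series must vary */

	r2 = (sxy * sxy) / (sxx * syy);
	return sxy < 0.0 ? -r2 : r2;
}
```

signed_r2(iaa_swap_fail, iaa_memcg_high, 3) comes out positive and close to 1
for these three runs, which matches the trend described in (2) but does not by
itself establish causation.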

To summarize, I believe the root-cause is the 4G SSD swapfile resulting in
running out of swap slots, and anomalous behavior with over-reclaim when 70
concurrent processes are working with the 60G memory limit while trying to
allocate 1G each; with randomness in processes reacting to the breach.
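
A rough sizing sketch of why the 4G swapfile can run out of slots in this
setup. The assumptions are mine: 4 KiB pages, "1g" meaning 1 GiB per process,
one swap slot per 4K page (ignoring the swap header), and the worst case where
all 70 processes are fully resident at the same time.

```c
#include <assert.h>
#include <stdint.h>

#define GiB		(1024ULL * 1024 * 1024)
#define PAGE_SIZE_BYTES	4096ULL	/* assumption: 4 KiB base pages */

/* Number of 4K swap slots provided by a swap device of the given size. */
uint64_t swap_slots(uint64_t swap_bytes)
{
	return swap_bytes / PAGE_SIZE_BYTES;
}

/*
 * Upper bound on the pages that have to leave memory when every process
 * holds its full allocation: total demand minus the memory.high limit.
 */
uint64_t overflow_pages(unsigned int nr_procs, uint64_t bytes_per_proc,
			uint64_t mem_high_bytes)
{
	uint64_t demand;

	/* Guard against overflow in the multiplication below. */
	assert(nr_procs > 0 && bytes_per_proc <= UINT64_MAX / nr_procs);
	demand = (uint64_t)nr_procs * bytes_per_proc;

	return demand > mem_high_bytes ?
	       (demand - mem_high_bytes) / PAGE_SIZE_BYTES : 0;
}
```

swap_slots(4 * GiB) is 1,048,576 slots, while overflow_pages(70, 1 * GiB,
60 * GiB) is 2,621,440 pages, so under these worst-case assumptions the demand
for slots is about 2.5x what the swapfile provides.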

The cgroup zswap charging exacerbates this situation, but is not a problem
in and of itself.

Nhat, as you pointed out, this is somewhat of an unrealistic scenario that
doesn't seem to indicate any specific problems to be solved, other than the
temporary cgroup zswap double-charging.

Would it be fair to evaluate this patch-series based on a more realistic
swapfile configuration based on 176G ZRAM, for which I had shared the data
in v2? There weren't any problems with swap slots availability or any
anomalies that I can think of with this setup, other than the fact that the
"Before" and "After" sys times could not be directly compared for 2 key
reasons:

 - ZRAM compressed data is not charged to the cgroup, similar to SSD.
 - ZSWAP compressed data is charged to the cgroup.

This disparity causes fewer swapouts, better sys time/throughput in the
"Before" experiments.

In the "After" experiments, this disparity causes more swapouts only with
zswap-lz4 due to the poorer compression ratio combined with the cgroup
charge; and hence a regression in sys time/throughput.

However, the better compression ratio with deflate-iaa results in
comparable # of swapouts as "Before", with better sys time/throughput.

My main rationale for suggesting the v2 ZRAM swapfile data is that the
disparities are the same as with the 4G SSD swapfile, but there are no
anomalies, and the data has reasonable explanations.

I would appreciate everyone's thoughts on this. If this sounds OK, then I can
submit a v5 with the changes suggested by Yosry.

I am listing here the v2 data with 176G ZRAM swapfile again, just for reference.

v2 data with cgroup zswap charging:
-----------------------------------

 64KB mTHP:
 ==========
  ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
 |                    |                   |       KB/s |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |    118,928 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     82,665 |       -30% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |    176,210 |        48% |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
 |                    |                   |        sec |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |   1,032.20 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |   1,854.51 |       -80% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |     582.71 |        44% |
  ------------------------------------------------------------------

  -----------------------------------------------------------------------
 | VMSTATS, mTHP ZSWAP stats,   |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
 | mTHP ZRAM stats:             |   mainline |       Store |       Store |
 |                              |            |         lz4 | deflate-iaa |
 |-----------------------------------------------------------------------|
 | pswpin                       |         16 |           0 |           0 |
 | pswpout                      |  7,770,720 |           0 |           0 |
 | zswpin                       |        547 |         695 |         579 |
 | zswpout                      |      1,394 |  15,462,778 |   7,284,554 |
 |-----------------------------------------------------------------------|
 | thp_swpout                   |          0 |           0 |           0 |
 | thp_swpout_fallback          |          0 |           0 |           0 |
 | pgmajfault                   |      3,786 |       3,541 |       3,367 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/zswpout |            |     966,328 |     455,196 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/swpout  |    485,670 |           0 |           0 |
  -----------------------------------------------------------------------
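
As a consistency check on the table above, the hugepages-64kB zswpout count
can be converted to 4K pages and compared with the zswpout vmstat. This sketch
assumes that zswpout counts 4K pages while the hugepages-*/stats counters
count folios; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Base (4K) pages covered by nr_folios folios of folio_kb KiB each. */
uint64_t folios_to_pages(uint64_t nr_folios, unsigned int folio_kb)
{
	assert(folio_kb >= 4 && folio_kb % 4 == 0);
	return nr_folios * (folio_kb / 4);
}

/*
 * Share of zswpout pages that were stored as whole mTHP folios,
 * in units of 0.01% (9999 means 99.99%), truncated.
 */
unsigned int mthp_share_bp(uint64_t mthp_folios, unsigned int folio_kb,
			   uint64_t zswpout_pages)
{
	assert(zswpout_pages > 0);
	return (unsigned int)(folios_to_pages(mthp_folios, folio_kb) * 10000 /
			      zswpout_pages);
}
```

For zswap lz4, folios_to_pages(966328, 64) is 15,461,248 pages out of
15,462,778 zswpout, so mthp_share_bp() returns 9999: under these assumptions
nearly all zswap stores in this run were 64K folios.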

 2MB PMD-THP/2048K mTHP:
 =======================
  ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
 |                    |                   |       KB/s |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |    177,340 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     84,030 |       -53% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |    185,691 |         5% |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
 |                    |                   |        sec |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |     876.29 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |   1,740.55 |       -99% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |     650.33 |        26% |
  ------------------------------------------------------------------

  ------------------------------------------------------------------------- 
 | VMSTATS, mTHP ZSWAP stats,     |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
 | mTHP ZRAM stats:               |   mainline |       Store |       Store |
 |                                |            |         lz4 | deflate-iaa |
 |-------------------------------------------------------------------------|
 | pswpin                         |          0 |           0 |           0 |
 | pswpout                        |  8,628,224 |           0 |           0 |
 | zswpin                         |        678 |      22,733 |       1,641 |
 | zswpout                        |      1,481 |  14,828,597 |   9,404,937 |
 |-------------------------------------------------------------------------|
 | thp_swpout                     |     16,852 |           0 |           0 |
 | thp_swpout_fallback            |          0 |           0 |           0 |
 | pgmajfault                     |      3,467 |      25,550 |       4,800 |
 |-------------------------------------------------------------------------|
 | hugepages-2048kB/stats/zswpout |            |      28,924 |      18,366 |
 |-------------------------------------------------------------------------|
 | hugepages-2048kB/stats/swpout  |     16,852 |           0 |           0 |
  -------------------------------------------------------------------------
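
A quick cross-check of the counters above, assuming 4K base pages and that
pswpout/zswpout count base pages: each 2048kB folio accounts for 512 pages.
The sketch below only does that conversion; the helper name is made up for
illustration.

```c
#include <assert.h>

#define BASE_PAGE_KB 4UL	/* assumption: 4K base page size */

/*
 * Convert a count of huge folios of size folio_kb into the number of
 * base pages they cover. Illustrative helper, not a kernel API.
 */
unsigned long folios_to_base_pages(unsigned long nr_folios,
				   unsigned long folio_kb)
{
	assert(folio_kb % BASE_PAGE_KB == 0);

	return nr_folios * (folio_kb / BASE_PAGE_KB);
}
```

For mainline, 16,852 folios * 512 = 8,628,224, which equals pswpout exactly.
For zswap-mTHP-Store, 28,924 * 512 = 14,809,088 (lz4) and 18,366 * 512 =
9,403,392 (deflate-iaa), i.e. all but 19,509 and 1,545 of the respective
zswpout pages; the remainder was presumably stored as smaller folios.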

> 
> >  --------------------------------------------------------------------
> >
> >
> > Summary from the debug:
> > -----------------------
> > 1) Excess reclaim is exacerbated by zswap charge to cgroup. Without the
> >    charge, reclaim is on par with SSD for mTHP in the single process
> >    case. The excess reclaim with multiple processes most likely results
> >    from over-reclaim done by the cores, in their respective calls
> >    to mem_cgroup_handle_over_high().
> 
> Exacerbate, yes. I'm not 100% sure it's the sole or even the main cause.
> 
> You still see a degree of overreclaiming without zswap cgroup charging in:
> 
> 1. 70 processes, with mTHP
> 2. 70 processes, with 4K folios.

That's correct, although the over-reclaiming is not as bad with mTHP.

> 
> >
> > 2) The higher swapout activity with zswap as compared to SSD does not
> >    appear to be specific to mTHP. Higher reclaim activity and sys time
> >    regression with zswap (as compared to a setup where there is only SSD
> >    configured as swap) exists with 4K pages as far back as v6.7.
> 
> Yeah I can believe that without mthp, the same-ish workload would
> cause the same regression.

This makes sense.

> 
> >
> > 3) The debug indicates the hypothesis (2) is worth more investigation:
> >    Does a faster reclaim path somehow cause fewer allocation stalls; thereby
> >    causing more breaches of memory.high, hence more reclaim -- and does
> >    this cycle repeat, potentially leading to higher swapout activity with
> >    zswap? Any advice on whether this is a possibility, and
> >    suggestions/pointers to verify it, would be greatly appreciated.
> 
> Add stalls along the zswap path? :)

Yes, possibly! Hopefully, what I learned about swap slot availability from
today's experiments makes things a little clearer.

> 
> >
> > 4) Interestingly, the # of memcg_high events reduces significantly with 64K
> >    mTHP as compared to the above 4K high events data, when tested with v4
> >    and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP). This
> >    potentially indicates something to do with allocation efficiency
> >    countering the higher reclaim that seems to be caused by swapout
> >    efficiency.
> >
> > 5) Nhat, Yosry: would it be possible for you to run the 4K folios
> >    usemem -n 70 1g (with 60G memory.high) experiment with 4G and some
> >    higher value SSD configuration in your setup and, say, v6.11-rc3? I
> >    would like to rule out the memory-constrained 4G SSD in my setup
> >    somehow skewing the behavior of zswap vis-a-vis
> >    allocation/memcg_handle_over_high/reclaim. I realize your time is
> >    valuable, however I think an independent confirmation of what I have
> >    been observing, would be really helpful for us to figure out potential
> >    root-causes and solutions.
> 
> It might take a while for me to set up your benchmark, but yeah 4G
> swapfile seems small on a 64G host - of course it depends on the
> workload, but this has a lot of memory usage. In fact the total memory
> usage (70G?) is slightly above memory.high + 4G swapfile - note that
> this is exacerbated by, once again, zswap's less-than-100% memory
> saving ratio.

I agree, this is a somewhat unrealistic setup. Hopefully the data and
learnings I shared from today's experiments provide some insight into
possible root causes for the anomalous over-reclaim behavior.
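
As an illustration of the over-reclaim concern, here is a small user-space
model (not kernel code) of the retry loop in mem_cgroup_handle_over_high(),
with and without the early exit from item 6 quoted further down. It makes
strong simplifying assumptions: the stand-in reclaim function always
reclaims exactly what it is asked for, the cgroup is treated as staying over
memory.high so every retry is taken, and MAX_RETRIES is only a placeholder
for the retry budget. The function names are hypothetical.

```c
#include <assert.h>

#define SWAP_CLUSTER_MAX 32UL	/* same value as the kernel constant */
#define MAX_RETRIES 16		/* assumption: stand-in for the retry budget */

/* Hypothetical stand-in for reclaim_high(): always fully succeeds. */
static unsigned long fake_reclaim_high(unsigned long nr_to_reclaim)
{
	return nr_to_reclaim;
}

/*
 * Model of one over-high handling pass for a task that charged nr_pages.
 * Returns the total number of pages reclaimed across all attempts.
 * early_exit != 0 models the debug change: stop once the running total
 * reaches nr_pages.
 */
unsigned long model_over_high(unsigned long nr_pages, int early_exit)
{
	unsigned long nr_reclaimed_total = 0;
	int in_retry = 0;
	int attempt;

	for (attempt = 0; attempt <= MAX_RETRIES; attempt++) {
		nr_reclaimed_total +=
			fake_reclaim_high(in_retry ? SWAP_CLUSTER_MAX : nr_pages);

		if (early_exit && nr_reclaimed_total >= nr_pages)
			break;

		/* Worst case: still over memory.high, so retry. */
		in_retry = 1;
	}

	return nr_reclaimed_total;
}
```

In this worst-case model a 64-page charge reclaims 64 pages with the early
exit and 64 + 16 * 32 = 576 pages without it. The real loop stops much
earlier in most cases, so this only shows the direction of the effect, not
its magnitude.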

Thanks,
Kanchana

> 
> >
> > 6) I tried a small change in memcontrol.c::mem_cgroup_handle_over_high()
> >    to break out of the loop if we have reclaimed a total of at least
> >    "nr_pages":
> >
> >         nr_reclaimed = reclaim_high(memcg,
> >                                     in_retry ? SWAP_CLUSTER_MAX : nr_pages,
> >                                     gfp_mask);
> >
> > +       nr_reclaimed_total += nr_reclaimed;
> > +
> > +       if (nr_reclaimed_total >= nr_pages)
> > +               goto out;
> >
> >
> >    This was only for debug purposes, and did seem to mitigate the higher
> >    reclaim behavior for 4K folios:
> >
> >  ZSWAP                  lz4             lz4             lz4
> >  ----------------------------------------------------------
> >  zswpout          1,305,367       1,349,195       1,529,235
> >  sys time (sec)      472.06          507.76          646.39
> >  Throughput KB/s     55,144          21,811          88,310
> >  memcg_high events  257,890         343,213         172,351
> >  ----------------------------------------------------------
> >
> > On average, this change results in 17% improvement in sys time, 2.35X
> > improvement in throughput and 30% fewer memcg_high events.
> >
> > I look forward to further inputs on next steps.
> >
> > Thanks,
> > Kanchana
> >
> >
> > >
> > > Thanks for this analysis. I will debug this some more, so we can better
> > > understand these results.
> > >
> > > Thanks,
> > > Kanchana