RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

"Sridhar, Kanchana P" <kanchana.p.sridhar@xxxxxxxxx> · Wed, 28 Aug 2024 18:50:46 +0000

Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Sent: Wednesday, August 28, 2024 12:44 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: Nhat Pham <nphamcs@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-
> mm@xxxxxxxxx; hannes@xxxxxxxxxxx; ryan.roberts@xxxxxxx; Huang, Ying
> <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> [..]
> >
> > This shows that in all cases, reclaim_high() is called only from the return
> > path to user mode after handling a page-fault.
> 
> I am sorry I haven't been keeping up with this thread, I don't have a
> lot of capacity right now.
> 
> If my understanding is correct, the summary of the problem we are
> observing here is that with high concurrency (70 processes), we
> observe worse system time, worse throughput, and higher memory_high
> events with zswap than SSD swap. This is true (with varying degrees)
> for 4K or mTHP, and with or without charging zswap compressed memory.
> 
> Did I get that right?

Thanks for your review and comments! Yes, this is correct.

> 
> I saw you also mentioned that reclaim latency is directly correlated
> to higher memory_high events.

That was my observation based on the swap-constrained experiments with 4G SSD.
With a faster compressor, we allow allocations to proceed quickly, and if the pages
are not being faulted in, we need more swap slots. This increases the probability of
running out of swap slots with the 4G SSD backing device, which, as the data in v4
shows, causes memcg_swap_fail events that keep folios resident in memory
(triggering memcg_high breaches as allocations proceed, even without zswap cgroup
charging).

Things change when the experiments are run in a situation where there is abundant
swap space and when the default behavior of zswap compressed data being charged
to the cgroup is enabled, as in the data with 176GiB ZRAM as ZSWAP's backing
swapfile posted in v5. Now, the critical path to workload performance changes to
concurrent reclaims in response to memcg_high events due to allocation and zswap
usage. We see a smaller increase in swapout activity (compared to the swap-constrained
experiments in v4), and compress latency seems to become the bottleneck. Each
individual process's throughput/sys time degrades mainly as a function of compress
latency. Anyway, these were some of my learnings from these experiments. Please
do let me know if there are other insights/analysis I could be missing.

> 
> Is it possible that with SSD swap, because we wait for IO during
> reclaim, this gives a chance for other processes to allocate and free
> the memory they need. While with zswap because everything is
> synchronous, all processes are trying to allocate their memory at the
> same time resulting in higher reclaim rates?
> 
> IOW, maybe with zswap all the processes try to allocate their memory
> at the same time, so the total amount of memory needed at any given
> instance is much higher than memory.high, so we keep producing
> memory_high events and reclaiming. If 70 processes all require 1G at
> the same time, then we need 70G of memory at once, we will keep
> thrashing pages in/out of zswap.
> 
> While with SSD swap, due to the waits imposed by IO, the allocations
> are more spread out and more serialized, and the amount of memory
> needed at any given instance is lower; resulting in less reclaim
> activity and ultimately faster overall execution?

This is a very interesting hypothesis. It is along the lines of the observation I
gathered from the 4G SSD experiments: a "slower compressor" essentially causes
allocation stalls, which buffer us from the effect of swap slots being unavailable.
I think this is a possibility.

> 
> Could you please describe what the processes are doing? Are they
> allocating memory and holding on to it, or immediately freeing it?

I have been using the vm-scalability usemem workload for these experiments.
Thanks Ying for suggesting I use this workload!

I am running usemem with these config options: usemem --init-time -w -O -n 70 1g.
This forks 70 processes, each of which does the following:

1) Allocates 1G of virtual memory via mmap() with MAP_ANONYMOUS and read/write permissions.
2) Steps through and accesses each 8-byte chunk of memory in the mmap-ed region, and:
    2.a) Writes the index of that chunk to the (unsigned long *) memory at that index.
3) Generates statistics on throughput.

There is a "munmap()" after step (2.a) that I have commented out because I wanted to
see how much cold memory resides in the zswap zpool after the workload exits. Interestingly,
this was 0 for 64K mTHP, but on the order of several hundred MB for 2M THP.
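
For anyone who wants to see the access pattern without reading the usemem source,
here is a rough sketch of the steps above. It is not the usemem code, only an
approximation of what each forked process does. The names touch_region() and
run_workers() are made up for illustration, and MAP_PRIVATE is my assumption for
the mapping type.

```python
import mmap
import os
import struct
import time

WORD = 8  # bytes per chunk; sizeof(unsigned long) on 64-bit Linux


def touch_region(size_bytes):
    """Step 1 and 2: map anonymous memory and write each chunk's index into it."""
    # Anonymous, read/write mapping. MAP_PRIVATE is an assumption of this sketch.
    buf = mmap.mmap(-1, size_bytes,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE)
    nwords = size_bytes // WORD
    start = time.monotonic()
    for i in range(nwords):
        # Step 2.a: store the chunk index at that chunk's offset.
        struct.pack_into("=Q", buf, i * WORD, i)
    elapsed = time.monotonic() - start
    # The mapping is deliberately not unmapped, mirroring the commented-out munmap().
    return nwords, elapsed, buf


def run_workers(nproc, size_bytes):
    """Fork nproc children that each touch their own region; return how many exited 0."""
    pids = []
    for _ in range(nproc):
        pid = os.fork()
        if pid == 0:
            code = 1
            try:
                nwords, elapsed, _buf = touch_region(size_bytes)
                # Step 3: a very crude per-process statistic.
                print(f"pid {os.getpid()}: {nwords * WORD} bytes in {elapsed:.3f}s",
                      flush=True)
                code = 0
            finally:
                os._exit(code)  # never fall through into the parent's loop
        pids.append(pid)
    ok = 0
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0:
            ok += 1
    return ok
```

Calling run_workers(70, 1 << 30) would correspond to the "-n 70 1g" part of the
command line above, although an interpreted loop like this is far too slow to be a
meaningful benchmark. It is only meant to show that all the processes start
touching their memory at about the same time and hold on to it until they exit.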

> 
> Do you have visibility into when each process allocates and frees memory?

Yes. Hopefully the above offers some clarifications.

Thanks,
Kanchana