> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Sent: Wednesday, August 28, 2024 3:34 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: Nhat Pham <nphamcs@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx; ryan.roberts@xxxxxxx;
> Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx;
> akpm@xxxxxxxxxxxxxxxxxxxx; Zou, Nanhai <nanhai.zou@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
> <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> On Wed, Aug 28, 2024 at 11:50 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
> >
> > Hi Yosry,
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > > Sent: Wednesday, August 28, 2024 12:44 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > > Cc: Nhat Pham <nphamcs@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> > > linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx; ryan.roberts@xxxxxxx;
> > > Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx;
> > > akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>;
> > > Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
> > > <vinodh.gopal@xxxxxxxxx>
> > > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> > >
> > > [..]
> > > >
> > > > This shows that in all cases, reclaim_high() is called only from the
> > > > return path to user mode after handling a page-fault.
> > >
> > > I am sorry I haven't been keeping up with this thread, I don't have a
> > > lot of capacity right now.
> > >
> > > If my understanding is correct, the summary of the problem we are
> > > observing here is that with high concurrency (70 processes), we
> > > observe worse system time, worse throughput, and higher memory_high
> > > events with zswap than SSD swap. This is true (with varying degrees)
> > > for 4K or mTHP, and with or without charging zswap compressed memory.
> > >
> > > Did I get that right?
> >
> > Thanks for your review and comments! Yes, this is correct.
> >
> > >
> > > I saw you also mentioned that reclaim latency is directly correlated
> > > to higher memory_high events.
> >
> > That was my observation based on the swap-constrained experiments with
> > 4G SSD. With a faster compressor, we allow allocations to proceed
> > quickly, and if the pages are not being faulted in, we need more swap
> > slots. This increases the probability of running out of swap slots with
> > the 4G SSD backing device, which, as the data in v4 shows, causes
> > memcg_swap_fail events, that drive folios to be resident in memory
> > (triggering memcg_high breaches as allocations proceed even without
> > zswap cgroup charging).
> >
> > Things change when the experiments are run in a situation where there
> > is abundant swap space and when the default behavior of zswap
> > compressed data being charged to the cgroup is enabled, as in the data
> > with 176GiB ZRAM as ZSWAP's backing swapfile posted in v5. Now, the
> > critical path to workload performance changes to concurrent reclaims in
> > response to memcg_high events due to allocation and zswap usage. We see
> > a lesser increase in swapout activity (as compared to the
> > swap-constrained experiments in v4), and compress latency seems to
> > become the bottleneck. Each individual process's throughput/sys time
> > degrades mainly as a function of compress latency. Anyway, these were
> > some of my learnings from these experiments.
> > Please do let me know if there are other insights/analysis I could be
> > missing.
> >
> > >
> > > Is it possible that with SSD swap, because we wait for IO during
> > > reclaim, this gives a chance for other processes to allocate and free
> > > the memory they need. While with zswap because everything is
> > > synchronous, all processes are trying to allocate their memory at the
> > > same time resulting in higher reclaim rates?
> > >
> > > IOW, maybe with zswap all the processes try to allocate their memory
> > > at the same time, so the total amount of memory needed at any given
> > > instance is much higher than memory.high, so we keep producing
> > > memory_high events and reclaiming. If 70 processes all require 1G at
> > > the same time, then we need 70G of memory at once, we will keep
> > > thrashing pages in/out of zswap.
> > >
> > > While with SSD swap, due to the waits imposed by IO, the allocations
> > > are more spread out and more serialized, and the amount of memory
> > > needed at any given instance is lower; resulting in less reclaim
> > > activity and ultimately faster overall execution?
> >
> > This is a very interesting hypothesis, that is along the lines of the
> > "slower compressor" essentially causing allocation stalls (and
> > buffering us from the swap slots unavailability effect) observation I
> > gathered from the 4G SSD experiments. I think this is a possibility.
> >
> > >
> > > Could you please describe what the processes are doing? Are they
> > > allocating memory and holding on to it, or immediately freeing it?
> >
> > I have been using the vm-scalability usemem workload for these
> > experiments. Thanks Ying for suggesting I use this workload!
> >
> > I am running usemem with these config options:
> > usemem --init-time -w -O -n 70 1g.
> > This forks 70 processes, each of which does the following:
> >
> > 1) Allocates 1G mmap virtual memory with MAP_ANONYMOUS, read/write
> >    permissions.
> > 2) Steps through and accesses each 8 bytes chunk of memory in the
> >    mmap-ed region, and:
> >    2.a) Writes the index of that chunk to the (unsigned long *) memory
> >         at that index.
> > 3) Generates statistics on throughput.
> >
> > There is an "munmap()" after step (2.a) that I have commented out
> > because I wanted to see how much cold memory resides in the zswap
> > zpool after the workload exits. Interestingly, this was 0 for 64K
> > mTHP, but of the order of several hundreds of MB for 2M THP.
>
> Does the process exit immediately after step (3)? The memory will be
> unmapped and freed once the process exits anyway, so removing an unmap
> that immediately precedes the process exiting should have no effect.

Yes, you're right.

>
> I wonder how this changes if the processes sleep and keep the memory
> mapped for a while, to force the situation where all the memory is
> needed at the same time on SSD as well as zswap. This could make the
> playing field more even and force the same thrashing to happen on SSD
> for a more fair comparison.

Good point. I believe I saw an option in usemem that could facilitate
this. I will investigate. (For reference, I have also appended a rough
sketch of the per-process access pattern after my signature.)

>
> It's not a fix, if very fast reclaim with zswap ends up causing more
> problems perhaps we need to tweak the throttling of memory.high or
> something.

Sure, that is a possibility. Proactive reclaim might mitigate this,
though, in which case very fast reclaim with zswap might help.

Thanks,
Kanchana
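
P.S.: In case it helps make the workload concrete, here is a rough C
sketch of the per-process access pattern from steps 1) through 3) above.
This is not the actual vm-scalability usemem source; NR_PROCS, BYTES,
worker(), the gettimeofday() timing, and the placement of the
commented-out munmap()/sleep() are my own illustrative assumptions about
what each forked process ends up doing with the options shown above.

    /*
     * Hypothetical sketch only -- not the usemem source. Each forked
     * process maps 1G of anonymous memory, writes every 8-byte word,
     * and reports its own throughput.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NR_PROCS  70           /* "-n 70" above */
    #define BYTES     (1UL << 30)  /* "1g" per process */

    static void worker(void)
    {
            unsigned long nr_words = BYTES / sizeof(unsigned long);
            struct timeval t0, t1;
            unsigned long *p, i;
            double secs;

            /* Step 1: 1G anonymous read/write mapping. */
            p = mmap(NULL, BYTES, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    exit(1);
            }

            /* Steps 2 and 2.a: write each word's index into that word. */
            gettimeofday(&t0, NULL);
            for (i = 0; i < nr_words; i++)
                    p[i] = i;
            gettimeofday(&t1, NULL);

            /*
             * munmap(p, BYTES) would go here; it is commented out in the
             * runs described above so the cold memory stays in the zswap
             * zpool until the process exits. A sleep() before exiting
             * would keep all 70 x 1G mapped at the same time, which is
             * roughly what the "keep the memory mapped for a while"
             * suggestion would exercise.
             */

            /* Step 3: per-process throughput. */
            secs = (t1.tv_sec - t0.tv_sec) +
                   (t1.tv_usec - t0.tv_usec) / 1e6;
            printf("pid %d: %.1f MB/s\n", getpid(), (BYTES >> 20) / secs);
    }

    int main(void)
    {
            int i;

            for (i = 0; i < NR_PROCS; i++) {
                    if (fork() == 0) {
                            worker();
                            exit(0);
                    }
            }
            for (i = 0; i < NR_PROCS; i++)
                    wait(NULL);
            return 0;
    }

Again, this is just my mental model of what usemem does with those
options; the authoritative version is the usemem.c in the vm-scalability
repo.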