> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Sent: Wednesday, August 28, 2024 3:34 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: Nhat Pham <nphamcs@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx; ryan.roberts@xxxxxxx;
> Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx;
> akpm@xxxxxxxxxxxxxxxxxxxx; Zou, Nanhai <nanhai.zou@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
> <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> On Wed, Aug 28, 2024 at 11:50 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
> >
> > Hi Yosry,
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > > Sent: Wednesday, August 28, 2024 12:44 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > > Cc: Nhat Pham <nphamcs@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> > > linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx; ryan.roberts@xxxxxxx;
> > > Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx;
> > > akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>;
> > > Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
> > > <vinodh.gopal@xxxxxxxxx>
> > > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> > >
> > > [..]
> > > >
> > > > This shows that in all cases, reclaim_high() is called only from the
> > > > return path to user mode after handling a page-fault.
> > >
> > > I am sorry I haven't been keeping up with this thread, I don't have a
> > > lot of capacity right now.
> > >
> > > If my understanding is correct, the summary of the problem we are
> > > observing here is that with high concurrency (70 processes), we
> > > observe worse system time, worse throughput, and higher memory_high
> > > events with zswap than SSD swap. This is true (with varying degrees)
> > > for 4K or mTHP, and with or without charging zswap compressed memory.
> > >
> > > Did I get that right?
> >
> > Thanks for your review and comments! Yes, this is correct.
> >
> > >
> > > I saw you also mentioned that reclaim latency is directly correlated
> > > to higher memory_high events.
> >
> > That was my observation based on the swap-constrained experiments with
> > 4G SSD. With a faster compressor, we allow allocations to proceed
> > quickly, and if the pages are not being faulted in, we need more swap
> > slots. This increases the probability of running out of swap slots with
> > the 4G SSD backing device, which, as the data in v4 shows, causes
> > memcg_swap_fail events, that drive folios to be resident in memory
> > (triggering memcg_high breaches as allocations proceed even without
> > zswap cgroup charging).
> >
> > Things change when the experiments are run in a situation where there
> > is abundant swap space and when the default behavior of zswap
> > compressed data being charged to the cgroup is enabled, as in the data
> > with 176GiB ZRAM as ZSWAP's backing swapfile posted in v5. Now, the
> > critical path to workload performance changes to concurrent reclaims in
> > response to memcg_high events due to allocation and zswap usage. We see
> > a lesser increase in swapout activity (as compared to the
> > swap-constrained experiments in v4), and compress latency seems to
> > become the bottleneck. Each individual process's throughput/sys time
> > degrades mainly as a function of compress latency. Anyway, these were
> > some of my learnings from these experiments.
> > Please do let me know if there are other insights/analysis I could be
> > missing.
> >
> > >
> > > Is it possible that with SSD swap, because we wait for IO during
> > > reclaim, this gives a chance for other processes to allocate and free
> > > the memory they need. While with zswap because everything is
> > > synchronous, all processes are trying to allocate their memory at the
> > > same time resulting in higher reclaim rates?
> > >
> > > IOW, maybe with zswap all the processes try to allocate their memory
> > > at the same time, so the total amount of memory needed at any given
> > > instance is much higher than memory.high, so we keep producing
> > > memory_high events and reclaiming. If 70 processes all require 1G at
> > > the same time, then we need 70G of memory at once, we will keep
> > > thrashing pages in/out of zswap.
> > >
> > > While with SSD swap, due to the waits imposed by IO, the allocations
> > > are more spread out and more serialized, and the amount of memory
> > > needed at any given instance is lower; resulting in less reclaim
> > > activity and ultimately faster overall execution?
> >
> > This is a very interesting hypothesis, that is along the lines of the
> > "slower compressor" essentially causing allocation stalls (and
> > buffering us from the swap slots unavailability effect) observation I
> > gathered from the 4G SSD experiments. I think this is a possibility.
> >
> > >
> > > Could you please describe what the processes are doing? Are they
> > > allocating memory and holding on to it, or immediately freeing it?
> >
> > I have been using the vm-scalability usemem workload for these
> > experiments. Thanks Ying for suggesting I use this workload!
> >
> > I am running usemem with these config options:
> > usemem --init-time -w -O -n 70 1g.
> > This forks 70 processes, each of which does the following:
> >
> > 1) Allocates 1G mmap virtual memory with MAP_ANONYMOUS, read/write
> >    permissions.
> > 2) Steps through and accesses each 8 bytes chunk of memory in the
> >    mmap-ed region, and:
> >    2.a) Writes the index of that chunk to the (unsigned long *) memory
> >         at that index.
> > 3) Generates statistics on throughput.
> >
> > There is an "munmap()" after step (2.a) that I have commented out
> > because I wanted to see how much cold memory resides in the zswap
> > zpool after the workload exits. Interestingly, this was 0 for 64K
> > mTHP, but of the order of several hundreds of MB for 2M THP.
>
> Does the process exit immediately after step (3)? The memory will be
> unmapped and freed once the process exits anyway, so removing an unmap
> that immediately precedes the process exiting should have no effect.

Yes, you're right.

>
> I wonder how this changes if the processes sleep and keep the memory
> mapped for a while, to force the situation where all the memory is
> needed at the same time on SSD as well as zswap. This could make the
> playing field more even and force the same thrashing to happen on SSD
> for a more fair comparison.

Good point. I believe I saw an option in usemem that could facilitate
this. I will investigate. (For reference, I have also appended a rough
sketch of the per-process access pattern after my signature.)

>
> It's not a fix, if very fast reclaim with zswap ends up causing more
> problems perhaps we need to tweak the throttling of memory.high or
> something.

Sure, that is a possibility. Proactive reclaim might mitigate this,
though, in which case very fast reclaim with zswap might help.

Thanks,
Kanchana
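
P.S.: In case it helps make the workload concrete, here is a rough C
sketch of the per-process access pattern from steps 1) through 3) above.
This is not the actual vm-scalability usemem source; NR_PROCS, BYTES,
worker(), the gettimeofday() timing, and the placement of the
commented-out munmap()/sleep() are my own illustrative assumptions about
what each forked process ends up doing with the options shown above.

    /*
     * Hypothetical sketch only -- not the usemem source. Each forked
     * process maps 1G of anonymous memory, writes every 8-byte word,
     * and reports its own throughput.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NR_PROCS  70           /* "-n 70" above */
    #define BYTES     (1UL << 30)  /* "1g" per process */

    static void worker(void)
    {
            unsigned long nr_words = BYTES / sizeof(unsigned long);
            struct timeval t0, t1;
            unsigned long *p, i;
            double secs;

            /* Step 1: 1G anonymous read/write mapping. */
            p = mmap(NULL, BYTES, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    exit(1);
            }

            /* Steps 2 and 2.a: write each word's index into that word. */
            gettimeofday(&t0, NULL);
            for (i = 0; i < nr_words; i++)
                    p[i] = i;
            gettimeofday(&t1, NULL);

            /*
             * munmap(p, BYTES) would go here; it is commented out in the
             * runs described above so the cold memory stays in the zswap
             * zpool until the process exits. A sleep() before exiting
             * would keep all 70 x 1G mapped at the same time, which is
             * roughly what the "keep the memory mapped for a while"
             * suggestion would exercise.
             */

            /* Step 3: per-process throughput. */
            secs = (t1.tv_sec - t0.tv_sec) +
                   (t1.tv_usec - t0.tv_usec) / 1e6;
            printf("pid %d: %.1f MB/s\n", getpid(), (BYTES >> 20) / secs);
    }

    int main(void)
    {
            int i;

            for (i = 0; i < NR_PROCS; i++) {
                    if (fork() == 0) {
                            worker();
                            exit(0);
                    }
            }
            for (i = 0; i < NR_PROCS; i++)
                    wait(NULL);
            return 0;
    }

Again, this is just my mental model of what usemem does with those
options; the authoritative version is the usemem.c in the vm-scalability
repo.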