On Tue, Oct 29, 2024 at 5:49 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote: > > > > On 28/10/2024 21:40, Barry Song wrote: > > On Tue, Oct 29, 2024 at 5:24 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote: > >> > >> > >> > >> On 28/10/2024 21:15, Barry Song wrote: > >>> On Tue, Oct 29, 2024 at 4:51 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote: > >>>> > >>>> > >>>> > >>>> On 28/10/2024 20:42, Barry Song wrote: > >>>>> On Tue, Oct 29, 2024 at 4:00 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> On 28/10/2024 19:54, Barry Song wrote: > >>>>>>> On Tue, Oct 29, 2024 at 1:20 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On 28/10/2024 17:08, Yosry Ahmed wrote: > >>>>>>>>> On Mon, Oct 28, 2024 at 10:00 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote: > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On 28/10/2024 16:33, Nhat Pham wrote: > >>>>>>>>>>> On Mon, Oct 28, 2024 at 5:23 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> I wonder if instead of having counters, it might be better to keep track > >>>>>>>>>>>> of the number of zeropages currently stored in zeromap, similar to how > >>>>>>>>>>>> zswap_same_filled_pages did it. It will be more complicated then this > >>>>>>>>>>>> patch, but would give more insight of the current state of the system. > >>>>>>>>>>>> > >>>>>>>>>>>> Joshua (in CC) was going to have a look at that. > >>>>>>>>>>> > >>>>>>>>>>> I don't think one can substitute for the other. > >>>>>>>>>> > >>>>>>>>>> Yes agreed, they have separate uses and provide different information, but > >>>>>>>>>> maybe wasteful to have both types of counters? They are counters so maybe > >>>>>>>>>> dont consume too much resources but I think we should still think about > >>>>>>>>>> it.. > >>>>>>>>> > >>>>>>>>> Not for or against here, but I would say that statement is debatable > >>>>>>>>> at best for memcg stats :) > >>>>>>>>> > >>>>>>>>> Each new counter consumes 2 longs per-memcg per-CPU (see > >>>>>>>>> memcg_vmstats_percpu), about 16 bytes, which is not a lot but it can > >>>>>>>>> quickly add up with a large number of CPUs/memcgs/stats. > >>>>>>>>> > >>>>>>>>> Also, when flushing the stats we iterate all of them to propagate > >>>>>>>>> updates from per-CPU counters. This is already a slowpath so adding > >>>>>>>>> one stat is not a big deal, but again because we iterate all stats on > >>>>>>>>> multiple CPUs (and sometimes on each node as well), the overall flush > >>>>>>>>> latency becomes a concern sometimes. > >>>>>>>>> > >>>>>>>>> All of that is not to say we shouldn't add more memcg stats, but we > >>>>>>>>> have to be mindful of the resources. > >>>>>>>> > >>>>>>>> Yes agreed! Plus the cost of incrementing similar counters (which ofcourse is > >>>>>>>> also not much). > >>>>>>>> > >>>>>>>> Not trying to block this patch in anyway. Just think its a good point > >>>>>>>> to discuss here if we are ok with both types of counters. If its too wasteful > >>>>>>>> then which one we should have. > >>>>>>> > >>>>>>> Hi Usama, > >>>>>>> my point is that with all the below three counters: > >>>>>>> 1. PSWPIN/PSWPOUT > >>>>>>> 2. ZSWPIN/ZSWPOUT > >>>>>>> 3. SWAPIN_SKIP/SWAPOUT_SKIP or (ZEROSWPIN, ZEROSWPOUT what ever) > >>>>>>> > >>>>>>> Shouldn't we have been able to determine the portion of zeromap > >>>>>>> swap indirectly? > >>>>>>> > >>>>>> > >>>>>> Hmm, I might be wrong, but I would have thought no? > >>>>>> > >>>>>> What if you swapout a zero folio, but then discard it? > >>>>>> zeromap_swpout would be incremented, but zeromap_swapin would not. > >>>>> > >>>>> I understand. It looks like we have two issues to tackle: > >>>>> 1. We shouldn't let zeromap swap in or out anything that vanishes into > >>>>> a black hole > >>>>> 2. We want to find out how much I/O/memory has been saved due to zeromap so far > >>>>> > >>>>> From my perspective, issue 1 requires a "fix", while issue 2 is more > >>>>> of an optimization. > >>>> > >>>> Hmm I dont understand why point 1 would be an issue. > >>>> > >>>> If its discarded thats fine as far as I can see. > >>> > >>> it is fine to you and probably me who knows zeromap as well :-) but > >>> any userspace code > >>> as below might be entirely confused: > >>> > >>> p = malloc(1G); > >>> write p to 0; or write part of p to 0 > >>> madv_pageout(p, 1g) > >>> read p to swapin. > >>> > >>> The entire procedure used to involve 1GB of swap out and 1GB of swap in by any > >>> means. Now, it has recorded 0 swaps counted. > >>> > >>> I don't expect userspace is as smart as you :-) > >>> > >> Ah I completely agree, we need to account for it in some metric. I probably > >> misunderstood when you said "We shouldn't let zeromap swap in or out anything that > >> vanishes into a black hole", by we should not have the zeromap optimization for those > >> cases. What I guess you meant is we need to account for it in some metric. > >> > >>>> > >>>> As a reference, memory.stat.zswapped != memory.stat.zswapout - memory.stat.zswapin. > >>>> Because zswapped would take into account swapped out anon memory freed, MADV_FREE, > >>>> shmem truncate, etc as Yosry said about zeromap, But zswapout and zswapin dont. > >>> > >>> I understand. However, I believe what we really need to focus on is > >>> this: if we’ve > >>> swapped out, for instance, 100GB in the past hour, how much of that 100GB is > >>> zero? This information can help us assess the proportion of zero data in the > >>> workload, along with the potential benefits that zeromap can provide for memory, > >>> I/O space, or read/write operations. Additionally, having the second count > >>> can enhance accuracy when considering MADV_DONTNEED, FREE, TRUNCATE, > >>> and so on. > >>> > >> Yes completely agree! > >> > >> I think we can look into adding all three metrics, zeromap_swapped, zeromap_swpout, > >> zeromap_swpin (or whatever name works). > > > > It's great to reach an agreement. Let me work on some patches for it. > > Thanks! > > > > > By the way, I recently had an idea: if we can conduct the zeromap check > > earlier - for example - before allocating swap slots and pageout(), could > > we completely eliminate swap slot occupation and allocation/release > > for zeromap data? For example, we could use a special swap > > entry value in the PTE to indicate zero content and directly fill it with > > zeros when swapping back. We've observed that swap slot allocation and > > freeing can consume a lot of CPU and slow down functions like > > zap_pte_range and swap-in. If we can entirely skip these steps, it > > could improve performance. However, I'm uncertain about the benefits we > > would gain if we only have 1-2% zeromap data. > > If I remember correctly this was one of the ideas floated around in the > initial version of the zeromap series, but it was evaluated as a lot more > complicated to do than what the current zeromap code looks like. But I > think its definitely worth looking into! Sorry for the noise. I didn't review the initial discussion. But my feeling is that it might be valuable considering the report from Zhiguo: https://lore.kernel.org/linux-mm/20240805153639.1057-1-justinjiang@xxxxxxxx/ In fact, our recent benchmark also indicates that swap free could account for a significant portion in do_swap_page(). > > > > > I'm just putting this idea out there to see if you're interested in moving > > forward with it. :-) > > > >> > >>>> > >>>> > >>>>> > >>>>> I consider issue 1 to be more critical because, after observing a phone > >>>>> running for some time, I've been able to roughly estimate the portion > >>>>> zeromap can > >>>>> help save using only PSWPOUT, ZSWPOUT, and SWAPOUT_SKIP, even without a > >>>>> SWPIN counter. However, I agree that issue 2 still holds significant value > >>>>> as a separate patch. > >>>>> > >>> > > Thanks Barry