On Tue, Oct 29, 2024 at 5:24 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 28/10/2024 21:15, Barry Song wrote:
> > On Tue, Oct 29, 2024 at 4:51 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 28/10/2024 20:42, Barry Song wrote:
> >>> On Tue, Oct 29, 2024 at 4:00 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 28/10/2024 19:54, Barry Song wrote:
> >>>>> On Tue, Oct 29, 2024 at 1:20 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 28/10/2024 17:08, Yosry Ahmed wrote:
> >>>>>>> On Mon, Oct 28, 2024 at 10:00 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 28/10/2024 16:33, Nhat Pham wrote:
> >>>>>>>>> On Mon, Oct 28, 2024 at 5:23 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I wonder if instead of having counters, it might be better to keep track
> >>>>>>>>>> of the number of zeropages currently stored in zeromap, similar to how
> >>>>>>>>>> zswap_same_filled_pages did it. It will be more complicated than this
> >>>>>>>>>> patch, but would give more insight into the current state of the system.
> >>>>>>>>>>
> >>>>>>>>>> Joshua (in CC) was going to have a look at that.
> >>>>>>>>>
> >>>>>>>>> I don't think one can substitute for the other.
> >>>>>>>>
> >>>>>>>> Yes agreed, they have separate uses and provide different information, but
> >>>>>>>> maybe wasteful to have both types of counters? They are counters, so maybe
> >>>>>>>> they don't consume too many resources, but I think we should still think
> >>>>>>>> about it..
> >>>>>>>
> >>>>>>> Not for or against here, but I would say that statement is debatable
> >>>>>>> at best for memcg stats :)
> >>>>>>>
> >>>>>>> Each new counter consumes 2 longs per-memcg per-CPU (see
> >>>>>>> memcg_vmstats_percpu), about 16 bytes, which is not a lot but it can
> >>>>>>> quickly add up with a large number of CPUs/memcgs/stats.
> >>>>>>>
> >>>>>>> Also, when flushing the stats we iterate all of them to propagate
> >>>>>>> updates from per-CPU counters. This is already a slowpath so adding
> >>>>>>> one stat is not a big deal, but again, because we iterate all stats on
> >>>>>>> multiple CPUs (and sometimes on each node as well), the overall flush
> >>>>>>> latency becomes a concern sometimes.
> >>>>>>>
> >>>>>>> All of that is not to say we shouldn't add more memcg stats, but we
> >>>>>>> have to be mindful of the resources.
> >>>>>>
> >>>>>> Yes, agreed! Plus the cost of incrementing similar counters (which of course
> >>>>>> is also not much).
> >>>>>>
> >>>>>> Not trying to block this patch in any way. Just think it's a good point
> >>>>>> to discuss here whether we are ok with both types of counters, and if it's
> >>>>>> too wasteful, which one we should have.
> >>>>>
> >>>>> Hi Usama,
> >>>>> my point is that with all the below three counters:
> >>>>> 1. PSWPIN/PSWPOUT
> >>>>> 2. ZSWPIN/ZSWPOUT
> >>>>> 3. SWAPIN_SKIP/SWAPOUT_SKIP or (ZEROSWPIN, ZEROSWPOUT, whatever)
> >>>>>
> >>>>> shouldn't we have been able to determine the portion of zeromap
> >>>>> swap indirectly?
> >>>>>
> >>>>
> >>>> Hmm, I might be wrong, but I would have thought no?
> >>>>
> >>>> What if you swap out a zero folio, but then discard it?
> >>>> zeromap_swpout would be incremented, but zeromap_swapin would not.
> >>>
> >>> I understand. It looks like we have two issues to tackle:
> >>> 1. We shouldn't let zeromap swap in or out anything that vanishes into
> >>> a black hole.
> >>> 2. We want to find out how much I/O/memory has been saved due to zeromap so far.
> >>>
> >>> From my perspective, issue 1 requires a "fix", while issue 2 is more
> >>> of an optimization.
> >>
> >> Hmm, I don't understand why point 1 would be an issue.
> >>
> >> If it's discarded, that's fine as far as I can see.
> >
> > It is fine to you, and probably to me, as we both know zeromap well :-),
> > but any userspace code
> > as below might be entirely confused:
> >
> > p = malloc(1G);
> > write p to 0; or write part of p to 0
> > madv_pageout(p, 1g)
> > read p to swapin.
> >
> > The entire procedure used to involve 1GB of swap-out and 1GB of swap-in by any
> > means. Now, it records 0 swaps.
> >
> > I don't expect userspace to be as smart as you :-)
>
> Ah, I completely agree, we need to account for it in some metric. I probably
> misunderstood when you said "We shouldn't let zeromap swap in or out anything that
> vanishes into a black hole" as meaning we should not have the zeromap optimization
> for those cases. What I guess you meant is that we need to account for it in some
> metric.
>
> >>
> >> As a reference, memory.stat.zswapped != memory.stat.zswapout - memory.stat.zswapin,
> >> because zswapped takes into account swapped-out anon memory that was freed,
> >> MADV_FREE, shmem truncate, etc., as Yosry said about zeromap, but zswapout and
> >> zswapin don't.
> >
> > I understand. However, I believe what we really need to focus on is
> > this: if we've
> > swapped out, for instance, 100GB in the past hour, how much of that 100GB is
> > zero? This information can help us assess the proportion of zero data in the
> > workload, along with the potential benefits that zeromap can provide for memory,
> > I/O space, or read/write operations. Additionally, having the second count
> > can enhance accuracy when considering MADV_DONTNEED, FREE, TRUNCATE,
> > and so on.
>
> Yes, completely agree!
>
> I think we can look into adding all three metrics: zeromap_swapped, zeromap_swpout,
> zeromap_swpin (or whatever name works).

It's great to reach an agreement. Let me work on some patches for it.

By the way, I recently had an idea: if we can do the zeromap check earlier,
for example before allocating swap slots and before pageout(), could we
completely eliminate swap slot occupation and allocation/release for zeromap
data? For example, we could use a special swap entry value in the PTE to
indicate zero content and directly fill the page with zeros when swapping
back in. We've observed that swap slot allocation and freeing can consume a
lot of CPU and slow down functions like zap_pte_range() and swap-in. If we
can entirely skip these steps, it could improve performance. However, I'm
uncertain about the benefit we would gain if we only have 1-2% zeromap data.
I'm just putting this idea out there to see if you're interested in moving
forward with it. :-)

>
> >>
> >>
> >>>
> >>> I consider issue 1 to be more critical because, after observing a phone
> >>> running for some time, I've been able to roughly estimate the portion
> >>> zeromap can help save using only PSWPOUT, ZSWPOUT, and SWAPOUT_SKIP,
> >>> even without a SWPIN counter. However, I agree that issue 2 still holds
> >>> significant value as a separate patch.
> >>>

Thanks
Barry
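For anyone wanting to reproduce the counter behaviour discussed above, a
minimal, self-contained userspace sketch of the sequence Barry outlines
(p = malloc(1G); write zeros; madv_pageout; read back) might look roughly
like the demo below. It is only an illustration, not part of any patch in
this thread, and it assumes a kernel with MADV_PAGEOUT support, swap
enabled, and a 4KB page size for the touch stride.

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21			/* fallback for older libc headers */
#endif

int main(void)
{
	size_t len = 1UL << 30;		/* 1GB */
	volatile char sink = 0;
	char *p;

	/* anonymous mapping; writing zeros allocates real zero-filled pages */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	for (size_t i = 0; i < len; i += 4096)
		p[i] = 0;		/* "write p to 0" */

	/* "madv_pageout(p, 1g)": push the whole range out to swap */
	madvise(p, len, MADV_PAGEOUT);

	/* "read p to swapin": fault everything back in */
	for (size_t i = 0; i < len; i += 4096)
		sink += p[i];

	munmap(p, len);
	return 0;
}

Without zeromap, the above shows up as roughly 1GB of PSWPOUT/ZSWPOUT and
1GB of PSWPIN/ZSWPIN in /proc/vmstat; with zeromap and no dedicated
counters, almost none of that traffic is visible, which is the userspace
confusion discussed in the thread.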