On 28/10/2024 21:40, Barry Song wrote:
> On Tue, Oct 29, 2024 at 5:24 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>
>> On 28/10/2024 21:15, Barry Song wrote:
>>> On Tue, Oct 29, 2024 at 4:51 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>
>>>> On 28/10/2024 20:42, Barry Song wrote:
>>>>> On Tue, Oct 29, 2024 at 4:00 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On 28/10/2024 19:54, Barry Song wrote:
>>>>>>> On Tue, Oct 29, 2024 at 1:20 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On 28/10/2024 17:08, Yosry Ahmed wrote:
>>>>>>>>> On Mon, Oct 28, 2024 at 10:00 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> On 28/10/2024 16:33, Nhat Pham wrote:
>>>>>>>>>>> On Mon, Oct 28, 2024 at 5:23 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I wonder if, instead of having counters, it might be better to keep
>>>>>>>>>>>> track of the number of zeropages currently stored in zeromap, similar
>>>>>>>>>>>> to how zswap_same_filled_pages did it. It would be more complicated
>>>>>>>>>>>> than this patch, but would give more insight into the current state
>>>>>>>>>>>> of the system.
>>>>>>>>>>>>
>>>>>>>>>>>> Joshua (in CC) was going to have a look at that.
>>>>>>>>>>>
>>>>>>>>>>> I don't think one can substitute for the other.
>>>>>>>>>>
>>>>>>>>>> Yes, agreed: they have separate uses and provide different information,
>>>>>>>>>> but it may be wasteful to have both types of counters. Being simple
>>>>>>>>>> counters, they probably don't consume too many resources, but I think
>>>>>>>>>> we should still think about it.
>>>>>>>>>
>>>>>>>>> Not for or against here, but I would say that statement is debatable
>>>>>>>>> at best for memcg stats :)
>>>>>>>>>
>>>>>>>>> Each new counter consumes 2 longs per-memcg per-CPU (see
>>>>>>>>> memcg_vmstats_percpu), about 16 bytes, which is not a lot, but it can
>>>>>>>>> quickly add up with a large number of CPUs/memcgs/stats.
>>>>>>>>>
>>>>>>>>> Also, when flushing the stats we iterate all of them to propagate
>>>>>>>>> updates from per-CPU counters. This is already a slow path, so adding
>>>>>>>>> one stat is not a big deal, but again, because we iterate all stats on
>>>>>>>>> multiple CPUs (and sometimes on each node as well), the overall flush
>>>>>>>>> latency sometimes becomes a concern.
>>>>>>>>>
>>>>>>>>> All of that is not to say we shouldn't add more memcg stats, but we
>>>>>>>>> have to be mindful of the resources.
>>>>>>>>
>>>>>>>> Yes, agreed! Plus the cost of incrementing similar counters (which, of
>>>>>>>> course, is also not much).
>>>>>>>>
>>>>>>>> I'm not trying to block this patch in any way. I just think this is a
>>>>>>>> good point to discuss: are we OK with both types of counters, and if
>>>>>>>> that is too wasteful, which one should we keep?
>>>>>>>
>>>>>>> Hi Usama,
>>>>>>> my point is that, with all three of the counters below:
>>>>>>> 1. PSWPIN/PSWPOUT
>>>>>>> 2. ZSWPIN/ZSWPOUT
>>>>>>> 3. SWAPIN_SKIP/SWAPOUT_SKIP (or ZEROSWPIN/ZEROSWPOUT, whatever)
>>>>>>>
>>>>>>> shouldn't we have been able to determine the portion of zeromap
>>>>>>> swap indirectly?
>>>>>>
>>>>>> Hmm, I might be wrong, but I would have thought not.
>>>>>>
>>>>>> What if you swap out a zero folio, but then discard it?
>>>>>> zeromap_swpout would be incremented, but zeromap_swpin would not.
>>>>>
>>>>> I understand. It looks like we have two issues to tackle:
>>>>> 1. We shouldn't let zeromap swap in or out anything that vanishes into
>>>>>    a black hole.
>>>>> 2. We want to find out how much I/O/memory has been saved due to
>>>>>    zeromap so far.
>>>>>
>>>>> From my perspective, issue 1 requires a "fix", while issue 2 is more
>>>>> of an optimization.
>>>>
>>>> Hmm, I don't understand why point 1 would be an issue.
>>>>
>>>> If it's discarded, that's fine as far as I can see.
>>>
>>> It is fine to you, and probably to me, who know zeromap well :-) but any
>>> userspace code like the below might be entirely confused:
>>>
>>>     p = malloc(1G);
>>>     write p to 0; /* or write part of p to 0 */
>>>     madv_pageout(p, 1G);
>>>     read p to swap it back in;
>>>
>>> The entire procedure used to involve 1GB of swap-out and 1GB of swap-in
>>> by any means. Now it records 0 swaps.
>>>
>>> I don't expect userspace to be as smart as you :-)
>>>
>> Ah, I completely agree: we need to account for it in some metric. I
>> probably misunderstood when you said "We shouldn't let zeromap swap in or
>> out anything that vanishes into a black hole" as meaning we should not
>> apply the zeromap optimization in those cases. What I guess you meant is
>> that we need to account for it in some metric.
>>
>>>> As a reference, memory.stat.zswapped != memory.stat.zswapout -
>>>> memory.stat.zswapin. That is because zswapped accounts for swapped-out
>>>> anon memory being freed, MADV_FREE, shmem truncation, etc. (as Yosry
>>>> said about zeromap), but zswapout and zswapin don't.
>>>
>>> I understand. However, I believe what we really need to focus on is
>>> this: if we've swapped out, for instance, 100GB in the past hour, how
>>> much of that 100GB is zero? This information can help us assess the
>>> proportion of zero data in the workload, along with the potential
>>> benefits that zeromap can provide for memory, I/O space, or read/write
>>> operations. Additionally, having the second count can enhance accuracy
>>> when considering MADV_DONTNEED, MADV_FREE, truncation, and so on.
>>>
>> Yes, completely agree!
>>
>> I think we can look into adding all three metrics: zeromap_swapped,
>> zeromap_swpout, and zeromap_swpin (or whatever names work).
>
> It's great to reach an agreement. Let me work on some patches for it.
> Thanks!
>
> By the way, I recently had an idea: if we can conduct the zeromap check
> earlier - for example, before allocating swap slots and pageout() - could
> we completely eliminate swap slot occupation and allocation/release for
> zeromap data? For example, we could use a special swap entry value in the
> PTE to indicate zero content and directly fill the page with zeros when
> swapping back. We've observed that swap slot allocation and freeing can
> consume a lot of CPU and slow down functions like zap_pte_range and
> swap-in. If we can entirely skip these steps, it could improve
> performance. However, I'm uncertain about the benefits we would gain if
> we only have 1-2% zeromap data.

If I remember correctly, this was one of the ideas floated around in the
initial version of the zeromap series, but it was evaluated as a lot more
complicated to implement than the current zeromap code. I think it's
definitely worth looking into, though!

> I'm just putting this idea out there to see if you're interested in
> moving forward with it. :-)
>
>>>>> I consider issue 1 to be more critical because, after observing a
>>>>> phone running for some time, I've been able to roughly estimate the
>>>>> portion zeromap can help save using only PSWPOUT, ZSWPOUT, and
>>>>> SWAPOUT_SKIP, even without a SWPIN counter. However, I agree that
>>>>> issue 2 still holds significant value as a separate patch.
>
> Thanks
> Barry