On 28/10/2024 21:40, Barry Song wrote:
> On Tue, Oct 29, 2024 at 5:24 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>
>> On 28/10/2024 21:15, Barry Song wrote:
>>> On Tue, Oct 29, 2024 at 4:51 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>
>>>> On 28/10/2024 20:42, Barry Song wrote:
>>>>> On Tue, Oct 29, 2024 at 4:00 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On 28/10/2024 19:54, Barry Song wrote:
>>>>>>> On Tue, Oct 29, 2024 at 1:20 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On 28/10/2024 17:08, Yosry Ahmed wrote:
>>>>>>>>> On Mon, Oct 28, 2024 at 10:00 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> On 28/10/2024 16:33, Nhat Pham wrote:
>>>>>>>>>>> On Mon, Oct 28, 2024 at 5:23 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I wonder if, instead of having counters, it might be better to keep
>>>>>>>>>>>> track of the number of zeropages currently stored in zeromap, similar
>>>>>>>>>>>> to how zswap_same_filled_pages did it. It would be more complicated
>>>>>>>>>>>> than this patch, but would give more insight into the current state
>>>>>>>>>>>> of the system.
>>>>>>>>>>>>
>>>>>>>>>>>> Joshua (in CC) was going to have a look at that.
>>>>>>>>>>>
>>>>>>>>>>> I don't think one can substitute for the other.
>>>>>>>>>>
>>>>>>>>>> Yes, agreed: they have separate uses and provide different information,
>>>>>>>>>> but it may be wasteful to have both types of counters. Being simple
>>>>>>>>>> counters, they probably don't consume too many resources, but I think
>>>>>>>>>> we should still think about it.
>>>>>>>>>
>>>>>>>>> Not for or against here, but I would say that statement is debatable
>>>>>>>>> at best for memcg stats :)
>>>>>>>>>
>>>>>>>>> Each new counter consumes 2 longs per-memcg per-CPU (see
>>>>>>>>> memcg_vmstats_percpu), about 16 bytes, which is not a lot, but it can
>>>>>>>>> quickly add up with a large number of CPUs/memcgs/stats.
>>>>>>>>>
>>>>>>>>> Also, when flushing the stats we iterate all of them to propagate
>>>>>>>>> updates from per-CPU counters. This is already a slow path, so adding
>>>>>>>>> one stat is not a big deal, but again, because we iterate all stats on
>>>>>>>>> multiple CPUs (and sometimes on each node as well), the overall flush
>>>>>>>>> latency sometimes becomes a concern.
>>>>>>>>>
>>>>>>>>> All of that is not to say we shouldn't add more memcg stats, but we
>>>>>>>>> have to be mindful of the resources.
>>>>>>>>
>>>>>>>> Yes, agreed! Plus the cost of incrementing similar counters (which, of
>>>>>>>> course, is also not much).
>>>>>>>>
>>>>>>>> I'm not trying to block this patch in any way. I just think this is a
>>>>>>>> good point to discuss: are we OK with both types of counters, and if
>>>>>>>> that is too wasteful, which one should we keep?
>>>>>>>
>>>>>>> Hi Usama,
>>>>>>> my point is that, with all three of the counters below:
>>>>>>> 1. PSWPIN/PSWPOUT
>>>>>>> 2. ZSWPIN/ZSWPOUT
>>>>>>> 3. SWAPIN_SKIP/SWAPOUT_SKIP (or ZEROSWPIN/ZEROSWPOUT, whatever)
>>>>>>>
>>>>>>> shouldn't we have been able to determine the portion of zeromap
>>>>>>> swap indirectly?
>>>>>>
>>>>>> Hmm, I might be wrong, but I would have thought not.
>>>>>>
>>>>>> What if you swap out a zero folio, but then discard it?
>>>>>> zeromap_swpout would be incremented, but zeromap_swpin would not.
>>>>>
>>>>> I understand. It looks like we have two issues to tackle:
>>>>> 1. We shouldn't let zeromap swap in or out anything that vanishes into
>>>>>    a black hole.
>>>>> 2. We want to find out how much I/O/memory has been saved due to
>>>>>    zeromap so far.
>>>>>
>>>>> From my perspective, issue 1 requires a "fix", while issue 2 is more
>>>>> of an optimization.
>>>>
>>>> Hmm, I don't understand why point 1 would be an issue.
>>>>
>>>> If it's discarded, that's fine as far as I can see.
>>>
>>> It is fine to you, and probably to me, who know zeromap well :-) but any
>>> userspace code like the below might be entirely confused:
>>>
>>>     p = malloc(1G);
>>>     write p to 0; /* or write part of p to 0 */
>>>     madv_pageout(p, 1G);
>>>     read p to swap it back in;
>>>
>>> The entire procedure used to involve 1GB of swap-out and 1GB of swap-in
>>> by any means. Now it records 0 swaps.
>>>
>>> I don't expect userspace to be as smart as you :-)
>>>
>> Ah, I completely agree: we need to account for it in some metric. I
>> probably misunderstood when you said "We shouldn't let zeromap swap in or
>> out anything that vanishes into a black hole" as meaning we should not
>> apply the zeromap optimization in those cases. What I guess you meant is
>> that we need to account for it in some metric.
>>
>>>> As a reference, memory.stat.zswapped != memory.stat.zswapout -
>>>> memory.stat.zswapin. That is because zswapped accounts for swapped-out
>>>> anon memory being freed, MADV_FREE, shmem truncation, etc. (as Yosry
>>>> said about zeromap), but zswapout and zswapin don't.
>>>
>>> I understand. However, I believe what we really need to focus on is
>>> this: if we've swapped out, for instance, 100GB in the past hour, how
>>> much of that 100GB is zero? This information can help us assess the
>>> proportion of zero data in the workload, along with the potential
>>> benefits that zeromap can provide for memory, I/O space, or read/write
>>> operations. Additionally, having the second count can enhance accuracy
>>> when considering MADV_DONTNEED, MADV_FREE, truncation, and so on.
>>>
>> Yes, completely agree!
>>
>> I think we can look into adding all three metrics: zeromap_swapped,
>> zeromap_swpout, and zeromap_swpin (or whatever names work).
>
> It's great to reach an agreement. Let me work on some patches for it.
> Thanks!
>
> By the way, I recently had an idea: if we can conduct the zeromap check
> earlier - for example, before allocating swap slots and pageout() - could
> we completely eliminate swap slot occupation and allocation/release for
> zeromap data? For example, we could use a special swap entry value in the
> PTE to indicate zero content and directly fill the page with zeros when
> swapping back. We've observed that swap slot allocation and freeing can
> consume a lot of CPU and slow down functions like zap_pte_range and
> swap-in. If we can entirely skip these steps, it could improve
> performance. However, I'm uncertain about the benefits we would gain if
> we only have 1-2% zeromap data.

If I remember correctly, this was one of the ideas floated around in the
initial version of the zeromap series, but it was evaluated as a lot more
complicated to implement than the current zeromap code. I think it's
definitely worth looking into, though!

> I'm just putting this idea out there to see if you're interested in
> moving forward with it. :-)
>
>>>>> I consider issue 1 to be more critical because, after observing a
>>>>> phone running for some time, I've been able to roughly estimate the
>>>>> portion zeromap can help save using only PSWPOUT, ZSWPOUT, and
>>>>> SWAPOUT_SKIP, even without a SWPIN counter. However, I agree that
>>>>> issue 2 still holds significant value as a separate patch.
>
> Thanks
> Barry