Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal

On 27/09/2023 16:32, David Hildenbrand wrote:
> On 27.09.23 09:23, Ryan Roberts wrote:
>> On 26/09/2023 19:31, David Hildenbrand wrote:
>>> On 25.09.23 10:51, Ryan Roberts wrote:
>>>> On 23/09/2023 01:33, John Hubbard wrote:
>>>>> On 9/22/23 08:48, Ryan Roberts wrote:
>>>>> ...
>>>>>> I never had any feedback on the below; I'm not sure if that means everyone is
>>>>>> happy or that nobody read it??
>>>>>
>>>>> One can never really know: zero or more people read it, and of those, no
>>>>> one hated it enough to send out a quick NAK. So that's a *possible*,
>>>>> lukewarm endorsement of sorts. Success! :)
>>>>
>>>> You really know how to fill a guy with confidence! ;-)
>>>>
>>>>>
>>>>> ...
>>>>>
>>>>>> BUT I've had yet another idea on the controls front, which would enable
>>>>>> exposing
>>>>>> this to user space as an extension to transparent_hugepage, while
>>>>>> continuing to
>>>>>> support THP as is and also be able to control THP and ALF (anon large folio)
>>>>>
>>>>> The new ALF / ANON_LARGE_FOLIO naming looks good to me. The grep aspect
>>>>> is a nice touch.
>>>>
>>>> Well if we go the route of the newest proposal, then I guess the naming is less
>>>> important, because it all attaches to transparent_hugepage.
>>>
>>> I agree that ALF is better. But having something under "THP" that is not THP
>>> and not accounted as THP ... I don't quite like it (although I did like it
>>> before we discussed that approach in the past).
>>
>> I know we discussed and concluded against putting it under THP in the past, but
>> I think that decision was driven by not having any proposal that would allow
>> us to put it under THP without breaking the expectations of existing (PMD-sized)
>> THP users, or not being able to control use of the lower orders and PMD-order.
> 
> Not only that. It was also because we didn't want to confuse users/devs that
> assume that THP == PMD-sized.
> 
> I'll CC Hugh, I recall he had an opinion on that (I recall some comments about
> cleanly separating both features towards the user).

Were those comments made during the first meeting? I don't recall them, but will
go back and watch the video.

> 
>> Personally, I think my latest proposal is a way to solve that problem, and in
>> that case exposing it as an extension to THP is neater:
>>
>>   - all existing THP controls work as they did before
>>   - new anon_orders and anon_always_mask files allow opt-in to
>>     smaller-than-PMD orders
> 
> As "enable" controls anon only (that's correct, right?), maybe these should also
> simply be called "orders" and "always_mask". shmem could get their own set, like
> "shmem_enable".

Yes, could do it that way. I thought that since the "shmem_" prefix was used when
shmem support was introduced, it would be clearer to prefix the new anon controls
too. Happy to remove it though.

> 
>>   - All existing counters remain unchanged, and continue to count PMD-mapped THP
>>     only:
>>        - /proc/meminfo:AnonHugePages
>>        - /sys/devices/system/node/nodeX/meminfo:AnonHugePages
>>        - /proc/vmstat:nr_anon_transparent_hugepages
>>        - /proc/<pid>/smaps[_rollup]:AnonHugePages
>>        - memory.stat(v1):rss_huge
>>        - memory.stat(v2):anon_thp
>>   - New counters introduced to count PTE-mapped THP/large folios:
>>        - /proc/meminfo:AnonHugePteMap
>>        - /sys/devices/system/node/nodeX/meminfo:AnonHugePteMap
>>        - /proc/vmstat:nr_anon_thp_pte
>>        - /proc/<pid>/smaps[_rollup]:AnonHugePteMap
>>        - memory.stat(v1):anon_thp_pte
>>        - memory.stat(v2):anon_thp_pte
>>   - It's a lot less code (I have an implementation for both approaches)
>>
>> Admittedly, I haven't spent too much time thinking about the other thp counters
>> in vmstat yet (e.g. thp_fault_alloc, thp_fault_fallback, etc). The proposal is
>> that, for now, they would continue to be PMD-order only. But I think you could
>> probably hook those up to the PTE-mapped ones as well, instead of duplicating
>> all the counters.
>>
>> As Kirill mentioned, PTE-mapped THP is already a thing, so this approach just
>> formalises it.
> 
> Not quite. PTE-mapped THP were just a side-effect of the transparency handling.
> We never allocated and populated PTE-mapped PMD-sized THP on allocation. So I
> don't immediately see the connection between both for this case.

I'm just making the point that when they become PTE-mapped, we don't stop
calling them THP. I accept that it's not exactly the same though.

> 
> Would you account a PTE-mapped (PMD-sized) THP as anon_thp or anon_thp_pte? What
> if it's mapped via PTEs and PMDs? I don't see how that formalises that case for
> the existing PMD-sized THP.

I account PTE-mapped THP as anon_thp_pte. And if the same folio is mapped both
ways, I account it in both counters. I'm not claiming that anon_thp +
anon_thp_pte equals the total amount of allocated THP. anon_thp_pte is intended
to help debugging; you can use it to see what percentage of PTE-mapped memory is
THP.
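
As a rough example of the kind of debugging I have in mind (AnonHugePteMap is
one of the counters proposed above, so this only does anything useful with the
patches applied):

  # Compare the proposed PTE-mapped THP counter against total anonymous memory.
  awk '/^AnonPages:/      { anon = $2 }
       /^AnonHugePteMap:/ { pte = $2 }
       END { if (anon) printf "PTE-mapped THP: %.1f%% of anon\n", 100*pte/anon }
      ' /proc/meminfo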

> 
>>
>> I also think the argument that "huge" means PMD-size is a bit weak, given that
>> THP supports PUD-size today for file mappings, and in the context of hugetlb,
>> "huge" can mean contpte, pmd, contpmd, pud, etc.
> 
> I made similar statements in the past but was convinced otherwise :)
> 
>>
>> I'll have the patch set ready to post by Friday. How about I post it, then we
>> can continue the conversation in the context of the actual code? If the
>> consensus is that this is not the way to do it, then I'll post the large_folio
>> version instead?
> 
> No strong opinion from my side; after all the previous discussions I considered
> a "fresh start" without the THP implication/terminology cleaner [which I think
> was one of the outcomes of those discussions].
> 

My concern is that the "fresh start" is not as simple as it appears. I've come
to the conclusion that if we have a new interface, then it should really be a
strict superset of THP to make it extensible in the future. But that opens
questions about how you configure PMD-sized allocations when the two interfaces
disagree. For "enabled" it's fairly straightforward; you can do a logical OR
(sketched below). But it's less clear how to handle disagreement over defrag.
And then you have huge_zero_page and khugepaged etc, which might just stay with
THP. But eventually we will probably want to do async collapse for smaller-order
folios too, and at that point you have to duplicate all those controls... So I
concluded that it is actually cleaner to just bolt a small-order extension onto
THP. I've updated all the docs, and that was pretty simple to do, which usually
suggests the extension is purely additive and shouldn't be confusing.
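
To make the conflict concrete, here's the sort of thing I mean. The second sysfs
path is purely hypothetical (a stand-in for a separate "fresh start" interface);
I'm only illustrating the disagreement, not proposing these names:

  # The THP interface (exists today) says never:
  echo never > /sys/kernel/mm/transparent_hugepage/enabled

  # A hypothetical separate large-folio interface says always for PMD-order:
  echo always > /sys/kernel/mm/anon_large_folio/enabled

  # For "enabled" you could define the effective result as the logical OR (so
  # PMD-sized allocations end up enabled), but there is no equally obvious rule
  # for "defrag", huge_zero_page, khugepaged, and friends.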

Take a look at the patches, then make a judgement ;-)




