Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Not only that. It was also because we didn't want to confuse users/devs that
assume that THP == PMD-sized.

I'll CC Hugh, I recall he had an opinion on that (I recall some comments about
cleanly separating both features towards the user).

Were those comments made during the first meeting? I don't recall them, but will
go back and watch the video.

Unless I am daydreaming, Willy made such comments during a THP cabal meeting, and Hugh somewhere on the list in other context. Maybe they changed their mind or I am making things up (after 4 days of fever dreams I don't know what's real anymore :D ).

The biggest concern was that huge really implies PMD (and maybe later PUD) -- and could end up confusing users, stats etc. Maybe it can be handled, I'll have to take a look at your proposal.

I was advocating just calling these things THP right from the start (and was using the arguments you are using in this mail :) ), but understood the concerns.

Apparently, freebsd wants to call these things "Medium-sized superpages" [1]. Of course, superpages are just huge pages, but everybody has to invent a new term for the same thing [I did not check who had it first ;) ]. Of course, under Windows static (like hugetlb) huge pages are called large-pages.

[1] https://www.freebsd.org/status/report-2022-04-2022-06/superpages/



Personally I think my latest proposal is a way to solve that problem, and in
that case, I personally think exposing it as an extension to THP is neater:

   - all existing THP controls work as they did before
   - new anon_orders and anon_always_mask files allow opt-in to
     smaller-than-PMD-orders

As "enable" controls anon only (that's correct, right?), maybe these should also
simply be called "orders" and "always_mask". shmem could get their own set, like
"shmem_enable".

Yes, could do it that way. I thought that since "shmem_" was used when shmem was
introduced, it would be clearer to prefix the new anon controls too. Happy to
remove though.


   - All exisitng counters remain unchanged, and continue to count PMD-mapped THP
     only:
        - /proc/meminfo:AnonHugePages
        - /sys/devices/system/node/nodeX/meminfo:AnonHugePages
        - /proc/vmstat:nr_anon_transparent_hugepages
        - /proc/<pid>/smaps[_roolup]:AnonHugePages
        - memory.stat(v1):rss_huge
        - memory.stat(v2):anon_thp
   - New counters introduced to count PTE-mapped THP/large folios:
        - /proc/meminfo:AnonHugePteMap
        - /sys/devices/system/node/nodeX/meminfo:AnonHugePteMap
        - /proc/vmstat:nr_anon_thp_pte
        - /proc/<pid>/smaps[_roolup]:AnonHugePteMap
        - memory.stat(v1):anon_thp_pte
        - memory.stat(v2):anon_thp_pte
   - It's a lot less code (I have an implementation for both approaches)

Admittedly, I haven't spent too much time thinking about the other thp counters
in vmstat yet (e.g. thp_fault_alloc, thp_fault_fallback, etc). Proposal is that
for now, they would continue to be PMD-order only. But I think you could
probably hook those upto the PTE-mapped ones as well, instead of duplicating all
the counters.

As Kiril mentioned, PTE-mapped THP is already a thing, so this approach just
formalises it.

Not quite. PTE-mapped THP were just a side-effect of the transparency handling.
We never allocated and populated PTE-mapped PMD-sized THP on allocation. So I
don't immediately see the connection between both for this case.

I'm just making the point that when they become PTE-mapped, we don't stop
calling them THP. I accept that its not exactly the same though.

... PTE-mappin them does not change their size ;) But I get what you mean.



Would you account a PTE-mapped (PMD-sized) THP as anon_thp or anon_thp_pte? What
if it's mapped via PTEs and PMDs? I don't see how that formalises that case for
the existing PMD-szed THP.

I account PTE-mapped THP as anon_thp_pte. And if the same folio is mapped both
ways, I account it in both counters. I'm not claiming that anon_thp +
anon_thp_pte = amount of allocated thp in total. anon_thp_pte is intended to
help debug; you can use it to see what percentage of pte-mapped memory is THP.



I also think the "huge" means PMD-size argument is a bit weak, given that THP
supports PUD-size today for file mappings, and in the context of hugetlb, huge
can mean contpte, pmd, contpmd, pud, etc.

I made similar statements in the past but was convinced otherwise :)


I'll have the patch set ready to post by Friday. How about I post it, then we
can continue the conversation in the context of the actual code? If the
concensus is that this is not the way to do it, then I'll post the large_folio
version instead?

No strong opinion from my side, I considered a "fresh start" without the THP
implication/thermonology after all the previous discussions cleaner [which I
think was one of the outcomes of the previous discussions].


My concern is that the "fresh start" is not as simple as it appears. I've come
to the conclusion that if we have a new interface, then it should really be a
strict superset of THP to make it extensible in future. But that opens questions

^ +1

about how you configure PMD-sized allocations when both interfaces disagree. For
"enabled" its fairly straightforward; you can do a logical OR. But its less
clear how to handle disagreement over defrag. And then you have huge_zero_page
and khugepaged etc, which might just stay with THP. But eventually we will

Probably we want everything that THP had (khugepaged, zeropage, ...) also on some (selected?) smaller orders.

probably want to do async collapse for smaller order folios too, and at that
point you have to duplicate all those controls... So I concluded that actually
it is cleaner to just bolt on a small-order extension to THP. I've updated all
the docs, and that was pretty simple to do, which usually suggests that the
extension is purely additive and shouldn't be confusing.

Fine with me. I don't quite like bitmaps exposed to user space, though. Just having a user-readable list or a "directory" with various options as files might be cleaner ...


Take a look at the patches, then make a judgement ;-)


... but we'll discuss it there :)

--
Cheers,

David / dhildenb





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux