Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs

David Hildenbrand <david@xxxxxxxxxx> · Thu, 8 Jun 2023 08:34:10 +0200

On 08.06.23 02:02, David Rientjes wrote:
On Wed, 7 Jun 2023, Mike Kravetz wrote:

Are there strong objections to extending hugetlb for this support?

I don't want to get too involved in this discussion (busy), but I
absolutely agree on the points that were raised at LSF/MM that

(A) hugetlb is complicated and very special (many things not integrated
with core-mm, so we need special-casing all over the place). [example:
what is a pte?]

(B) We added a bunch of complexity in the past that some people
considered very important (and it was not feature frozen, right? ;) ).
Looking back, we might just not have done some of that, or done it
differently/cleaner -- better integrated in the core. (PMD sharing,
MAP_PRIVATE, a reservation mechanism that still requires preallocation
because it fails with NUMA/fork, ...)

(C) Unifying hugetlb and the core looks like it's getting more and more
out of reach, maybe even impossible with all the complexity we added
over the years (well, and keep adding).

Sure, HGM for the purpose of better hwpoison handling makes sense. But
hugetlb is probably 20 years old and hwpoison handling probably 13 years
old. So we managed to get quite far without that optimization.

Sane handling for memory poisoning and optimizations for live migration
are both much more important for the real-world 1GB hugetlb user, so it
doesn't quite have that lengthy of a history.

Unfortunately, cloud providers receive complaints about both of these from
customers.  They are one of the most significant causes for poor customer
experience.

While people have proposed 1GB THP support in the past, it was nacked, in
part, because of the suggestion to just use existing 1GB support in
hugetlb instead :)

Yes, because I still think that the use for "transparent" (for the user) 
nowadays is very limited and not worth the complexity.

IMHO, what you really want is a pool of large pages (with guarantees 
about availability and nodes) and fine control about who gets these 
pages. That's what hugetlb provides.
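To make the "pool" part concrete, here is a minimal user-space sketch that 
reads the global hugetlb pool counters from sysfs. The helper names 
(hugetlb_counter_path, read_hugetlb_counter) are made up for illustration; 
the sysfs layout is the existing one, and per-node counters live under 
/sys/devices/system/node/node<N>/hugepages/ with the same file names. It 
assumes a Linux system with hugetlb enabled and simply returns -1 otherwise.

```c
#include <assert.h>
#include <stdio.h>

/* Build the sysfs path of one global hugetlb pool counter.
 * size_kb is the huge page size in KiB (2048 for 2 MiB, 1048576 for 1 GiB);
 * name is a counter file such as "nr_hugepages" or "free_hugepages".
 * Returns 0 on success, -1 if buf is too small.
 * (Hypothetical helper, not an existing API.) */
static int hugetlb_counter_path(char *buf, size_t len,
                                unsigned long size_kb, const char *name)
{
    assert(buf != NULL && name != NULL);
    int n = snprintf(buf, len, "/sys/kernel/mm/hugepages/hugepages-%lukB/%s",
                     size_kb, name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read one pool counter. Returns the value, or -1 if the file does not
 * exist or cannot be parsed (e.g. that page size is not supported). */
static long read_hugetlb_counter(unsigned long size_kb, const char *name)
{
    char path[128];
    long value = -1;
    FILE *f;

    if (hugetlb_counter_path(path, sizeof(path), size_kb, name) != 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

For example, read_hugetlb_counter(1048576, "free_hugepages") would report 
how many 1GB pages are currently free in the pool, which is the kind of 
availability guarantee THP does not give you.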

In contrast to THP, you don't want to allow for
* Partially mmap, mremap, munmap, mprotect them
* Partially sharing them / COW'ing them
* Partially mixing them with other anon pages (MADV_DONTNEED + refault)
* Exclude them from some features (KSM/swap)
* (swap them out and eventually split them for that)

Because you don't want to get these pages PTE-mapped by the system 
*unless* there is a real reason (HGM, hwpoison) -- you want guarantees. 
Once such a page is PTE-mapped, you only want to collapse in place.

But you don't want special-HGM, you simply want the core to PTE-map them 
like a (file) THP.

IMHO, getting that realized would be much easier if we didn't have to 
care about some of the hugetlb complexity I raised (MAP_PRIVATE, PMD 
sharing), but maybe there is a way ...

Absolutely, HGM for better postcopy live migration also makes sense, I
guess nobody disagrees on that.

But as discussed in that session, maybe we should just start anew and
implement something that integrates nicely with the core, instead of
making hugetlb more complicated and even more special.

Certainly an ideal would be where we could support everybody's use cases
in a much more cohesive way with the rest of the core MM.  I'm
particularly concerned about how long it will take to get to that state
even if we had kernel developers committed to doing the work.  Even if we
had a design for this new subsystem that was more tightly coupled with the
core MM, it would take O(years) to implement, test, extend for other
architectures, and that's before any existing users of hugetlb could
make the changes in the rest of their software stack to support it.

One interesting experiment would be to just take hugetlb and remove all 
complexity (strip it to its core: a pooling of large pages without 
special MAP_PRIVATE support, PMD sharing, reservations, ...). Then, see 
how to get core-mm to just treat them like PUD/PMD-mapped folios that 
can get PTE-mapped -- just like we have with FS-level THP.

Maybe we could then factor out what's shared with the old hugetlb 
implementations (e.g., pooling) and have both co-exist (e.g., configured 
at runtime).

The user-space interface for hugetlb would not change (well, except 
failing MAP_PRIVATE for now).

(especially, no messing with anon hugetlb pages)
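As a rough illustration of the interface that would stay the same, here is 
a minimal sketch of a shared hugetlb mapping from user space. The helper 
names (round_up_huge, map_hugetlb_shared) are made up for this example; 
mmap() and MAP_HUGETLB are the existing interface. It uses the default huge 
page size and returns NULL when the pool is empty or hugetlb is unavailable.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* for MAP_ANONYMOUS and MAP_HUGETLB on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Round len up to a multiple of huge_sz, which must be a power of two.
 * munmap() on a hugetlb mapping needs a huge-page-aligned length.
 * (Hypothetical helper, not an existing API.) */
static size_t round_up_huge(size_t len, size_t huge_sz)
{
    assert(huge_sz != 0 && (huge_sz & (huge_sz - 1)) == 0);
    return (len + huge_sz - 1) & ~(huge_sz - 1);
}

/* Map len bytes of shared, anonymous hugetlb memory using the default
 * huge page size. Returns NULL on failure (no free pages in the pool,
 * hugetlb not configured, or an invalid length). */
static void *map_hugetlb_shared(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

A caller would do something like 
map_hugetlb_shared(round_up_huge(len, 2UL << 20)), where the 2 MiB size is 
an assumption (the real default is the Hugepagesize line in /proc/meminfo), 
and later munmap() the same rounded length. Today's hugetlb also accepts 
MAP_PRIVATE in place of MAP_SHARED; in the stripped-down variant sketched 
above, that is the one call that would be expected to fail.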

Again, the spirit would be "teach the core to just treat them like 
folios that can get PTE-mapped" instead of "add HGM to hugetlb". If we 
can achieve that without a hugetlb v2, great. But I think that will be 
harder ... but I might just be wrong.

--
Cheers,

David / dhildenb