Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications

James Houghton <jthoughton@xxxxxxxxxx> · Thu, 29 Feb 2024 17:37:23 -0800

On Thu, Feb 22, 2024 at 12:50 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> I want to propose a session to discuss how we should unify hugetlb into
> core mm.
>
> Due to legacy reasons, hugetlb has plenty of its own code paths that are
> plugged into core mm, causing itself even more special than shmem.  While
> it is a pretty decent and useful file system, efficient on supporting large
> & statically allocated chunks of memory, it also added maintenance burden
> due to having its own specific code paths spread all over the place.

Thank you for proposing this topic. HugeTLB is very useful (1G
mappings, guaranteed hugepages, saving struct page overhead, shared
page tables), but it is special in ways that make it a headache to
modify (and that make it harder to work on other mm features).

I haven't been able to spend much time with HugeTLB since the LSFMM
talk last year, so I'm not much of an expert anymore. But I'll give my
two cents anyway.

> It went into a bit of a mess, and it is messed up enough to become a reason
> to not accept new major features like what used to be proposed last year to
> map hugetlb pages in smaller sizes [1].
>
> We all seem to agree something needs to be done to hugetlb, but it seems
> still not as clear on what exactly, then people forgot about it and move
> on, until hit it again.  The problem didn't yet go away itself even if
> nobody asks.
>
> Is it worthwhile to spend time do such work?  Do we really need a fresh new
> hugetlb-v2 just to accept new features?  What exactly need to be
> generalized for hugetlb?  Is huge_pte_offset() the culprit, or what else?
> To what extent hugetlb is free to accept new features?

I think the smaller unifications that have been done so far are great
(thank you!!), but at some point further unification will require a
pretty heavy lift. Here is an attempt to enumerate some of the
challenges:

What does HugeTLB do differently than main mm?
- Page table walking, huge_pte_offset/etc., of course.
- "huge_pte" as a concept (type-erased p?d_t), though it shares its
  type with pte_t.
- Completely different page fault path (hugetlbfs doesn't implement
  vm_ops->{huge_,}fault).
- mapcount
- Reservation/MAP_NORESERVE
- HWPoison handling
- Synchronization (hugetlb_fault_mutex_table, VMA lock for PMD sharing)
- more...
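
To illustrate what I mean by "type-erased", here is a toy userspace
model (not kernel code). Every name prefixed with toy_ is made up for
this sketch, and the 2M/1G sizes assume x86-64 with 4K base pages. It
only mimics the shape of huge_pte_offset(): the caller gets a pte_t *
back regardless of which level the entry lives at, and has to recover
the level from context.

```c
/* Toy userspace model of a type-erased "huge_pte". NOT kernel code. */
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned long val; } pte_t;
typedef struct { unsigned long val; } pmd_t;
typedef struct { unsigned long val; } pud_t;

/* One page-table slot, viewable as any level (hypothetical helper type). */
typedef union {
    pte_t pte;
    pmd_t pmd;
    pud_t pud;
} toy_slot;

enum toy_level { TOY_LEVEL_PMD, TOY_LEVEL_PUD };

struct toy_tables {
    toy_slot pud_slot;  /* would map 1G (assumption: x86-64, 4K pages) */
    toy_slot pmd_slot;  /* would map 2M (same assumption) */
};

/*
 * Same shape as huge_pte_offset(): the return type is pte_t * no matter
 * which level the entry belongs to, so the level is erased from the type.
 */
static pte_t *toy_huge_pte_offset(struct toy_tables *t, enum toy_level level)
{
    assert(t != NULL);
    switch (level) {
    case TOY_LEVEL_PUD:
        return &t->pud_slot.pte;
    case TOY_LEVEL_PMD:
        return &t->pmd_slot.pte;
    }
    return NULL;
}

/* The caller needs out-of-band knowledge of how much one entry maps. */
static unsigned long toy_entry_size(enum toy_level level)
{
    return level == TOY_LEVEL_PUD ? 1UL << 30 : 1UL << 21;
}
```

Anything that consumes the returned pointer has to carry the level
(or the hstate, in the real code) alongside it, which is a large part
of why these paths are hard to share with the rest of mm.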

What does HugeTLB do that main mm doesn't do?
- It keeps pools of hugepages that cannot be used for anything else.
- It has PMD sharing (which can hopefully be replaced with mshare())
- It has HVO (which can hopefully be dropped in a memdesc world)
- more...?

Page table sharing and HVO are both important, but they're not
fundamental to HugeTLB, so it's not impossible to make progress on
drastic cleanup without them.
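
For a rough sense of why HVO matters, the arithmetic can be sketched
as below. The constants are assumptions on my part (4K base pages and
a 64-byte struct page, which is typical for x86-64), the toy_ names
are made up, and how many vmemmap pages are retained per hugepage
depends on the kernel version, so that is left as a parameter.

```c
/* Back-of-the-envelope struct page overhead for one hugepage. */
#include <assert.h>
#include <stdint.h>

#define TOY_BASE_PAGE_SIZE   4096ULL  /* assumption: 4K base pages */
#define TOY_STRUCT_PAGE_SIZE 64ULL    /* assumption: typical on x86-64 */

/* Bytes of struct page metadata needed to describe one hugepage. */
static uint64_t toy_vmemmap_bytes(uint64_t hugepage_size)
{
    assert(hugepage_size % TOY_BASE_PAGE_SIZE == 0);
    return hugepage_size / TOY_BASE_PAGE_SIZE * TOY_STRUCT_PAGE_SIZE;
}

/* Bytes freed if all but retained_pages vmemmap pages are released. */
static uint64_t toy_vmemmap_saved(uint64_t hugepage_size,
                                  uint64_t retained_pages)
{
    uint64_t total = toy_vmemmap_bytes(hugepage_size);
    uint64_t kept = retained_pages * TOY_BASE_PAGE_SIZE;

    return total > kept ? total - kept : 0;
}
```

Under those assumptions a 1G page needs 16 MiB of struct pages, so
keeping only one vmemmap page per hugepage recovers almost all of it.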

No matter what, we'll need to add (more) PUD support into the main mm,
so we could start with that, though it won't be easy. Then we would
need at least...

  (1) ...a filesystem that implements huge_fault for PUDs

It's not inconceivable to add support for this in shmem (where 1G
pages are allocated -- perhaps ahead of time -- with CMA, maybe?).
This could be done in hugetlbfs, but then you'd have to make sure that
the huge_fault implementation stays compatible with everything else in
hugetlb/hugetlbfs, perhaps making incremental progress difficult. Or
you could create hugetlbfs-v2. I'm honestly not sure which of these is
the least difficult -- probably the shmem route?
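
To make (1) a bit more concrete, here is a toy userspace model of how
I picture the dispatch: try the filesystem's huge_fault at PUD order,
then PMD order, then fall back to the ordinary fault handler. This is
a sketch under assumptions, not kernel code: the toy_ names, the
simplified signatures (the real callbacks take a struct vm_fault *),
and the example filesystem are all invented, and the orders assume
x86-64 with 4K base pages.

```c
/* Toy model of huge_fault dispatch with fallback. NOT kernel code. */
#include <assert.h>
#include <stddef.h>

#define TOY_FAULT_DONE     0
#define TOY_FAULT_FALLBACK 1

#define TOY_PMD_ORDER 9U   /* 2M / 4K (assumption: x86-64) */
#define TOY_PUD_ORDER 18U  /* 1G / 4K (assumption: x86-64) */

struct toy_vm_ops {
    int (*fault)(unsigned long addr);
    int (*huge_fault)(unsigned long addr, unsigned int order); /* optional */
};

/* Try the largest mapping first; report which order ended up mapped. */
static int toy_handle_fault(const struct toy_vm_ops *ops, unsigned long addr,
                            unsigned int *mapped_order)
{
    static const unsigned int orders[] = { TOY_PUD_ORDER, TOY_PMD_ORDER };
    size_t i;

    assert(ops != NULL && ops->fault != NULL && mapped_order != NULL);

    if (ops->huge_fault != NULL) {
        for (i = 0; i < sizeof(orders) / sizeof(orders[0]); i++) {
            if (ops->huge_fault(addr, orders[i]) != TOY_FAULT_FALLBACK) {
                *mapped_order = orders[i];
                return TOY_FAULT_DONE;
            }
        }
    }

    /* No huge mapping was possible: use the base-page fault handler. */
    *mapped_order = 0;
    return ops->fault(addr);
}

/* Hypothetical filesystem: only the first 1G is backed by a 1G page. */
static int toy_fs_huge_fault(unsigned long addr, unsigned int order)
{
    if (order == TOY_PUD_ORDER && addr < (1UL << 30))
        return TOY_FAULT_DONE;
    return TOY_FAULT_FALLBACK;
}

static int toy_fs_fault(unsigned long addr)
{
    (void)addr;
    return TOY_FAULT_DONE;
}
```

HugeTLB today never goes through a dispatch like this; it has its own
fault path, which is exactly the kind of divergence a PUD-capable
huge_fault implementation would have to replace or coexist with.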

  (2) ...a mapcount (+refcount) system that works for PUD mappings.

This discussion has progressed a lot since I last thought about it;
I'll let the experts figure this one out[1].

Anyway, I'm oversimplifying things, and it's been a while since I've
thought hard about this, so please take this all with a grain of salt.
It's also worth noting that the main motivating use-case for HGM
(high-granularity mapping), which is to allow for post-copy live
migration of HugeTLB-1G-backed VMs with userfaultfd, can be solved in
other ways[2].

> The goal of such a session is trying to make it clearer on answering above
> questions.

I hope we can land on a clear answer this year. :)

- James

[1]: https://lore.kernel.org/linux-mm/049e4674-44b6-4675-b53b-62e11481a7ce@xxxxxxxxxx/
[2]: https://lore.kernel.org/kvm/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@xxxxxxxxxxxxxx/