Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications

Hey, James,

On Thu, Feb 29, 2024 at 05:37:23PM -0800, James Houghton wrote:
> On Thu, Feb 22, 2024 at 12:50 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > I want to propose a session to discuss how we should unify hugetlb into
> > core mm.
> >
> > Due to legacy reasons, hugetlb has plenty of its own code paths that are
> > plugged into core mm, making it even more special than shmem.  While it
> > is a pretty decent and useful file system, efficient at serving large &
> > statically allocated chunks of memory, it also adds maintenance burden
> > because its own specific code paths are spread all over the place.
> 
> Thank you for proposing this topic. HugeTLB is very useful (1G
> mappings, guaranteed hugepages, saving struct page overhead, shared
> page tables), but it is special in ways that make it a headache to
> modify (and making it harder to work on other mm features).
> 
> I haven't been able to spend much time with HugeTLB since the LSFMM
> talk last year, so I'm not much of an expert anymore. But I'll give my
> two cents anyway.
> 
> > It has become a bit of a mess -- messy enough to be a reason not to
> > accept major new features, like last year's proposal to map hugetlb
> > pages in smaller sizes [1].
> >
> > We all seem to agree that something needs to be done about hugetlb, but
> > it is still not clear what exactly; people forget about it and move on,
> > until they hit it again.  The problem hasn't gone away by itself just
> > because nobody asks.
> >
> > Is it worthwhile to spend time on such work?  Do we really need a fresh
> > hugetlb-v2 just to accept new features?  What exactly needs to be
> > generalized for hugetlb?  Is huge_pte_offset() the culprit, or something
> > else?  To what extent is hugetlb free to accept new features?
> 
> I think the smaller unification that has been done so far is great
> (thank you!!), but at some point additional unification will require a
> pretty heavy lift. Trying to enumerate some possible challenges:
> 
> What does HugeTLB do differently than main mm?
> - Page table walking, huge_pte_offset/etc., of course.
> - "huge_pte" as a concept (type-erased p?d_t), though it shares its
> type with pte_t.
> - Completely different page fault path (hugetlbfs doesn't implement
> vm_ops->{huge_,}fault).
> - mapcount
> - Reservation/MAP_NORESERVE
> - HWPoison handling
> - Synchronization (hugetlb_fault_mutex_table, VMA lock for PMD sharing)
> - more...
> 
> What does HugeTLB do that main mm doesn't do?
> - It keeps pools of hugepages that cannot be used for anything else.
> - It has PMD sharing (which can hopefully be replaced with mshare())
> - It has HVO (which can hopefully be dropped in a memdesc world)
> - more...?
> 
> Page table sharing and HVO are both important, but they're not
> fundamental to HugeTLB, so it's not impossible to make progress on
> drastic cleanup without them.
> 
> No matter what, we'll need to add (more) PUD support into the main mm,
> so we could start with that, though it won't be easy. Then we would
> need at least...
> 
>   (1) ...a filesystem that implements huge_fault for PUDs
> 
> It's not inconceivable to add support for this in shmem (where 1G
> pages are allocated -- perhaps ahead of time -- with CMA, maybe?).
> This could be done in hugetlbfs, but then you'd have to make sure that
> the huge_fault implementation stays compatible with everything else in
> hugetlb/hugetlbfs, perhaps making incremental progress difficult. Or
> you could create hugetlbfs-v2. I'm honestly not sure which of these is
> the least difficult -- probably the shmem route?

IMHO the hugetlb fault path can be the last thing to tackle; there seems to
be other lower-hanging fruit that makes good candidates for this
unification work.

For example, what if we could reduce the number of customized hugetlb paths
from 20 to 2, with the customized fault() being one of the remaining 2?
Removing those last 2 paths may require a new file system, but if the
result is good enough maybe we don't need a v2 at all, at least not for
someone who is just after a cleanup.  A v2 is better suited to someone who
can properly define the new interface first, and it could be much more work
than the unification effort -- also somewhat orthogonal to it.

> 
>   (2) ...a mapcount (+refcount) system that works for PUD mappings.
> 
> This discussion has progressed a lot since I last thought about it;
> I'll let the experts figure this one out[1].

I hope there will be a solid answer there.

Otherwise, IIRC the last plan was to use a single mapcount for anything
mapped underneath.  I still think it's a good plan; it may not apply to
mTHP, but it could be perfectly efficient & simple for hugetlb.  The
complexity lies elsewhere, not in the counting itself, but I have a feeling
it's still a workable solution.

> 
> Anyway, I'm oversimplifying things, and it's been a while since I've
> thought hard about this, so please take this all with a grain of salt.
> The main motivating use-case for HGM (to allow for post-copy live
> migration of HugeTLB-1G-backed VMs with userfaultfd) can be solved in
> other ways[2].

Do you know how far David has gone in that direction?  When will there be a
prototype?  Would it easily work with MISSING faults (not MINOR)?

I will be more than happy to see whatever solution comes out of the kernel
that resolves that pain for VMs first.  It's unfortunate that KVM will have
its own solution for hugetlb small mappings, but I also understand there's
more than one demand for it besides hugetlb on 1G (even though I'm not 100%
sure of that demand when I think about it again today: is the worry that
the pgtable pages will take a lot of space when trapping minor faults?  I
haven't had time to revisit David's proposal in the past two months, nor do
I think I fully digested the details back then).

The answer to the above could also help me prioritize my work.  Hugetlb
unification is probably something we should do regardless, at least for the
sake of a healthy mm code base.  I plan to push HGM (or whatever it will be
called) upstream if necessary, but that also depends on how fast the other
project goes.  Personally I don't worry about hugetlb hwpoison yet (at
least QEMU's hwpoison handling is still pretty much broken.. which is
unfortunate), but maybe any serious cloud provider should still care.

> 
> > The goal of such a session is trying to make it clearer on answering above
> > questions.
> 
> I hope we can land on a clear answer this year. :)

Yes. :)  Thanks for the write-up and summary.

> 
> - James
> 
> [1]: https://lore.kernel.org/linux-mm/049e4674-44b6-4675-b53b-62e11481a7ce@xxxxxxxxxx/
> [2]: https://lore.kernel.org/kvm/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@xxxxxxxxxxxxxx/
> 

-- 
Peter Xu




