On Thu, Feb 29, 2024 at 7:11 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> Hey, James,
>
> On Thu, Feb 29, 2024 at 05:37:23PM -0800, James Houghton wrote:
> > No matter what, we'll need to add (more) PUD support into the main mm,
> > so we could start with that, though it won't be easy. Then we would
> > need at least...
> >
> > (1) ...a filesystem that implements huge_fault for PUDs
> >
> > It's not inconceivable to add support for this in shmem (where 1G
> > pages are allocated -- perhaps ahead of time -- with CMA, maybe?).
> > This could be done in hugetlbfs, but then you'd have to make sure that
> > the huge_fault implementation stays compatible with everything else in
> > hugetlb/hugetlbfs, perhaps making incremental progress difficult. Or
> > you could create hugetlbfs-v2. I'm honestly not sure which of these is
> > the least difficult -- probably the shmem route?
>
> IMHO the hugetlb fault path can be the last to tackle; there seem to be
> other lower-hanging fruits that are good candidates for such unification
> work.
>
> For example, what if we can reduce the customized hugetlb paths from
> 20 -> 2, where the customized fault() will be 1 out of the 2?  To further
> reduce those 2 paths we may need a new file system, but if it's good
> enough maybe we don't need v2, at least not for someone looking for a
> cleanup: that is more suitable for someone who can properly define the
> new interface first, and it can be much more work than a unification
> effort, also orthogonal in some way.

This is a fine approach to take. At the same time, I think the
separate fault path is the most important difference between hugetlb
and main mm, so if we're doing a bunch of work to unify hugetlb with
mm (like, 20 -> 2 special paths), it'd be kind of a shame not to go
all the way. But I'm not exactly doing the work here. :)

(The other huge piece that I'd want unified is the huge_pte
architecture-specific functions; that's probably #2 on my list.)

> > (2) ...a mapcount (+refcount) system that works for PUD mappings.
> >
> > This discussion has progressed a lot since I last thought about it;
> > I'll let the experts figure this one out[1].
>
> I hope there will be a solid answer there.
>
> Otherwise IIRC the last plan was to use 1 mapcount for anything mapped
> underneath.  I still think it's a good plan, which may not apply to mTHP
> but could be perfectly efficient & simple for hugetlb.  The complexity
> lies elsewhere, other than in the counting itself, but I had a feeling
> it's still a workable solution.
>
> > Anyway, I'm oversimplifying things, and it's been a while since I've
> > thought hard about this, so please take this all with a grain of salt.
> > The main motivating use-case for HGM (to allow for post-copy live
> > migration of HugeTLB-1G-backed VMs with userfaultfd) can be solved in
> > other ways[2].
>
> Do you know how far David went in that direction?  When will there be a
> prototype?  Would it easily work with MISSING faults (not MINOR)?

A prototype will come eventually. :)

It's valid for a user to use KVM-based demand paging with userfaultfd,
MISSING or MINOR. For MISSING, you could do:

- Upon getting a KVM fault, KVM_RUN will exit to userspace.
- Fetch the page, install it with UFFDIO_COPY, then mark the page as
  present with KVM.

KVM-based demand paging is redundant with userfaultfd in this case
though.
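The MISSING resolution step is just the usual UFFDIO_COPY sequence. A
minimal sketch, assuming the uffd was registered with
UFFDIO_REGISTER_MODE_MISSING; resolve_missing_fault() and
kvm_mark_present() are hypothetical names (the latter stands in for the
not-yet-existing KVM demand paging "mark present" operation discussed
here):

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>

    /* Hypothetical stand-in for the KVM-based demand paging "mark the
     * page as present" step; no such interface exists upstream today. */
    extern void kvm_mark_present(unsigned long hva);

    static int resolve_missing_fault(int uffd, unsigned long fault_addr,
                                     void *src_page, size_t page_size)
    {
            struct uffdio_copy copy = {
                    .dst  = fault_addr & ~(page_size - 1),
                    .src  = (unsigned long)src_page,
                    .len  = page_size,
                    .mode = 0,
            };

            /* Atomically allocate, copy, and map the page in the
             * uffd-registered VMA, waking any waiting faulters. */
            if (ioctl(uffd, UFFDIO_COPY, &copy))
                    return -1;

            /* Tell KVM the page is now present (hypothetical step). */
            kvm_mark_present(copy.dst);
            return 0;
    }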
With minor faults, the equivalent approach would be:

- Map memory twice. Register one with userfaultfd. The other ("alias
  mapping") will be used to install memory.
- Use the userfaultfd-registered mapping to build the KVM memslots.
- Upon getting a KVM fault, KVM_RUN will exit.
- Fetch the page, install it by copying it into the alias mapping,
  then UFFDIO_CONTINUE the KVM mapping, then mark the page as present
  with KVM.

We can be a little more efficient with MINOR faults, provided we're
confident that KVM-based demand paging works properly:

- Map memory twice. Register one with userfaultfd.
- Give KVM the alias mapping, so we won't get userfaults on it. All
  other components get the userfaultfd-registered mapping.
- KVM_RUN exits to userspace.
- Fetch the page, install it in the pagecache. Mark it as present
  with KVM.
- If other components get userfaults, fetch the page (if it needs to
  be), then UFFDIO_CONTINUE to unblock them.

Now userfaultfd and KVM-based demand paging are no longer redundant.
Furthermore, if a user can guarantee that all other components are
able to properly participate in migration without userfaultfd (i.e.,
they are explicitly aware of demand paging), then the need for
userfaultfd is removed. This is just like KVM's own dirty logging vs.
userfaultfd-wp.
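To make the UFFDIO_CONTINUE step concrete, here is a minimal sketch of
the resolution step in the first MINOR flow above; the
resolve_minor_fault() name, the uffd_base/alias_base parameters, and
the offset arithmetic are illustrative, assuming both mappings are
backed by the same shmem/file object:

    #include <linux/userfaultfd.h>
    #include <string.h>
    #include <sys/ioctl.h>

    static int resolve_minor_fault(int uffd, char *uffd_base,
                                   char *alias_base,
                                   unsigned long fault_addr,
                                   const void *data, size_t page_size)
    {
            unsigned long off = (fault_addr & ~(page_size - 1)) -
                                (unsigned long)uffd_base;

            /* Populate the pagecache through the alias mapping; this
             * does not touch the uffd-registered mapping. */
            memcpy(alias_base + off, data, page_size);

            /* Map the now-present pagecache page at the faulting
             * address in the uffd-registered mapping, waking waiters. */
            struct uffdio_continue cont = {
                    .range = {
                            .start = (unsigned long)uffd_base + off,
                            .len   = page_size,
                    },
                    .mode = 0,
            };
            return ioctl(uffd, UFFDIO_CONTINUE, &cont);
    }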
> I will be more than happy to see whatever solution comes up from the
> kernel that resolves that pain for VMs first.  It's unfortunate that KVM
> will have its own solution for hugetlb small mappings, but I also
> understand there's more than one demand for that besides hugetlb on 1G
> (even though I'm not 100% sure of that demand when I think about it
> again today: is it a worry that the pgtable pages will take a lot of
> space when trapping minor-faults?  I haven't yet had time to revisit
> David's proposal there in the past two months; nor do I think I fully
> digested the details back then).

In my view, the main motivating factor is that userfaultfd is
inherently incompatible with guest_memfd. We talked a bit about the
potential to do a file-based userfaultfd, but it's very unclear how
that would work.

But a KVM-based demand paging system would be able to help with:

- post-copy for HugeTLB pages
- reducing unnecessary work/overhead in mm (for both minor faults and
  missing faults)

The "unnecessary" work/overhead:

- mm page tables get shattered as well as the EPT, whereas with a
  KVM-based solution, only the EPT is shattered.
- we must collapse both the mm page tables and the EPT at the end of
  post-copy, instead of only the EPT.
- mm page tables are mapped during post-copy, when they could be
  completely present to begin with.

You could make collapsing as efficient as possible (like, if possible,
have an mmu_notifier_collapse() instead of using invalidate_start/end,
so that KVM can do the fastest possible invalidations), but we're
fundamentally doing more work with userfaultfd.

> The answer to the above could also help me prioritize my work, e.g.,
> hugetlb unification is probably something we should do regardless, at
> least for the sake of a healthy mm code base.  I have a plan to move
> HGM, or whatever it will be called, upstream if necessary, but it can
> also depend on how fast the other project goes, as personally I don't
> yet worry about hugetlb hwpoison (at least QEMU's hwpoison handling is
> still pretty much broken.. which is pretty unfortunate), but maybe any
> serious cloud provider should still care.

My hope with the unification is that HGM almost becomes a byproduct of
that effort. :)

The hwpoison case (in my case) is also solved with a KVM-based demand
paging system: we can use it to prevent access to the page, but
instead of demand-fetching, we inject poison. (We need HugeTLB to keep
mapping the page though.)