Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications

On Wed, Mar 06, 2024 at 03:24:04PM -0800, James Houghton wrote:
> On Thu, Feb 29, 2024 at 7:11 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > Hey, James,
> >
> > On Thu, Feb 29, 2024 at 05:37:23PM -0800, James Houghton wrote:
> > > No matter what, we'll need to add (more) PUD support into the main mm,
> > > so we could start with that, though it won't be easy. Then we would
> > > need at least...
> > >
> > >   (1) ...a filesystem that implements huge_fault for PUDs
> > >
> > > It's not inconceivable to add support for this in shmem (where 1G
> > > pages are allocated -- perhaps ahead of time -- with CMA, maybe?).
> > > This could be done in hugetlbfs, but then you'd have to make sure that
> > > the huge_fault implementation stays compatible with everything else in
> > > hugetlb/hugetlbfs, perhaps making incremental progress difficult. Or
> > > you could create hugetlbfs-v2. I'm honestly not sure which of these is
> > > the least difficult -- probably the shmem route?
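
For reference, the rough shape such a handler would take, loosely following
the dev_dax pattern -- the exact huge_fault signature has shifted across
kernel versions (order-based here), and my_fault()/my_pmd_fault()/
my_pud_fault() are placeholders:

  static vm_fault_t my_huge_fault(struct vm_fault *vmf, unsigned int order)
  {
          if (order == PUD_SHIFT - PAGE_SHIFT)
                  return my_pud_fault(vmf);       /* the missing 1G piece */
          if (order == PMD_SHIFT - PAGE_SHIFT)
                  return my_pmd_fault(vmf);
          /* Anything else falls back to the regular ->fault() path. */
          return VM_FAULT_FALLBACK;
  }

  static const struct vm_operations_struct my_vm_ops = {
          .fault      = my_fault,
          .huge_fault = my_huge_fault,
  };
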
> >
> > IMHO the hugetlb fault path can be the last to tackle; there seem to be
> > other lower-hanging fruits that are good candidates for such unification work.
> >
> > For example, what if we can reduce the customized hugetlb paths from 20 -> 2,
> > where the customized fault() will be one of the remaining two?  To further
> > reduce those two paths we may need a new file system, but if the result is
> > good enough maybe we don't need a v2, at least not for someone looking for a
> > cleanup: a v2 is more suitable for someone who can properly define the new
> > interface first, and it can be much more work than a unification effort, and
> > is also somewhat orthogonal to it.
> 
> This is a fine approach to take. At the same time, I think the
> separate fault path is the most important difference between hugetlb
> and main mm, so if we're doing a bunch of work to unify hugetlb with
> mm (like, 20 -> 2 special paths), it'd be kind of a shame not to go
> all the way. But I'm not exactly doing the work here. :)

My goal was never to merge everything, but to make hugetlb more maintainable
so that people don't need to worry about evolving it at some point.  I don't
think I have thought everything through yet, but I hope more things will
become clear in the next few months.

> 
> (The other huge piece that I'd want unified is the huge_pte
> architecture-specific functions; that's probably #2 on my list.)
> 
> > >
> > >   (2) ...a mapcount (+refcount) system that works for PUD mappings.
> > >
> > > This discussion has progressed a lot since I last thought about it;
> > > I'll let the experts figure this one out[1].
> >
> > I hope there will be a solid answer there.
> >
> > Otherwise IIRC the last plan was to use 1 mapcount for anything mapped
> > underneath.  I still think it's a good plan, which may not apply to mTHP
> > but could be perfectly efficient & simple for hugetlb.  The complexity lies
> > elsewhere, not in the counting itself, but I have a feeling it's still a
> > workable solution.
> >
> > >
> > > Anyway, I'm oversimplifying things, and it's been a while since I've
> > > thought hard about this, so please take this all with a grain of salt.
> > > The main motivating use-case for HGM (to allow for post-copy live
> > > migration of HugeTLB-1G-backed VMs with userfaultfd) can be solved in
> > > other ways[2].
> >
> > Do you know how far David went in that direction?  When will there be a
> > prototype?  Would it easily work with MISSING faults (not MINOR)?
> 
> A prototype will come eventually. :)
> 
> It's valid for a user to use KVM-based demand paging with userfaultfd,
> MISSING or MINOR. For MISSING, you could do:
> - Upon getting a KVM fault, KVM_RUN will exit to userspace.
> - Fetch the page, install it with UFFDIO_COPY, then mark the page as
> present with KVM.
> 
> KVM-based demand paging is redundant with userfaultfd in this case though.
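
For concreteness, a minimal userspace sketch of that MISSING flow could look
like the below.  fetch_page_from_source() and kvm_mark_present() are
hypothetical placeholders (the latter standing in for whatever KVM-side
interface ends up existing); UFFDIO_COPY is the existing userfaultfd ioctl:

  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>

  /* Hypothetical helpers, not real interfaces. */
  extern void *fetch_page_from_source(unsigned long hva, unsigned long len);
  extern int kvm_mark_present(unsigned long hva, unsigned long len);

  /* Called after KVM_RUN exits with a demand-paging fault on 'hva'. */
  static int resolve_missing(int uffd, unsigned long hva, unsigned long page_size)
  {
          struct uffdio_copy copy = {
                  .dst  = hva,
                  .src  = (unsigned long)fetch_page_from_source(hva, page_size),
                  .len  = page_size,
                  .mode = 0,
          };

          /* Atomically allocate, copy and map the page at the faulting address. */
          if (ioctl(uffd, UFFDIO_COPY, &copy))
                  return -1;

          /* Tell KVM the page is now present (hypothetical call). */
          return kvm_mark_present(hva, page_size);
  }
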
> 
> With minor faults, the equivalent approach would be:
> - Map memory twice. Register one with userfaultfd. The other ("alias
> mapping") will be used to install memory.
> - Use the userfaultfd-registered mapping to build the KVM memslots.
> - Upon getting a KVM fault, KVM_RUN will exit.
> - Fetch the page, install it by copying it into the alias mapping,
> then UFFDIO_CONTINUE the KVM mapping, then mark the page as present
> with KVM.
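
A matching sketch for that MINOR flow, with the copy going through the alias
mapping and UFFDIO_CONTINUE (again an existing ioctl) resolving the fault in
the registered mapping; same hypothetical helpers as above:

  /*
   * main_map:  registered with UFFDIO_REGISTER_MODE_MINOR, used for memslots
   * alias_map: second mapping of the same file, used to install contents
   */
  static int resolve_minor(int uffd, char *main_map, char *alias_map,
                           unsigned long offset, unsigned long page_size)
  {
          struct uffdio_continue cont = {
                  .range = {
                          .start = (unsigned long)main_map + offset,
                          .len   = page_size,
                  },
          };

          /* Populate the page cache through the unregistered alias mapping. */
          memcpy(alias_map + offset,
                 fetch_page_from_source((unsigned long)main_map + offset, page_size),
                 page_size);

          /* Map the now-present page cache page in the registered mapping. */
          if (ioctl(uffd, UFFDIO_CONTINUE, &cont))
                  return -1;

          return kvm_mark_present((unsigned long)main_map + offset, page_size);
  }
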
> 
> We can be a little more efficient with MINOR faults, provided we're
> confident that KVM-based demand paging works properly:
> - Map memory twice. Register one with userfaultfd.
> - Give KVM the alias mapping, so we won't get userfaults on it. All
> other components get the userfaultfd-registered mapping.
> - KVM_RUN exits to userspace.
> - Fetch the page, install it in the pagecache. Mark it as present with KVM.
> - If other components get userfaults, fetch the page (if it hasn't been
> fetched already), then UFFDIO_CONTINUE to unblock them.
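
The setup for either MINOR variant is just the double mapping plus a
MINOR-mode registration on one of the two; which mapping goes into the KVM
memslots is what differs between the two schemes.  A sketch with error
handling omitted (memfd-backed here, but a hugetlbfs file works the same way):

  int memfd = memfd_create("guest", MFD_HUGETLB);  /* plus an MFD_HUGE_* size flag */
  ftruncate(memfd, guest_size);

  char *main_map  = mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, memfd, 0);
  char *alias_map = mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, memfd, 0);

  int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
  struct uffdio_api api = { .api = UFFD_API };
  ioctl(uffd, UFFDIO_API, &api);

  /* Only main_map raises minor faults; alias_map never hits userfaultfd. */
  struct uffdio_register reg = {
          .range = { .start = (unsigned long)main_map, .len = guest_size },
          .mode  = UFFDIO_REGISTER_MODE_MINOR,
  };
  ioctl(uffd, UFFDIO_REGISTER, &reg);
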
> 
> Now userfaultfd and KVM-based demand paging are no longer redundant.
> Furthermore, if a user can guarantee that all other components are
> able to properly participate in migration without userfaultfd (i.e.,
> they are explicitly aware of demand paging), then the need for
> userfaultfd is removed.
> 
> This is just like KVM's own dirty logging vs. userfaultfd-wp.

The "register one for each" idea is interesting, but it's pretty sad to
know that it still requires userspace's awareness to support demand paging.

Supporting dirty logging is IMHO a pain already.  It required so many
customized interfaces, including KVM's, most of which might not have been
necessary at all if mm had been able to provide a generic async tracking API
like soft-dirty or uffd-wp at that time.  I guess soft-dirty didn't exist
when GET_DIRTY_LOG was proposed?
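
For comparison, the generic API in that world is roughly just uffd-wp
registration plus UFFDIO_WRITEPROTECT, with dirtiness showing up as
write-protect faults rather than through a KVM-specific ioctl.  A rough
sketch of one tracking pass:

  /* Register the region for write-protect tracking. */
  struct uffdio_register reg = {
          .range = { .start = (unsigned long)mem, .len = size },
          .mode  = UFFDIO_REGISTER_MODE_WP,
  };
  ioctl(uffd, UFFDIO_REGISTER, &reg);

  /* Start a pass: write-protect the whole range. */
  struct uffdio_writeprotect wp = {
          .range = { .start = (unsigned long)mem, .len = size },
          .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
  };
  ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

  /*
   * Each write now produces a UFFD_EVENT_PAGEFAULT message with
   * UFFD_PAGEFAULT_FLAG_WP set; resolving it with another
   * UFFDIO_WRITEPROTECT call (mode = 0) unblocks the writer, and that
   * is the point where userspace records the page as dirty.
   */
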

I think this means the proposal has decided to ignore my previous questions
in the initial thread on things like "how do we support vhost with the new
demand paging".

> 
> >
> > I will be more than happy to see whatever solution comes up from the kernel
> > that resolves that pain for VMs first.  It's unfortunate that KVM will have
> > its own solution for hugetlb small mappings, but I also understand there's
> > more than one demand for it besides hugetlb at 1G (even though I'm not 100%
> > sure of that demand when I think about it again today: is it a worry that
> > the pgtable pages will take a lot of space when trapping minor faults?  I
> > haven't yet had time to revisit David's proposal in the past two months;
> > nor do I think I fully digested the details back then).
> 
> In my view, the main motivating factor is that userfaultfd is
> inherently incompatible with guest_memfd. We talked a bit about the
> potential to do a file-based userfaultfd, but it's very unclear how
> that would work.
> 
> But a KVM-based demand paging system would be able to help with:
> - post-copy for HugeTLB pages
> - reducing unnecessary work/overhead in mm (for both minor faults and missing faults).
> 
> The "unnecessary" work/overhead:
> - shattered mm page tables as well as shattered EPT, whereas with a
> KVM-based solution, only the EPT is shattered.
> - must collapse both mm page tables and EPT at the end of post-copy,
> instead of only the EPT
> - mm page tables are mapped during post-copy, when they could be
> completely present to begin with

IMHO this is the trade-off of providing a generic solution.  Now afaict
we're pushing the complexity to outside KVM.  IIRC I left similar comments
before.

Obviously that's also one reason why I started working on something in this
area (even if I don't know how far I'll go yet), as I don't want to see a
KVM-specific solution proposed only because mm rejected some generic solution
and left no other option.  I want to provide that option and make a fair
comparison between the two.

So far I still think guest_memfd should implement its own demand paging
(e.g. there's no worry about "how to support vhost" in that case, because
vhost doesn't even have a mapping when memory is encrypted), leaving generic
guest memory types to mm as before.  But I'll stop here and leave my other
comments for when the proposal is sent.

> 
> You could make collapsing as efficient as possible (like, if possible,
> have an mmu_notifier_collapse() instead of using invalidate_start/end,
> so that KVM can do the fastest possible invalidations), but we're
> fundamentally doing more work with userfaultfd.
> 
> > The answer to the above could also help me prioritize my work, e.g., the
> > hugetlb unification is probably something we should do regardless, at least
> > for the sake of a healthy mm code base.  I have a plan to move HGM, or
> > whatever it will be called, upstream if necessary, but that can also depend
> > on how fast the other project goes, as personally I don't worry about
> > hugetlb hwpoison yet (at least QEMU's hwpoison handling is still pretty much
> > broken.. which is unfortunate), but maybe any serious cloud provider should
> > still care.
> 
> My hope with the unification is that HGM almost becomes a byproduct of
> that effort. :)

I think it'll be a separate project.  The unification effort seems to be
wanted regardless, while for the next step I'll need to evaluate how hard it
will be to support the new interface in QEMU (whatever my own preference on
the approach may be..).

I have a feeling that the new kvm demand paging proposal may come with so
many limitations that I'll have no choice but to keep pursuing HGM (just
consider having to implement a demand paging scheme for all virtio devices
like vhost; that can be N times the work for me).

> 
> The hwpoison case (in my case) is also solved with a KVM-based demand
> paging system: we can use it to prevent access to the page, but
> instead of demand-fetching, we inject poison. (We need HugeTLB to keep
> mapping the page though.)

Hmm.. I'm curious how it keeps mapping the huge page if part of it is
poisoned, with the current mm code?

Thanks,

-- 
Peter Xu




