Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

On Fri, Aug 25, 2023, Sean Christopherson wrote:
> On Tue, Jul 18, 2023, Sean Christopherson wrote:
> > This is the next iteration of implementing fd-based (instead of vma-based)
> > memory for KVM guests.  If you want the full background of why we are doing
> > this, please go read the v10 cover letter[1].
> > 
> > The biggest change from v10 is to implement the backing storage in KVM
> > itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
> > See link[2] for details on why we pivoted to a KVM-specific approach.
> > 
> > Key word is "biggest".  Relative to v10, there are many big changes.
> > Highlights below (I can't remember everything that got changed at
> > this point).
> > 
> > Tagged RFC as there are a lot of empty changelogs, and a lot of missing
> > documentation.  And ideally, we'll have even more tests before merging.
> > There are also several gaps/opens (to be discussed in tomorrow's PUCK).
> > 
> > v11:
> >  - Test private<=>shared conversions *without* doing fallocate()
> >  - PUNCH_HOLE all memory between iterations of the conversion test so that
> >    KVM doesn't retain pages in the guest_memfd
> >  - Rename hugepage control to be a very generic ALLOW_HUGEPAGE, instead of
> >    giving it a THP or PMD specific name.
> >  - Fold in fixes from a lot of people (thank you!)
> >  - Zap SPTEs *before* updating attributes to ensure no weirdness, e.g. if
> >    KVM handles a page fault and looks at inconsistent attributes
> >  - Refactor MMU interaction with attributes updates to reuse much of KVM's
> >    framework for mmu_notifiers.
> > 
> > [1] https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@xxxxxxxxxxxxxxx
> > [2] https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@xxxxxxxxxx
> 
> Trimmed the Cc substantially to discuss what needs to be done (within our control)
> to have a chance of landing this "soon".
> 
> We've chipped away at the todo list a bit, but there are still several non-trivial
> things that need to get addressed before we can merge guest_memfd().  If we move
> *really* fast, e.g. get everything addressed in less than 3 weeks, we have an outside
> chance at hitting 6.7.  But honestly, I think 6.7 is extremely unlikely.
> 
> For 6.8, we'll be in good shape if we can get a non-RFC posted in the next ~6
> weeks, i.e. by end of September, though obviously the sooner the better.  If we slip
> much beyond that, 6.8 is going to be tough due to people disappearing for year-end
> stuff and holidays, i.e. we won't have enough time to address feedback _and_ get
> another round of reviews.

Update, and adding one more TODO (thanks Asish!).

Rather than wait for all of the TODOs to be completed, I am going to send a
refreshed v12 early next week.  I will squash the fixups I have, maaaybe write a
few changelogs, and rebase onto v6.6-rc1.   And I'll capture the remaining todos
(for merging) in the cover letter.

Paolo, correct me if I misunderstood our conversation, but we still have a shot
at hitting v6.7.  The goal of sending out v12 even with the todos is to let
people test something that's close-ish to merge-ready asap.  And if all goes well,
get v13 posted in a few weeks and queued for v6.7.

Most of the remaining work is fairly straightforward, e.g. writing changelogs,
removing the dedicated file system, etc.  The notable exception is the memory
error handling, a.k.a. kvm_gmem_error_page().

So Isaku, you're "officially" on the critical path :-)  We don't need a proper
test of any kind, but the code does need to be point-tested to ensure a memory
error doesn't panic/crash the host.  Anything goes, e.g. if hacking the kernel
to generate memory errors is easier than figuring out the fault injection stuff,
then go for it, we just need some level of confidence that kvm_gmem_error_page()
won't catch fire.  Please holler if you need help, or want to hand off the task
to someone else.


Precise shared vs. private mappings
-----------------------------------

Teach mmu_notifier events to zap only shared mappings.  This is a must-have for
SNP performance, and maybe even a hard requirement for TDX.  Not zapping private
mappings in response to mmu_notifier events means KVM will *never* free private
memory in response to mmu_notifier events, which will in turn allow KVM to skip the
cache flush (a nasty WBINVD everywhere) that's currently done when memory that
*might* be private is freed.

Sean will own, plan here: https://lore.kernel.org/all/ZJX0hk+KpQP0KUyB@xxxxxxxxxx

> Speaking purely from a personal perspective, I really, really want to hit 6.8 so
> that this doesn't drag into 2024.
> 
> Loosely ordered by size and urgency (bigger, more urgent stuff first).  Please
> holler if I've missed something (taking notes is my Achilles heel).
> 
> 
> Filemap vs. xarray
> ------------------
> This is the main item that needs attention.  I don't want to merge guest_memfd()
> without doing this comparison, as not using filemap means we don't need AS_UNMOVABLE.
> Arguably we could merge a filemap implementation without AS_UNMOVABLE and just eat
> the suboptimal behavior, but not waiting a little while longer to do everything we
> can to get this right the first time seems ridiculous after we've been working on
> this for literally years.
> 
> Paolo was going to work on an xarray implementation, but AFAIK he hasn't done
> anything yet.  We (Google) don't have anyone available to work on an xarray
> implementation for several weeks (at best), so if anyone has the bandwidth and
> desire to take a stab at an xarray implementation, please speak up.
> 
> 
> kvm_gmem_error_page()
> ---------------------
> As pointed out by Vishal[*], guest_memfd()'s error/poison handling is garbage.
> KVM needs to unmap, check for poison, and probably also restrict the allowed
> mapping size if a partial page is poisoned.
> 
> This item also needs actually testing, e.g. via error injection.  Writing a
> proper selftest may not be feasible, but at a bare minimum, someone needs to
> manually verify an error on a guest_memfd() can get routed all the way into the
> guest, e.g. as an #MC on x86.
> 
> This needs an owner.  I'm guessing 2-3 weeks?  Though I tend to be overly
> optimistic when sizing these things...
> 
> [*] https://lore.kernel.org/all/CAGtprH9a2jX-hdww9GPuMrO9noNeXkoqE8oejtVn2vD0AZa3zA@xxxxxxxxxxxxxx
> 
> 
> Documentation
> -------------
> Obviously a must have.  AFAIK, no one is "officially" signed up to work on this.
> I honestly haven't looked at the document in recent versions, so I have no idea
> how much effort is required.
> 
> 
> Fully anonymous inode vs. proper filesystem
> -------------------------------------------
> This is another one that needs to get sorted out before merging, but it should
> be a much smaller task (a day or two).  I will get to this in a few weeks unless
> someone beats me to the punch.
> 
> 
> KVM_CAP_GUEST_MEMFD
> -------------------
> New ioctl() needs a new cap.  Trivial, just capturing here so I don't forget.
> 
> 
> Changelogs
> ----------
> This one is on me, though I will probably procrastinate until all the other todo
> items are close to being finished.
> 
> 
> Tests
> -----
> I would really like to have a test that verifies KVM actually installs hugepages,
> but I'm ok merging without such a test, mainly because I suspect it will be
> annoyingly difficult to end up with a test that isn't flaky.
> 
> Beyond that, and the aforementioned memory poisoning, IMO, we have enough test
> coverage.  I am always open to more tests, but I don't think adding more coverage
> is a must have for merging.
> 
> 
> .release_folio and .invalidate_folio versus .evict_inode
> --------------------------------------------------------
> I think we're good on this one?  IIRC, without a need to "clean" physical memory
> (SNP and TDX), we don't need to do anything special.
> 
> Mike or Ackerley, am I forgetting anything?
> 
> 
> NUMA
> ----
> I am completely comfortable doing nothing as part of the initial merge.  We have
> line of sight to supporting NUMA policies in the form of fbind(), and I would be
> quite surprised if we get much pushback on implementing fbind().
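
For the record, fbind() is just the file-descriptor analog of mbind().  A
plausible shape, purely illustrative since no such syscall exists today:

```
/* Hypothetical: apply a NUMA policy to a range of a file's pages, so
 * guest_memfd memory can be placed without requiring a userspace mapping. */
long fbind(int fd, unsigned long offset, unsigned long len,
	   unsigned long mode,		/* MPOL_BIND, MPOL_INTERLEAVE, ... */
	   const unsigned long *nodemask, unsigned long maxnode);
```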
> 
> 
> RSS stats
> ---------
> My preference is to not do anything in the initial implementation, and defer any
> changes until later.  IMO, while odd, not capturing guest_memfd() in RSS is
> acceptable as there are no virtual mappings to account.  I completely agree that
> we would ideally surface the memory usage to userspace in some way, but I don't
> think it's so critical that it needs to happen as part of the initial merge.
> 
> 
> Intrahost migration support
> ---------------------------
> Ackerley's RFC[*] is enough for me to have confidence that we can support
> intrahost migration without having to rework the ABI.  Please holler if you
> disagree.
> 
> [*] https://lkml.kernel.org/r/cover.1691446946.git.ackerleytng%40google.com
> 


