Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

Sean Christopherson <seanjc@xxxxxxxxxx> · Fri, 25 Aug 2023 10:47:12 -0700

On Tue, Jul 18, 2023, Sean Christopherson wrote:
> This is the next iteration of implementing fd-based (instead of vma-based)
> memory for KVM guests.  If you want the full background of why we are doing
> this, please go read the v10 cover letter[1].
> 
> The biggest change from v10 is to implement the backing storage in KVM
> itself, and expose it via a KVM ioctl() instead of a "generic" sycall.
> See link[2] for details on why we pivoted to a KVM-specific approach.
> 
> Key word is "biggest".  Relative to v10, there are many big changes.
> Highlights below (I can't remember everything that got changed at
> this point).
> 
> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
> documentation.  And ideally, we'll have even more tests before merging.
> There are also several gaps/opens (to be discussed in tomorrow's PUCK).
> 
> v11:
>  - Test private<=>shared conversions *without* doing fallocate()
>  - PUNCH_HOLE all memory between iterations of the conversion test so that
>    KVM doesn't retain pages in the guest_memfd
>  - Rename hugepage control to be a very generic ALLOW_HUGEPAGE, instead of
>    giving it a THP or PMD specific name.
>  - Fold in fixes from a lot of people (thank you!)
>  - Zap SPTEs *before* updating attributes to ensure no weirdness, e.g. if
>    KVM handles a page fault and looks at inconsistent attributes
>  - Refactor MMU interaction with attributes updates to reuse much of KVM's
>    framework for mmu_notifiers.
> 
> [1] https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@xxxxxxxxxxxxxxx
> [2] https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@xxxxxxxxxx

Trimmed the Cc substantially to discuss what needs to be done (within our control)
to have a chance of landing this "soon".

We've chipped away at the todo list a bit, but there are still several non-trivial
things that need to get addressed before we can merge guest_memfd().  If we move
*really* fast, e.g. get everything address in less than 3 weeks, we have an outside
chance at hitting 6.7.  But honestly, I think it 6.7 is extremely unlikely.

For 6.8, we'll be in good shape if we can get a non-RFC posted in the next ~6
weeks, i.e. by end of September, though obviously the sooner the better.  If we slip
much beyond that, 6.8 is going to be tough due to people disappearing for year-end
stuff and holidays, i.e. we won't have enough time to address feedback _and_ get a
another round of reviews.

Speaking purely from a personal perspective, I really, really want to hit 6.8 so
that this doesn't drag into 2024.

Loosely ordered by size and urgency (bigger, more urgent stuff first).  Please
holler if I've missed something (taking notes is my Achilles heel).

Filemap vs. xarray
------------------
This is the main item that needs attention.  I don't want to merge guest_memfd()
without doing this comparison, as not using filemap means we don't need AS_UNMOVABLE.
Arguably we could merge a filemap implementation without AS_UNMOVABLE and just eat
the suboptimal behavior, but not waiting a little while longer to do everything we
can to get this right the first time seems ridiculous after we've been working on
this for literally years.

Paolo was going to work on an axarray implementation, but AFAIK he hasn't done
anything yet.  We (Google) don't have anyone available to work on an xarray
implementation for several weeks (at best), so if anyone has the bandwidth and
desire to take stab at an xarray implementation, please speak up.

kvm_gmem_error_page()
---------------------
As pointed out by Vishal[*], guest_memfd()'s error/poison handling is garbage.
KVM needs to unmap, check for poison, and probably also restrict the allowed
mapping size if a partial page is poisoned.

This item also needs actually testing, e.g. via error injection.  Writing a
proper selftest may not be feasible, but at a bare minimum, someone needs to
manually verify an error on a guest_memfd() can get routed all the way into the
guest, e.g. as an #MC on x86.

This needs an owner.  I'm guessing 2-3 weeks?  Though I tend to be overly
optimistic when sizing these things...

[*] https://lore.kernel.org/all/CAGtprH9a2jX-hdww9GPuMrO9noNeXkoqE8oejtVn2vD0AZa3zA@xxxxxxxxxxxxxx

Documentation
-------------
Obviously a must have.  AFAIK, no one is "officially" signed up to work on this.
I honestly haven't looked at the document in recent versions, so I have no idea
how much effort is required.

Fully anonymous inode vs. proper filesystem
-------------------------------------------
This is another one that needs to get sorted out before merging, but it should
be a much smaller task (a day or two).  I will get to this in a few weeks unless
someone beats me to the punch.

KVM_CAP_GUEST_MEMFD
-------------------
New ioctl() needs a new cap.  Trivial, just capturing here so I don't forget.

Changelogs
----------
This one is on me, though I will probably procrastinate until all the other todo
items are close to being finished.

Tests
-----
I would really like to have a test that verifies KVM actually installs hugepages,
but I'm ok merging without such a test, mainly because I suspect it will be
annoyingly difficult to end up with a test that isn't flaky.

Beyond that, and the aforementioned memory poisoining, IMO, we have enough test
coverage.  I am always open to more tests, but I don't think adding more coverage
is a must have for merging.

.release_folio and .invalidate_folio versus .evict_inode
--------------------------------------------------------
I think we're good on this one?  IIRC, without a need to "clean" physical memory
(SNP and TDX), we don't need to do anything special.

Mike or Ackerley, am I forgetting anything?

NUMA
----
I am completely comfortable doing nothing as part of the initial merge.  We have
line of sight to supporting NUMA policies in the form of fbind(), and I would be
quite surprised if we get much pushback on implementing fbind().

RSS stats
---------
My preference is to not do anything in the initial implementation, and defer any
changes until later.  IMO, while odd, not capturing guest_memfd() in RSS is
acceptable as there are no virtual mappings to account.  I completely agree that
we would ideally surface the memory usage to userspace in some way, but I don't
think it's so critical that it needs to happen as part of the initial merge.

Intrahost migration support
---------------------------
Ackerley's RFC[*] is enough for me to have confidence that we can support
intrahost migration without having to rework the ABI.  Please holler if you
disagree.

[*] https://lkml.kernel.org/r/cover.1691446946.git.ackerleytng%40google.com