Re: RFC: A KVM-specific alternative to UserfaultFD

On Thu, Nov 09, 2023, David Matlack wrote:
> On Tue, Nov 7, 2023 at 2:29 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > On Tue, Nov 07, 2023 at 05:25:06PM +0100, Paolo Bonzini wrote:
> > > On 11/6/23 21:23, Peter Xu wrote:
> > > > On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
> > > >
> > >
> > > Once you have the implementation done for guest_memfd, it is interesting to
> > > see how easily it extends to other, userspace-mappable kinds of memory.  But
> > > I still dislike the fact that you need some kind of extra protocol in
> > > userspace, for multi-process VMMs.  This is the kind of thing that the
> > > kernel is supposed to facilitate.  I'd like it to do _more_ of that (see
> > > above memfd pseudo-suggestion), not less.
> >
> > Is that our future plan to extend gmemfd to normal memories?
> >
> > I see that gmemfd manages folio on its own.  I think it'll make perfect
> > sense if it's for use in CoCo context, where the memory is so special to be
> > generic anyway.
> >
> > However if to extend it to generic memories, I'm wondering how do we
> > support existing memory features of such memory which already exist with
> > KVM_SET_USER_MEMORY_REGION v1.  To name some:
> >
> >   - numa awareness

The plan is to add fbind() to mirror mbind().
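Very roughly, and purely illustrative since fbind() doesn't exist yet: the hypothetical
call below just mirrors the real mbind(2) signature, swapping the VMA range for an
fd+offset.

#include <numaif.h>

/* Today: NUMA policy for VMA-backed guest memory via mbind(2). */
static int bind_vma_to_node(void *addr, unsigned long len, int node)
{
	unsigned long nodemask = 1UL << node;

	return mbind(addr, len, MPOL_BIND, &nodemask,
		     8 * sizeof(nodemask), 0);
}

/*
 * Hypothetical fbind() mirroring mbind(), but operating on a file
 * (e.g. a guest_memfd) range instead of a mapping.  Illustrative
 * only, no such syscall exists yet:
 *
 *	fbind(gmem_fd, offset, len, MPOL_BIND, &nodemask,
 *	      8 * sizeof(nodemask), 0);
 */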

> >   - swapping
> >   - cgroup

Accounting is already supported.  Fine-grained reclaim will likely never be
supported (see below re: swap).

> >   - punch hole (in a huge page, aka, thp split)

Already works.  What doesn't work is reconstituting a hugepage, but like swap,
I think that's something KVM should deliberately not support.
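E.g. punching a hole is just fallocate() on the guest_memfd fd.  Sketch only, error
handling omitted; gmem_fd is assumed to come from KVM_CREATE_GUEST_MEMFD:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <linux/falloc.h>

/*
 * Punch a hole in a guest_memfd range, discarding the backing pages.
 * KEEP_SIZE is required; guest_memfd doesn't support shrinking the file.
 */
static int gmem_punch_hole(int gmem_fd, off_t offset, off_t len)
{
	return fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}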

> >   - cma allocations for huge pages / page migrations

I suspect the direction guest_memfd will take will be to support a dedicated pool
of memory, a la hugetlbfs reservations.

> >   - ...
> 
> Sean has stated that he doesn't want guest_memfd to support swap. So I
> don't think guest_memfd will one day replace all guest memory
> use-cases. That also means that my idea to extend my proposal to
> guest_memfd VMAs has limited value. VMs that do not use guest_memfd
> would not be able to use it.

Yep.  This is a hill I'm extremely willing to die on.  I feel very, very strongly
that we should put a stake in the ground regarding swap and other traditional memory
management stuff.  The intent of guest_memfd is that it's a vehicle for supporting
use cases that don't fit into generic memory subsystems, e.g. CoCo VMs, and/or where
making guest memory inaccessible by default adds a lot of value at minimal cost.

guest_memfd isn't intended to be a wholesale replacement of VMA-based memory.
IMO, use cases that want to dynamically manage guest memory should be firmly
out-of-scope for guest_memfd.

> Paolo, it sounds like overall my proposal has limited value outside of
> GCE's use-case. And even if it landed upstream, it would bifurcate KVM
> VM post-copy support. So I think it's probably not worth pursuing
> further. Do you think that's a fair assessment? Getting a clear NACK
> on pushing this proposal upstream would be a nice outcome here since
> it helps inform our next steps.
> 
> That being said, we still don't have an upstream solution for 1G
> post-copy, which James pointed out is really the core issue. But there
> are other avenues we can explore in that direction such as cleaning up
> HugeTLB (very nebulous) or adding 1G+mmap()+userfaultfd support to
> guest_memfd. The latter seems promising.

mmap()+userfaultfd is the answer for userspace and vhost, but it is most definitely
not the answer for guest_memfd within KVM.  The main selling point of guest_memfd
is that it doesn't require mapping the memory into userspace, i.e. userfaultfd
can't be the answer for KVM accesses unless we bastardize the entire concept of
guest_memfd.
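For the userspace/vhost side it's the standard userfaultfd flow, e.g. a bare-bones
sketch of registering an existing mapping for MISSING faults (error cleanup elided):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Register [addr, addr + len) for MISSING faults, return the uffd. */
static int register_uffd_missing(void *addr, unsigned long len)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg))
		return -1;

	/*
	 * Faults on the range now show up as events on uffd; resolve
	 * them with UFFDIO_COPY (or UFFDIO_CONTINUE for minor faults).
	 */
	return uffd;
}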

And as I've proposed internally, the other thing related to live migration that I
think KVM should support is the ability to performantly and non-destructively freeze
guest memory, e.g. to allow blocking KVM accesses to guest memory during blackout
without requiring userspace to destroy memslots to harden against memory corruption
due to KVM writing guest memory after userspace has taken the final snapshot of the
dirty bitmap.

For both cases, KVM will need choke points on all accesses to guest memory.  Once
the choke points exist and we have signed up to maintain them, the extra burden of
gracefully handling "missing" memory versus frozen memory should be relatively
small, e.g. it'll mainly be the notify-and-wait uAPI.
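To be clear on what I mean by a choke point, a purely hypothetical sketch (none of
these helpers or fields exist in KVM today); every KVM access to guest memory would
funnel through something like this before touching the page:

/*
 * Hypothetical choke point: block until the gfn is neither frozen nor
 * missing.  kvm_gmem_gfn_is_ready(), kvm_gmem_notify_userspace() and
 * kvm->gmem_waitq are all made up for illustration.
 */
static int kvm_gmem_wait_for_access(struct kvm *kvm, gfn_t gfn)
{
	/* Fast path: memory is present and not frozen. */
	if (likely(kvm_gmem_gfn_is_ready(kvm, gfn)))
		return 0;

	/* Notify userspace (missing page or frozen range)... */
	kvm_gmem_notify_userspace(kvm, gfn);

	/* ...and wait for userspace to resolve the fault or thaw the range. */
	return wait_event_interruptible(kvm->gmem_waitq,
					kvm_gmem_gfn_is_ready(kvm, gfn));
}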

Don't get me wrong, I think Google's demand paging implementation should die a slow,
horrible death.   But I don't think userfaultfd is the answer for guest_memfd.




