Re: RFC: A KVM-specific alternative to UserfaultFD

On Tue, Nov 7, 2023 at 8:25 AM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> On 11/6/23 21:23, Peter Xu wrote:
> > On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
> >>
> >> So why merge a KVM-specific alternative to UserfaultFD?
> >>
> >> Taking a step back, let's look at what UserfaultFD is actually
> >> providing for KVM VMs:
> >>
> >>    1. Coordination of userspace accesses to guest memory.
> >>    2. Coordination of KVM+guest accesses to guest memory.
> >>
> >> VMMs already need to
> >> manually intercept userspace _writes_ to guest memory to implement
> >> dirty tracking efficiently. It's a small step beyond that to intercept
> >> both reads and writes for post-copy. And VMMs are increasingly
> >> multi-process. UserfaultFD provides coordination within a process but
> >> VMMs already need to deal with coordinating across processes.
> >> i.e. UserfaultFD is only solving part of the problem for (1.).
>
> This is partly true but it is missing non-vCPU kernel accesses, and it's
> what worries me the most if you propose this as a generic mechanism.

Non-vCPU accesses in KVM could still be handled with my proposal. But
I agree that non-KVM kernel accesses are a gap.
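
(To make the userspace-side interception I described above a bit more
concrete: a VMM that already routes its own accesses to guest memory
through helpers for dirty tracking can extend the same helpers to
consult a present bitmap for post-copy. The sketch below is purely
illustrative and compiles as a standalone userspace program; none of
the names correspond to an existing VMM or KVM API.)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NPAGES    1024
#define PAGE_SIZE 4096

static uint8_t guest_ram[NPAGES * PAGE_SIZE];
static unsigned long dirty[NPAGES / 64];    /* set on VMM writes (dirty tracking) */
static unsigned long present[NPAGES / 64];  /* post-copy: page contents have arrived */

static bool test_bit(const unsigned long *map, uint64_t n)
{
	return map[n / 64] & (1UL << (n % 64));
}

static void set_bit(unsigned long *map, uint64_t n)
{
	map[n / 64] |= 1UL << (n % 64);
}

/* Stand-in for "fetch the page from the migration source and wait for it". */
static void demand_fetch(uint64_t gfn)
{
	set_bit(present, gfn);
}

/* All of the VMM's own accesses to guest memory go through helpers like these. */
static void guest_write(uint64_t gpa, const void *src, size_t len)
{
	uint64_t gfn = gpa / PAGE_SIZE;

	if (!test_bit(present, gfn))   /* post-copy: pull the page in first */
		demand_fetch(gfn);
	set_bit(dirty, gfn);           /* dirty tracking */
	memcpy(&guest_ram[gpa], src, len);
}

static void guest_read(uint64_t gpa, void *dst, size_t len)
{
	uint64_t gfn = gpa / PAGE_SIZE;

	if (!test_bit(present, gfn))
		demand_fetch(gfn);
	memcpy(dst, &guest_ram[gpa], len);
}

int main(void)
{
	uint32_t val = 42, out = 0;

	guest_write(0x1000, &val, sizeof(val));
	guest_read(0x1000, &out, sizeof(out));
	return out == 42 ? 0 : 1;
}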

>  My
> gut feeling even without reading everything was (and it was confirmed
> after): I am open to merging some specific features that close holes in
> the userfaultfd API, but in general I like the unification between
> guest, userspace *and kernel* accesses that userfaultfd brings. The fact
> that it includes VGIC on Arm is a cherry on top. :)

Can you explain how VGIC interacts with UFFD? I'd like to understand
if/how that could work with a KVM-specific solution.

>
> For things other than guest_memfd, I want to ask Peter & co. if there
> could be a variant of userfaultfd that is better integrated with memfd,
> and solves the multi-process VMM issue.  For example, maybe a
> userfaultfd-like mechanism for memfd could handle missing faults from
> _any_ VMA for the memfd.
>
> However, guest_memfd could be a good use case for the mechanism that you
> suggest.  Currently guest_memfd cannot be mapped into userspace.  As
> such it cannot be used with userfaultfd.  Furthermore, because it is
> only mapped by hypervisor page tables, or written via hypervisor APIs,
> guest_memfd can easily track presence at 4KB granularity even if backed
> by huge pages.  That could be a point in favor of a KVM-specific solution.
>
> Also, even if we envision mmap() support as one of the future extensions
> of guest_memfd, that does not mean you can use it together with
> userfaultfd.  For example, if we had restrictedmem-backed guest_memfd,
> or non-struct-page-backed guest_memfd, mmap() would be creating a
> VM_PFNMAP area.
>
> Once you have the implementation done for guest_memfd, it is interesting
> to see how easily it extends to other, userspace-mappable kinds of
> memory.  But I still dislike the fact that you need some kind of extra
> protocol in userspace, for multi-process VMMs.  This is the kind of
> thing that the kernel is supposed to facilitate.  I'd like it to do
> _more_ of that (see above memfd pseudo-suggestion), not less.

I was also thinking guest_memfd could be an avenue to solve the
multi-process issue, though a little differently from the way you
described it (because I still want to find an upstream solution for
HugeTLB-backed VMs, if possible).

What I was thinking was that my proposal could be extended to
guest_memfd VMAs. Under my proposal, all KVM and guest accesses would
be guaranteed to go through the VM's present bitmaps, but accesses
through VMAs would not be. With guest_memfd, once we add mmap()
support, we have access to the struct kvm both when mmap() is called
and when handling page faults on the guest_memfd VMA. So it would be
possible for guest_memfd to consult the present bitmap when handling a
fault, notify userspace about non-present pages, and wait for the
pages to become present. That would let us funnel all VMA-based
accesses (multi-process and non-KVM kernel accesses) through a single
notification mechanism, i.e. it would solve the multi-process issue
and unify guest, kernel, and userspace accesses, but only for
guest_memfd.
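
To sketch the ordering I have in mind (this is just a userspace model,
not kernel code; the toy_vm structure, the names, and the
pthread-based wait are all stand-ins for what would really be a
guest_memfd fault handler plus a notification interface that does not
exist today):

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define GFN_COUNT 1024   /* toy VM with 1024 guest pages */

struct toy_vm {
	unsigned long present[GFN_COUNT / 64];
	pthread_mutex_t lock;
	pthread_cond_t made_present;   /* wakes waiters when a page arrives */
};

static bool vm_present(struct toy_vm *vm, uint64_t gfn)
{
	return vm->present[gfn / 64] & (1UL << (gfn % 64));
}

/* Fault path: what a guest_memfd fault handler would do, modeled in userspace. */
static void vm_wait_for_page(struct toy_vm *vm, uint64_t gfn)
{
	pthread_mutex_lock(&vm->lock);
	while (!vm_present(vm, gfn)) {
		/*
		 * In the real proposal this would be a notification to the
		 * VMM (e.g. via an fd it polls) followed by a sleep in the
		 * kernel until the page is marked present.
		 */
		printf("notify VMM: gfn %llu is not present\n",
		       (unsigned long long)gfn);
		pthread_cond_wait(&vm->made_present, &vm->lock);
	}
	pthread_mutex_unlock(&vm->lock);
}

/* VMM path: after fetching the page from the source, mark it present. */
static void vm_mark_present(struct toy_vm *vm, uint64_t gfn)
{
	pthread_mutex_lock(&vm->lock);
	vm->present[gfn / 64] |= 1UL << (gfn % 64);
	pthread_cond_broadcast(&vm->made_present);
	pthread_mutex_unlock(&vm->lock);
}

static struct toy_vm vm = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.made_present = PTHREAD_COND_INITIALIZER,
};

static void *faulting_access(void *arg)
{
	(void)arg;
	vm_wait_for_page(&vm, 42);   /* blocks until gfn 42 is marked present */
	puts("access to gfn 42 can proceed");
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, faulting_access, NULL);
	vm_mark_present(&vm, 42);    /* the VMM "delivers" the page */
	pthread_join(&t, NULL);
	return 0;
}

The important property is just the ordering: no access to a
non-present page completes until the VMM has been notified and has
marked the page present.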

So in the short term we could provide a partial solution for
HugeTLB-backed VMs (at least unblocking Google's use case), and in the
long term there is a line of sight to a unified solution.




