On 11/6/23 21:23, Peter Xu wrote:
> On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
>> Hi Paolo,
>>
>> I'd like your feedback on whether you would merge a KVM-specific
>> alternative to UserfaultFD.
I'm replying to Peter's message because he already brought up some
points that I'd have made...
>> (b) UAPIs for marking GFNs present and non-present.
> Similarly, this is something tied to the above bitmap design, and not
> needed for uffd. Extra interface?
We already use the fallocate() APIs to mark GFNs non-present in
guest_memfd; we also use them to mark GFNs present, but that would not
work for an atomic copy-and-allocate. This UAPI could be pwrite() or an
ioctl().
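
To make this concrete, the two paths could look as follows. This is
only a sketch: gmem_mark_nonpresent() and gmem_copy_and_allocate() are
made-up names, and pwrite() on guest_memfd is an assumption, not an
existing UAPI.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Mark [offset, offset + len) non-present: this is the existing
 * guest_memfd deallocation path. */
static int gmem_mark_nonpresent(int gmem_fd, off_t offset, off_t len)
{
        return fallocate(gmem_fd,
                         FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, len);
}

/* Hypothetical atomic copy-and-allocate: a pwrite() into a hole would
 * allocate the page and fill it in one step, so no other accessor
 * could observe a zeroed page in between. */
static ssize_t gmem_copy_and_allocate(int gmem_fd, const void *src,
                                      size_t len, off_t offset)
{
        return pwrite(gmem_fd, src, len, offset);
}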
>> (c) KVM_RUN support for returning to userspace on guest page faults
>> to non-present GFNs.
> For (1), if the time to resolve a remote page fault is bottlenecked on
> the network, concurrency may not matter a huge deal, IMHO.
That's likely, and it means we could simply extend
KVM_EXIT_MEMORY_FAULT. However, we need to be careful not to have a
maze of twisty APIs, all different.
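
For reference, the vCPU side might look roughly like this, where
handle_remote_fault() is a made-up VMM helper that fetches the range
from the source and marks it present. Note that KVM_EXIT_MEMORY_FAULT
is reported together with a -1/EFAULT return from KVM_RUN, not a 0
return:

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

extern void handle_remote_fault(__u64 gpa, __u64 size);

static void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                int ret = ioctl(vcpu_fd, KVM_RUN, NULL);

                /* A fault on a non-present GFN comes back as an
                 * "error" exit, with the details in
                 * run->memory_fault. */
                if (ret < 0 && errno == EFAULT &&
                    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
                        handle_remote_fault(run->memory_fault.gpa,
                                            run->memory_fault.size);
                        continue;
                }

                /* ... handle the other exit reasons ... */
        }
}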
>> (d) A notification mechanism and wait queue to coordinate KVM
>> accesses to non-present GFNs.
> Probably uffd's wait queue would have to be reimplemented, more or
> less.
>
> Is this only used when there's no vcpu thread context? I remember
> Anish's other proposal on vcpu exits can already achieve something
> similar without the queue.
I think this synchronization can be done mostly in userspace, at least
on x86 (just like we got rid of the global VM-level dirty ring). But it
remains a problem on Arm.
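
Roughly, I have in mind something like the following. Everything here
is made up for illustration (ensure_present(),
fetch_page_from_source(), the per-GFN arrays), and it assumes every
faulting access does return to userspace:

#include <pthread.h>
#include <stdbool.h>

extern bool *present;    /* one flag per GFN, protected by the lock */
extern bool *fetching;
extern void fetch_page_from_source(unsigned long gfn);

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t made_present = PTHREAD_COND_INITIALIZER;

static void ensure_present(unsigned long gfn)
{
        pthread_mutex_lock(&lock);
        if (!present[gfn] && !fetching[gfn]) {
                /* This thread won the race and fetches the page. */
                fetching[gfn] = true;
                pthread_mutex_unlock(&lock);
                fetch_page_from_source(gfn);
                pthread_mutex_lock(&lock);
                present[gfn] = true;
                pthread_cond_broadcast(&made_present);
        }
        /* Everyone else just waits for the fetching thread. */
        while (!present[gfn])
                pthread_cond_wait(&made_present, &lock);
        pthread_mutex_unlock(&lock);
}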
>> (e) UAPI or KVM policy for collapsing SPTEs into huge pages as guest
>> memory becomes present.
> This interface will also be needed with userfaultfd, but with uffd
> it'll be a common interface that can be used outside a VM context.
And it can be a generic API anyway (could be fadvise).
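
For mapped memory there is already madvise(MADV_COLLAPSE) since Linux
6.1, and a file-ranged analogue could be modeled on it. To be clear,
POSIX_FADV_COLLAPSE below does not exist; it is only what such an API
might look like:

#include <sys/mman.h>
#include <fcntl.h>

static void collapse_into_hugepages(int gmem_fd, void *addr,
                                    size_t len, off_t offset)
{
        /* Existing interface, for memory that has a user mapping. */
        madvise(addr, len, MADV_COLLAPSE);

        /* Hypothetical equivalent for guest_memfd, which has none. */
        posix_fadvise(gmem_fd, offset, len, POSIX_FADV_COLLAPSE);
}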
>> So why merge a KVM-specific alternative to UserfaultFD?
>>
>> Taking a step back, let's look at what UserfaultFD is actually
>> providing for KVM VMs:
>>
>> 1. Coordination of userspace accesses to guest memory.
>> 2. Coordination of KVM+guest accesses to guest memory.
>>
>> VMMs already need to manually intercept userspace _writes_ to guest
>> memory to implement dirty tracking efficiently. It's a small step
>> beyond that to intercept both reads and writes for post-copy. And
>> VMMs are increasingly multi-process. UserfaultFD provides
>> coordination within a process, but VMMs already need to coordinate
>> across processes. i.e. UserfaultFD is only solving part of the
>> problem for (1.).
This is partly true, but it is missing non-vCPU kernel accesses, and
that's what worries me the most if you propose this as a generic
mechanism. My gut feeling even without reading everything was (and it
was confirmed after reading): I am open to merging some specific
features that close holes in the userfaultfd API, but in general I like
the unification between guest, userspace *and kernel* accesses that
userfaultfd brings. The fact that it covers the VGIC on Arm is a cherry
on top. :)
For things other than guest_memfd, I want to ask Peter & co. whether
there could be a variant of userfaultfd that is better integrated with
memfd and solves the multi-process VMM issue. For example, a
userfaultfd-like mechanism for memfd could handle missing faults from
_any_ VMA for the memfd.
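
Something like this, where UFFDIO_REGISTER_MEMFD and struct
uffdio_register_memfd are purely hypothetical. The point is that the
registration is against the file, not a VMA, so missing faults from any
mapping of the memfd, in any process, would reach a single uffd:

/* Hypothetical uAPI sketch; none of this exists today. */
struct uffdio_register_memfd {
        __s32 fd;       /* the memfd to watch */
        __u64 offset;   /* registered range of the file */
        __u64 len;
        __u64 mode;     /* e.g. UFFDIO_REGISTER_MODE_MISSING */
};

/* usage: ioctl(uffd, UFFDIO_REGISTER_MEMFD, &reg); */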
However, guest_memfd could be a good use case for the mechanism that
you suggest. Currently guest_memfd cannot be mapped into userspace page
tables, so it cannot be used with userfaultfd at all. Furthermore,
because it is only mapped by hypervisor page tables, or written via
hypervisor APIs, guest_memfd can easily track presence at 4KB
granularity even if backed by huge pages. That could be a point in
favor of a KVM-specific solution.
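
That is, presence can be a per-file bitmap indexed by 4KiB block and
consulted on the KVM fault path, so the tracking granularity is
independent of the backing page size. A sketch with made-up names:

struct gmem_state {
        unsigned long *present; /* one bit per 4KiB block of the file */
};

static bool gmem_block_present(struct gmem_state *state, pgoff_t index)
{
        return test_bit(index, state->present);
}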
Also, even if we envision mmap() support as one of the future extensions
of guest_memfd, that does not mean you can use it together with
userfaultfd. For example, if we had restrictedmem-backed guest_memfd,
or non-struct-page-backed guest_memfd, mmap() would be creating a
VM_PFNMAP area.
Once you have the implementation done for guest_memfd, it is interesting
to see how easily it extends to other, userspace-mappable kinds of
memory. But I still dislike the fact that you need some kind of extra
protocol in userspace, for multi-process VMMs. This is the kind of
thing that the kernel is supposed to facilitate. I'd like it to do
_more_ of that (see above memfd pseudo-suggestion), not less.
>> All of these are addressed with a KVM-specific solution. A
>> KVM-specific solution can have:
>>
>> * Transparent support for any backing memory subsystem (tmpfs,
>>   HugeTLB, and even guest_memfd).
> I'm curious how hard it would be to allow guest_memfd to support
> userfaultfd. David, do you know?
Did I answer above? I suppose you'd have something along the lines of
vma_is_shmem() added to vma_can_userfault(), or possibly something
added to vm_ops to bridge the differences.
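
Something along these lines, where vma_is_guest_memfd() is made up, the
real vma_can_userfault() has a bit more logic than shown here, and
guest_memfd would of course first need to support a user mapping at
all:

static inline bool vma_can_userfault(struct vm_area_struct *vma,
                                     unsigned long vm_flags)
{
        return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
               vma_is_shmem(vma) || vma_is_guest_memfd(vma);
}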
> The rest are already supported by uffd so I assume not a major
> problem.
Userfaultfd is kinda unusable for 1GB pages so I'm not sure I'd include
it in the "already works" side, but yeah.
Paolo