Re: RFC: A KVM-specific alternative to UserfaultFD

On 11/6/23 21:23, Peter Xu wrote:
> On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
>> Hi Paolo,
>>
>> I'd like your feedback on whether you would merge a KVM-specific
>> alternative to UserfaultFD.

I'm replying to Peter's message because he already brought up some points that I would have made...

>>    (b) UAPIs for marking GFNs present and non-present.

> Similar: this is something bound to the above bitmap design, and not
> needed for uffd.  Extra interface?

We already use the fallocate() API to mark GFNs non-present in guest_memfd, and we also use it to mark GFNs present; but that would not work for an atomic copy-and-allocate. This UAPI could be pwrite() or an ioctl().
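
To make it concrete, from userspace the two directions could look roughly like this; the fallocate() call is the existing guest_memfd UAPI, while the pwrite() for atomic copy-and-allocate is hypothetical:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Mark a GFN range non-present: punch a hole in the guest_memfd file.
 * This is the UAPI that exists today. */
int gmem_mark_absent(int gmem_fd, off_t offset, off_t len)
{
	return fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}

/* Hypothetical: copy the page contents in and mark the range present in
 * a single atomic step, so no vCPU can observe a zero-filled page. */
ssize_t gmem_copy_and_allocate(int gmem_fd, const void *buf, size_t len,
			       off_t offset)
{
	return pwrite(gmem_fd, buf, len, offset);
}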

>>    (c) KVM_RUN support for returning to userspace on guest page faults
>> to non-present GFNs.

> For (1), if the time to resolve a remote page fault is bottlenecked on the
> network, concurrency may not matter a huge deal, IMHO.

That's likely, and it means we could simply extend KVM_EXIT_MEMORY_FAULT. However, we need to be careful not to have a maze of twisty APIs, all different.
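
For example, the vCPU loop would be roughly the following. I'm assuming the flags/gpa/size layout and the return-EFAULT-with-exit-info semantics of the in-flight KVM_EXIT_MEMORY_FAULT patches, and fetch_page_from_source()/handle_other_exit() stand in for VMM code:

#include <errno.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Hypothetical VMM helpers: pull the page over the network and install
 * it, or dispatch the remaining exit reasons. */
extern void fetch_page_from_source(__u64 gpa, __u64 size);
extern void handle_other_exit(struct kvm_run *run);

void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
	for (;;) {
		long ret = ioctl(vcpu_fd, KVM_RUN, 0);

		/* KVM_RUN fails with EFAULT but fills in run->memory_fault,
		 * so look at the exit reason before bailing out. */
		if (ret < 0 && errno == EFAULT &&
		    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
			fetch_page_from_source(run->memory_fault.gpa,
					       run->memory_fault.size);
			continue;
		}
		if (ret < 0) {
			if (errno == EINTR)
				continue;
			break;
		}
		handle_other_exit(run);
	}
}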

>>    (d) A notification mechanism and wait queue to coordinate KVM
>> accesses to non-present GFNs.

> Probably uffd's wait queue to be reimplemented, more or less.
> Is this only used when there's no vcpu thread context?  I remember Anish's
> other proposal on vcpu exits could already achieve something similar
> without the queue.

I think this synchronization can be done mostly in userspace, at least on x86 (just like we got rid of the global VM-level dirty ring). But it remains a problem on Arm.
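
As a sketch of what I mean by userspace-only synchronization (none of this is an existing API): keep a per-GFN state word in memory shared by all VMM processes, have vCPU threads that took a fault exit wait on a futex, and let the post-copy thread wake them once the page is installed:

#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

enum { GFN_ABSENT = 0, GFN_PRESENT = 1 };

static long gfn_futex(atomic_int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* vCPU thread, after a fault exit: sleep until the page is present. */
void wait_for_gfn(atomic_int *state)
{
	while (atomic_load(state) != GFN_PRESENT)
		gfn_futex(state, FUTEX_WAIT, GFN_ABSENT);
}

/* Post-copy thread, after installing the page data. */
void publish_gfn(atomic_int *state)
{
	atomic_store(state, GFN_PRESENT);
	gfn_futex(state, FUTEX_WAKE, INT_MAX);
}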

>>    (e) UAPI or KVM policy for collapsing SPTEs into huge pages as guest
>> memory becomes present.

> This interface will also be needed with userfaultfd, but with uffd
> it'll be a common interface that can be used outside VM context.

And it can be a generic API anyway (could be fadvise).
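
For memory that is mapped into userspace, a generic API already exists in the form of madvise(MADV_COLLAPSE) (Linux 6.1+); an fd-based variant of the same hint, e.g. through fadvise, is what guest_memfd would need:

#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* Linux 6.1+; older headers lack it */
#endif

/* Ask the kernel to collapse a range that has become fully present
 * into huge pages. */
int collapse_range(void *addr, size_t len)
{
	return madvise(addr, len, MADV_COLLAPSE);
}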

>> So why merge a KVM-specific alternative to UserfaultFD?
>>
>> Taking a step back, let's look at what UserfaultFD is actually
>> providing for KVM VMs:
>>
>>    1. Coordination of userspace accesses to guest memory.
>>    2. Coordination of KVM+guest accesses to guest memory.
>>
>> VMMs already need to
>> manually intercept userspace _writes_ to guest memory to implement
>> dirty tracking efficiently. It's a small step beyond that to intercept
>> both reads and writes for post-copy. And VMMs are increasingly
>> multi-process. UserfaultFD provides coordination within a process, but
>> VMMs already need to deal with coordinating across processes.
>> i.e. UserfaultFD is only solving part of the problem for (1.).

This is partly true, but it is missing non-vCPU kernel accesses, and that is what worries me the most if you propose this as a generic mechanism. My gut feeling even before reading everything (and it was confirmed afterwards) was: I am open to merging specific features that close holes in the userfaultfd API, but in general I like the unification between guest, userspace *and kernel* accesses that userfaultfd brings. The fact that it covers the VGIC on Arm is a cherry on top. :)

For things other than guest_memfd, I want to ask Peter & co. whether there could be a variant of userfaultfd that is better integrated with memfd and solves the multi-process VMM issue. For example, a userfaultfd-like mechanism for memfd could handle missing faults from _any_ VMA of the memfd, in any process.
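
For contrast, this is the per-VMA model we have today: each process has to open its own userfaultfd and register its own mapping, and nothing ties the registration to the memfd itself (standard calls, error handling elided):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int register_missing_faults(void *addr, size_t len)
{
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	/* The registration is keyed on this process's VMA, not on the
	 * memfd backing it: every mapping process repeats this dance. */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);
	return uffd;
}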

However, guest_memfd could be a good use case for the mechanism that you suggest. Currently guest_memfd cannot be mapped into userspace page tables, so it cannot be used with userfaultfd at all. Furthermore, because it is only mapped by hypervisor page tables or written via hypervisor APIs, guest_memfd can easily track presence at 4KB granularity even if backed by huge pages. That could be a point in favor of a KVM-specific solution.

Also, even if we envision mmap() support as one of the future extensions of guest_memfd, that does not mean it could be used together with userfaultfd. For example, with restrictedmem-backed or non-struct-page-backed guest_memfd, mmap() would create a VM_PFNMAP area.

Once you have the implementation done for guest_memfd, it is interesting to see how easily it extends to other, userspace-mappable kinds of memory. But I still dislike the fact that you need some kind of extra protocol in userspace, for multi-process VMMs. This is the kind of thing that the kernel is supposed to facilitate. I'd like it to do _more_ of that (see above memfd pseudo-suggestion), not less.

>> All of these are addressed with a KVM-specific solution. A
>> KVM-specific solution can have:
>>
>>    * Transparent support for any backing memory subsystem (tmpfs,
>>      HugeTLB, and even guest_memfd).

> I'm curious how hard it would be to allow guest_memfd to support userfaultfd.
> David, do you know?

Did I answer that above? I suppose you'd need something along the lines of vma_is_shmem() added to vma_can_userfault(); or possibly something added to vm_ops to bridge the differences.
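
Concretely, something like this simplified sketch, where vma_is_guest_memfd() is a hypothetical predicate and the VM_UFFD_MINOR special-casing of the real vma_can_userfault() is omitted:

static inline bool vma_can_userfault(struct vm_area_struct *vma,
				     unsigned long vm_flags)
{
	return vma_is_anonymous(vma) ||
	       is_vm_hugetlb_page(vma) ||
	       vma_is_shmem(vma) ||
	       vma_is_guest_memfd(vma);	/* hypothetical */
}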

> The rest are already supported by uffd so I assume not a major problem.

Userfaultfd is kinda unusable for 1GB pages so I'm not sure I'd include it in the "already works" side, but yeah.

Paolo




