On Thu, Nov 9, 2023 at 11:20 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> On Thu, Nov 09, 2023 at 09:58:49AM -0800, Sean Christopherson wrote:
> >
> > For both cases, KVM will need choke points on all accesses to guest memory.  Once
> > the choke points exist and we have signed up to maintain them, the extra burden of
> > gracefully handling "missing" memory versus frozen memory should be relatively
> > small, e.g. it'll mainly be the notify-and-wait uAPI.
> >
> > Don't get me wrong, I think Google's demand paging implementation should die a slow,
> > horrible death.  But I don't think userfaultfd is the answer for guest_memfd.
>
> As I replied in the other thread, I see possibility implementing
> userfaultfd on gmemfd, especially after I know your plan now treating
> user/kernel the same way.
>
> But I don't know whether I could have missed something here and there, and
> I'd like to read the problem first on above to understand the relationship
> between that "freeze guest mem" idea and the demand paging scheme.
>
> One thing I'd agree is we don't necessarily need to squash userfaultfd into
> gmemfd support of demand paging.  If gmemfd will only be used in KVM
> context then indeed it at least won't make a major difference; but still
> good if the messaging framework can be leveraged, meanwhile userspace that
> already support userfaultfd can cooperate with gmemfd much easier.
>
> In general, a major part of userfaultfd is really a messaging interface for
> faults to me.  A fault trap mechanism will be needed anyway for gmemfd,
> AFAIU.  When that comes maybe we can have a clearer mind of what's next.

The idea to re-use userfaultfd as a notification mechanism is really
interesting. I'm almost certain that guest page faults on missing pages
can re-use the KVM_CAP_EXIT_ON_MISSING UAPI that Anish is adding for
UFFD [1]. So that part would be the same between VMA-based UserfaultFD
and KVM/guest_memfd-based demand paging.
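To make the exit-on-missing flow concrete, here's a rough sketch of the
userspace side. To be clear, the struct and helper below are simplified
stand-ins I made up for illustration; they are not the real <linux/kvm.h>
UAPI from Anish's series. Only the general shape is assumed: the vCPU
exits with a faulting GPA range, and the VMM must resolve that range
before re-entering the vCPU.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Simplified stand-in for the memory-fault exit info (not real uapi). */
struct memory_fault_info {
	uint64_t gpa;	/* guest-physical address that faulted */
	uint64_t size;	/* length of the faulting access in bytes */
};

/*
 * Convert the faulting GPA range into the [start, end) range of guest
 * frame numbers the VMM would need to fetch (e.g. from the migration
 * source) and mark present before re-entering the vCPU.
 */
static void fault_to_gfn_range(const struct memory_fault_info *info,
			       uint64_t *start_gfn, uint64_t *end_gfn)
{
	*start_gfn = info->gpa >> PAGE_SHIFT;
	*end_gfn = (info->gpa + info->size + PAGE_SIZE - 1) >> PAGE_SHIFT;
}
```

The point is just that the VMM's fault-resolution logic keys off a
(gpa, size) pair, so it is independent of whether the fault was trapped
via a VMA or via guest_memfd.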
And for the blocking notification in KVM, re-using the userfaultfd file
descriptor seems like a neat idea. We could have a KVM ioctl to register
the fd with KVM, and then KVM can notify userspace through it whenever
KVM needs to block on a missing page. The uffd_msg struct could be
extended with a new "gfn" or "guest_memfd" type of fault info. I'm not
quite sure how the wait-queuing will work, but I'm sure it's solvable.
With these two together, the UAPI for notifying userspace would be the
same for UserfaultFD and KVM.

As for whether to integrate the "missing" page support in guest_memfd or
KVM... I'm obviously partial to the latter because then Google can also
use it for HugeTLB. But now that I think about it, isn't the KVM-based
approach useful to the broader community as well?

For example, QEMU could use the KVM-based demand paging for all
KVM-generated accesses to guest memory. This would provide 4K-granular
demand paging for _most_ memory accesses. Then for vhost and userspace
accesses, QEMU can set up a separate VMA mapping of guest memory and use
UserfaultFD. The userspace/vhost accesses would have to be done at the
huge page size granularity (if using HugeTLB). But most accesses should
still come from KVM, so this would be a real improvement over a pure
UserfaultFD approach.

And on the more practical side... If we integrate missing page support
directly into guest_memfd, I'm not sure how one part would even work:
userspace needs a way to write the contents of a missing page before
marking it present. So we'd need some sort of special flag to mmap() to
bypass the missing-page interception? I'm sure it's solvable, but the
KVM-based approach does not have this problem.

[1] https://lore.kernel.org/kvm/20231109210325.3806151-1-amoorthy@xxxxxxxxxx/
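For the record, here is the kind of uffd_msg extension I have in mind.
This is purely a hypothetical sketch: UFFD_EVENT_GUEST_MEMFD, its value,
and the "guest_memfd" member do not exist in the real uapi, and the
layout below is a simplified stand-in for struct uffd_msg rather than
the actual <linux/userfaultfd.h> definition.

```c
#include <assert.h>
#include <stdint.h>

#define UFFD_EVENT_GUEST_MEMFD	0x17	/* hypothetical event code */

/* Simplified stand-in for struct uffd_msg (not the real uapi layout). */
struct uffd_msg_sketch {
	uint8_t event;
	uint8_t reserved[7];
	union {
		struct {
			uint64_t flags;
			uint64_t address;	/* existing VMA-based fault record */
		} pagefault;
		struct {
			uint64_t gfn;	/* faulting guest frame number */
			uint32_t slot;	/* memslot, to locate the gmem fd */
			uint32_t flags;
		} guest_memfd;	/* hypothetical new fault record */
	} arg;
};
```

A new KVM ioctl (name TBD) would register the userfaultfd with the VM,
and KVM would queue a message like the above instead of (or before)
blocking on the missing page.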