On Sat, Nov 11, 2023 at 9:30 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Sat, Nov 11, 2023 at 08:23:57AM -0800, David Matlack wrote:
> >
> > But now that I think about it, isn't the KVM-based approach useful to
> > the broader community as well? For example, QEMU could use KVM-based
> > demand paging for all KVM-generated accesses to guest memory. This
> > would provide 4K-granular demand paging for _most_ memory accesses.
> > Then for vhost and userspace accesses, QEMU can set up a separate VMA
> > mapping of guest memory and use UserfaultFD. The userspace/vhost
> > accesses would have to be done at huge-page granularity (if using
> > HugeTLB), but most accesses should still come from KVM, so this would
> > be a real improvement over a pure UserfaultFD approach.
>
> I fully understand why you propose that, but it's not the one I prefer.
> That means KVM is leaving other modules behind. :( And that's not even
> the same as the case where KVM wants to resolve hugetlb over 1G, because
> at least we tried :) It's just that the proposal got rejected,
> unfortunately, so far.
>
> IMHO we should still consider virt as a whole community, not KVM
> separately, even if KVM is indeed a separate module.

KVM is not just any module, though. It is the _only_ module that mediates
_guest_ access to host memory. KVM is also a constant: any Linux-based VM
that cares about performance is using KVM. guest_memfd, on the other hand,
is neither unique nor constant. It's just one way to back guest memory.

The way I see it, we are going to end up with one of two outcomes:

 1. VMMs use KVM-based demand paging to mediate guest accesses, and
    UserfaultFD to mediate userspace and vhost accesses.

 2. VMMs use guest_memfd-based demand paging for guest_memfd, and
    UserfaultFD for everything else.

I think there are many advantages to (1) over (2). (1) means that VMMs can
have a common software architecture for post-copy across all memory types,
and any optimizations we implement will apply to _all_ memory types, not
just guest_memfd.

Mediating guest accesses _in KVM_ also has practical benefits. It gives us
more flexibility to solve problems that are specific to virtual machines
and that other parts of the kernel don't care about.

For example, there's value in being able to preemptively mark memory as
present so that guest accesses don't have to notify userspace. During a
Live Migration, at the beginning of post-copy, there might be a large
number of guest pages that are already present and don't need to be
fetched, and that set might be sparse. With KVM mediating access to guest
memory, we can just add a bitmap-based UAPI to KVM to mark memory as
present. Sure, we could technically add a bitmap-based API to guest_memfd,
but that would only solve the problem _for guest_memfd_.
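
To make the bitmap idea concrete, here is a rough sketch of what such an
interface could look like. To be clear, nothing below exists in KVM's UAPI
today; the struct layout, ioctl name, and ioctl number are all invented
purely for illustration:

#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Hypothetical: hand KVM a (possibly sparse) set of guest pages that are
 * already present at the start of post-copy, so guest accesses to them
 * never exit to userspace.
 */
struct kvm_mark_present_bitmap {
        uint64_t base_gfn;   /* first guest frame number the bitmap covers */
        uint64_t nr_pages;   /* number of base pages described by the bitmap */
        uint64_t bitmap;     /* userspace pointer; bit N set => base_gfn + N is present */
};

/* Invented ioctl number, shown only to keep the example self-contained. */
#define KVM_MARK_MEMORY_PRESENT _IOW(0xAE, 0xd0, struct kvm_mark_present_bitmap)

/* Mark everything that was already migrated during pre-copy as present. */
static int kvm_mark_present(int vm_fd, uint64_t base_gfn, uint64_t nr_pages,
                            const unsigned long *bitmap)
{
        struct kvm_mark_present_bitmap arg = {
                .base_gfn = base_gfn,
                .nr_pages = nr_pages,
                .bitmap   = (uint64_t)(uintptr_t)bitmap,
        };

        return ioctl(vm_fd, KVM_MARK_MEMORY_PRESENT, &arg);
}

Whether that ends up as a VM ioctl, a memslot flag, or something else is an
open question; the point is just that the interface falls out naturally
when KVM is the one mediating guest accesses.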

Then there's the bounce-buffering problem. With a guest_memfd-based scheme,
there's no way for userspace to bypass the kernel's notion of what's
present, which means all of guest memory has to be bounce-buffered. (More
on this below.)

And even if we generalize (2) to all memfds, that still doesn't cover all
ways of backing guest memory.

Having KVM-specific UAPIs is also not new. Consider how KVM implements its
own dirty tracking.

And all of that is independent of the short-term HugeTLB benefit for
Google.

>
> So if we're going to propose the new solution to replace userfault, I'd
> rather we add support separately for everything that's at least still
> public, even if it'll take more work, compared to making it kvm-only.

To be clear, it's not a replacement for UserfaultFD. It would work in
conjunction with UserfaultFD.

>
> > And on the more practical side... If we integrate missing-page support
> > directly into guest_memfd, I'm not sure how one part would even work.
> > Userspace would need a way to write to missing pages before marking
> > them present. So we'd need some sort of special flag to mmap() to
> > bypass the missing-page interception? I'm sure it's solvable, but the
> > KVM-based approach does not have this problem.
>
> Userfaults rely on the temp buffer. Take UFFDIO_COPY as an example:
> uffdio_copy.src|len describes that. Then the kernel does the atomicity
> work.

Any solution that requires bounce-buffering (memcpy) is unlikely to be
tenable. The performance implications and the CPU overhead required to
bounce-buffer _every_ page of guest memory during post-copy are too much.
That's why Google maintains a second mapping when using UserfaultFD.

>
> I'm not sure why KVM-based doesn't have that problem. IIUC it'll be the
> same? We can't make the folio present in the gmemfd mapping if it doesn't
> yet contain the full data copied over. Having a special flag for the
> mapping to allow different access permissions for each folio also looks
> fine, but that sounds overcomplicated to me.

Neither UserfaultFD nor KVM-based demand paging has a bounce-buffering
problem, because they mediate a specific _view_ of guest memory, not the
underlying memory itself. I.e. neither mechanism prevents userspace from
creating a separate mapping where it can access guest memory independent
of the "present set", e.g. to RDMA guest pages directly from the source.
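
To illustrate the second-mapping idea (just a sketch, not how any
particular VMM actually implements it): with a shmem-backed memfd and
userfaultfd minor faults, the migration thread writes page contents
through a plain "raw" mapping and then resolves the fault on the
registered mapping with UFFDIO_CONTINUE, with no bounce buffer involved.
Error handling and the fault-event loop are omitted, and the size and
names are placeholders:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        size_t len = 2UL << 20;         /* placeholder guest-memory size */

        /* One memfd backing guest memory, mapped twice. */
        int memfd = memfd_create("guest-mem", 0);
        ftruncate(memfd, len);

        /* "Raw" view: the migration thread writes fetched pages here. */
        char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         memfd, 0);

        /* "Mediated" view: what the guest/vhost/userspace paths actually use. */
        char *guest = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                           memfd, 0);

        /* Register only the mediated view for minor faults. */
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_MINOR_SHMEM,
        };
        ioctl(uffd, UFFDIO_API, &api);

        struct uffdio_register reg = {
                .range = { .start = (unsigned long)guest, .len = len },
                .mode = UFFDIO_REGISTER_MODE_MINOR,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /*
         * Post-copy fetch of one page (assuming 4K base pages): write its
         * contents through the raw view, straight from wherever the data
         * landed (network buffer, RDMA region, ...)...
         */
        memcpy(raw, "page contents fetched from the source", 37);

        /* ...then make it present in the mediated view without copying. */
        struct uffdio_continue cont = {
                .range = { .start = (unsigned long)guest, .len = 4096 },
        };
        ioctl(uffd, UFFDIO_CONTINUE, &cont);

        return 0;
}

The same structure is what makes the RDMA point above work: the raw view is
always accessible, so data can land in guest memory directly, and "present"
is purely a property of the mediated view.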