On Sat, Nov 11, 2023 at 9:30 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Sat, Nov 11, 2023 at 08:23:57AM -0800, David Matlack wrote:
> >
> > But now that I think about it, isn't the KVM-based approach useful to
> > the broader community as well? For example, QEMU could use KVM-based
> > demand paging for all KVM-generated accesses to guest memory. This
> > would provide 4K-granular demand paging for _most_ memory accesses.
> > Then for vhost and userspace accesses, QEMU can set up a separate VMA
> > mapping of guest memory and use UserfaultFD. The userspace/vhost
> > accesses would have to be done at huge-page granularity (if using
> > HugeTLB), but most accesses should still come from KVM, so this would
> > be a real improvement over a pure UserfaultFD approach.
>
> I fully understand why you propose that, but it's not the one I prefer.
> That means KVM is leaving other modules behind. :( And that's not even
> the same as the case where KVM wants to resolve hugetlb over 1G, because
> at least we tried :) It's just that the proposal got rejected,
> unfortunately, so far.
>
> IMHO we should still consider virt as a whole community, not KVM
> separately, even if KVM is indeed a separate module.

KVM is not just any module, though. It is the _only_ module that mediates
_guest_ access to host memory. KVM is also a constant: any Linux-based VM
that cares about performance is using KVM. guest_memfd, on the other hand,
is neither unique nor constant. It's just one way to back guest memory.

The way I see it, we are going to end up with one of two outcomes:

 1. VMMs use KVM-based demand paging to mediate guest accesses, and
    UserfaultFD to mediate userspace and vhost accesses.

 2. VMMs use guest_memfd-based demand paging for guest_memfd, and
    UserfaultFD for everything else.

I think there are many advantages to (1) over (2). (1) means that VMMs can
have a common software architecture for post-copy across all memory types,
and any optimizations we implement will apply to _all_ memory types, not
just guest_memfd.

Mediating guest accesses _in KVM_ also has practical benefits. It gives us
more flexibility to solve problems that are specific to virtual machines
and that other parts of the kernel don't care about.

For example, there's value in being able to preemptively mark memory as
present so that guest accesses don't have to notify userspace. During a
Live Migration, at the beginning of post-copy, there might be a large
number of guest pages that are already present and don't need to be
fetched, and that set might be sparse. With KVM mediating access to guest
memory, we can just add a bitmap-based UAPI to KVM to mark memory as
present. Sure, we could technically add a bitmap-based API to guest_memfd,
but that would only solve the problem _for guest_memfd_.
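
To make the bitmap idea concrete, here is a rough sketch of what such an
interface could look like. To be clear, nothing below exists in KVM's UAPI
today; the struct layout, ioctl name, and ioctl number are all invented
purely for illustration:

#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Hypothetical: hand KVM a (possibly sparse) set of guest pages that are
 * already present at the start of post-copy, so guest accesses to them
 * never exit to userspace.
 */
struct kvm_mark_present_bitmap {
        uint64_t base_gfn;   /* first guest frame number the bitmap covers */
        uint64_t nr_pages;   /* number of base pages described by the bitmap */
        uint64_t bitmap;     /* userspace pointer; bit N set => base_gfn + N is present */
};

/* Invented ioctl number, shown only to keep the example self-contained. */
#define KVM_MARK_MEMORY_PRESENT _IOW(0xAE, 0xd0, struct kvm_mark_present_bitmap)

/* Mark everything that was already migrated during pre-copy as present. */
static int kvm_mark_present(int vm_fd, uint64_t base_gfn, uint64_t nr_pages,
                            const unsigned long *bitmap)
{
        struct kvm_mark_present_bitmap arg = {
                .base_gfn = base_gfn,
                .nr_pages = nr_pages,
                .bitmap   = (uint64_t)(uintptr_t)bitmap,
        };

        return ioctl(vm_fd, KVM_MARK_MEMORY_PRESENT, &arg);
}

Whether that ends up as a VM ioctl, a memslot flag, or something else is an
open question; the point is just that the interface falls out naturally
when KVM is the one mediating guest accesses.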

Then there's the bounce-buffering problem. With a guest_memfd-based scheme,
there's no way for userspace to bypass the kernel's notion of what's
present, which means all of guest memory has to be bounce-buffered. (More
on this below.)

And even if we generalize (2) to all memfds, that still doesn't cover all
ways of backing guest memory.

Having KVM-specific UAPIs is also not new. Consider how KVM implements its
own dirty tracking.

And all of that is independent of the short-term HugeTLB benefit for
Google.

>
> So if we're going to propose the new solution to replace userfault, I'd
> rather we add support separately for everything that's at least still
> public, even if it'll take more work, compared to making it kvm-only.

To be clear, it's not a replacement for UserfaultFD. It would work in
conjunction with UserfaultFD.

>
> > And on the more practical side... If we integrate missing-page support
> > directly into guest_memfd, I'm not sure how one part would even work.
> > Userspace would need a way to write to missing pages before marking
> > them present. So we'd need some sort of special flag to mmap() to
> > bypass the missing-page interception? I'm sure it's solvable, but the
> > KVM-based approach does not have this problem.
>
> Userfaults rely on the temp buffer. Take UFFDIO_COPY as an example:
> uffdio_copy.src|len describes that. Then the kernel does the atomicity
> work.

Any solution that requires bounce-buffering (memcpy) is unlikely to be
tenable. The performance implications and the CPU overhead required to
bounce-buffer _every_ page of guest memory during post-copy are too much.
That's why Google maintains a second mapping when using UserfaultFD.

>
> I'm not sure why KVM-based doesn't have that problem. IIUC it'll be the
> same? We can't make the folio present in the gmemfd mapping if it doesn't
> yet contain the full data copied over. Having a special flag for the
> mapping to allow different access permissions for each folio also looks
> fine, but that sounds overcomplicated to me.

Neither UserfaultFD nor KVM-based demand paging has a bounce-buffering
problem, because they mediate a specific _view_ of guest memory, not the
underlying memory itself. I.e. neither mechanism prevents userspace from
creating a separate mapping where it can access guest memory independent
of the "present set", e.g. to RDMA guest pages directly from the source.
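
To illustrate the second-mapping idea (just a sketch, not how any
particular VMM actually implements it): with a shmem-backed memfd and
userfaultfd minor faults, the migration thread writes page contents
through a plain "raw" mapping and then resolves the fault on the
registered mapping with UFFDIO_CONTINUE, with no bounce buffer involved.
Error handling and the fault-event loop are omitted, and the size and
names are placeholders:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        size_t len = 2UL << 20;         /* placeholder guest-memory size */

        /* One memfd backing guest memory, mapped twice. */
        int memfd = memfd_create("guest-mem", 0);
        ftruncate(memfd, len);

        /* "Raw" view: the migration thread writes fetched pages here. */
        char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         memfd, 0);

        /* "Mediated" view: what the guest/vhost/userspace paths actually use. */
        char *guest = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                           memfd, 0);

        /* Register only the mediated view for minor faults. */
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_MINOR_SHMEM,
        };
        ioctl(uffd, UFFDIO_API, &api);

        struct uffdio_register reg = {
                .range = { .start = (unsigned long)guest, .len = len },
                .mode = UFFDIO_REGISTER_MODE_MINOR,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /*
         * Post-copy fetch of one page (assuming 4K base pages): write its
         * contents through the raw view, straight from wherever the data
         * landed (network buffer, RDMA region, ...)...
         */
        memcpy(raw, "page contents fetched from the source", 37);

        /* ...then make it present in the mediated view without copying. */
        struct uffdio_continue cont = {
                .range = { .start = (unsigned long)guest, .len = 4096 },
        };
        ioctl(uffd, UFFDIO_CONTINUE, &cont);

        return 0;
}

The same structure is what makes the RDMA point above work: the raw view is
always accessible, so data can land in guest memory directly, and "present"
is purely a property of the mediated view.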