On Thu, Nov 9, 2023 at 11:20 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> On Thu, Nov 09, 2023 at 09:58:49AM -0800, Sean Christopherson wrote:
> >
> > For both cases, KVM will need choke points on all accesses to guest memory.  Once
> > the choke points exist and we have signed up to maintain them, the extra burden of
> > gracefully handling "missing" memory versus frozen memory should be relatively
> > small, e.g. it'll mainly be the notify-and-wait uAPI.
> >
> > Don't get me wrong, I think Google's demand paging implementation should die a slow,
> > horrible death.  But I don't think userfaultfd is the answer for guest_memfd.
>
> As I replied in the other thread, I see possibility implementing
> userfaultfd on gmemfd, especially after I know your plan now treating
> user/kernel the same way.
>
> But I don't know whether I could have missed something here and there, and
> I'd like to read the problem first on above to understand the relationship
> between that "freeze guest mem" idea and the demand paging scheme.
>
> One thing I'd agree is we don't necessarily need to squash userfaultfd into
> gmemfd support of demand paging.  If gmemfd will only be used in KVM
> context then indeed it at least won't make a major difference; but still
> good if the messaging framework can be leveraged, meanwhile userspace that
> already support userfaultfd can cooperate with gmemfd much easier.
>
> In general, a major part of userfaultfd is really a messaging interface for
> faults to me.  A fault trap mechanism will be needed anyway for gmemfd,
> AFAIU.  When that comes maybe we can have a clearer mind of what's next.

The idea to re-use userfaultfd as a notification mechanism is really
interesting. I'm almost certain that guest page faults on missing pages
can re-use the KVM_CAP_EXIT_ON_MISSING UAPI that Anish is adding for
UFFD [1]. So that part would be the same between VMA-based UserfaultFD
and KVM/guest_memfd-based demand paging.
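To make the exit-on-missing flow concrete, here's a rough sketch of the
userspace side. To be clear, the struct and helper below are simplified
stand-ins I made up for illustration; they are not the real <linux/kvm.h>
UAPI from Anish's series. Only the general shape is assumed: the vCPU
exits with a faulting GPA range, and the VMM must resolve that range
before re-entering the vCPU.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Simplified stand-in for the memory-fault exit info (not real uapi). */
struct memory_fault_info {
	uint64_t gpa;	/* guest-physical address that faulted */
	uint64_t size;	/* length of the faulting access in bytes */
};

/*
 * Convert the faulting GPA range into the [start, end) range of guest
 * frame numbers the VMM would need to fetch (e.g. from the migration
 * source) and mark present before re-entering the vCPU.
 */
static void fault_to_gfn_range(const struct memory_fault_info *info,
			       uint64_t *start_gfn, uint64_t *end_gfn)
{
	*start_gfn = info->gpa >> PAGE_SHIFT;
	*end_gfn = (info->gpa + info->size + PAGE_SIZE - 1) >> PAGE_SHIFT;
}
```

The point is just that the VMM's fault-resolution logic keys off a
(gpa, size) pair, so it is independent of whether the fault was trapped
via a VMA or via guest_memfd.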
And for the blocking notification in KVM, re-using the userfaultfd file
descriptor seems like a neat idea. We could have a KVM ioctl to register
the fd with KVM, and then KVM can notify userspace through it whenever
KVM needs to block on a missing page. The uffd_msg struct could be
extended with a new "gfn" or "guest_memfd" type of fault info. I'm not
quite sure how the wait-queuing will work, but I'm sure it's solvable.
With these two together, the UAPI for notifying userspace would be the
same for UserfaultFD and KVM.

As for whether to integrate the "missing" page support in guest_memfd or
KVM... I'm obviously partial to the latter because then Google can also
use it for HugeTLB. But now that I think about it, isn't the KVM-based
approach useful to the broader community as well?

For example, QEMU could use the KVM-based demand paging for all
KVM-generated accesses to guest memory. This would provide 4K-granular
demand paging for _most_ memory accesses. Then for vhost and userspace
accesses, QEMU can set up a separate VMA mapping of guest memory and use
UserfaultFD. The userspace/vhost accesses would have to be done at the
huge page size granularity (if using HugeTLB). But most accesses should
still come from KVM, so this would be a real improvement over a pure
UserfaultFD approach.

And on the more practical side... If we integrate missing page support
directly into guest_memfd, I'm not sure how one part would even work:
userspace needs a way to write the contents of a missing page before
marking it present. So we'd need some sort of special flag to mmap() to
bypass the missing-page interception? I'm sure it's solvable, but the
KVM-based approach does not have this problem.

[1] https://lore.kernel.org/kvm/20231109210325.3806151-1-amoorthy@xxxxxxxxxx/
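For the record, here is the kind of uffd_msg extension I have in mind.
This is purely a hypothetical sketch: UFFD_EVENT_GUEST_MEMFD, its value,
and the "guest_memfd" member do not exist in the real uapi, and the
layout below is a simplified stand-in for struct uffd_msg rather than
the actual <linux/userfaultfd.h> definition.

```c
#include <assert.h>
#include <stdint.h>

#define UFFD_EVENT_GUEST_MEMFD	0x17	/* hypothetical event code */

/* Simplified stand-in for struct uffd_msg (not the real uapi layout). */
struct uffd_msg_sketch {
	uint8_t event;
	uint8_t reserved[7];
	union {
		struct {
			uint64_t flags;
			uint64_t address;	/* existing VMA-based fault record */
		} pagefault;
		struct {
			uint64_t gfn;	/* faulting guest frame number */
			uint32_t slot;	/* memslot, to locate the gmem fd */
			uint32_t flags;
		} guest_memfd;	/* hypothetical new fault record */
	} arg;
};
```

A new KVM ioctl (name TBD) would register the userfaultfd with the VM,
and KVM would queue a message like the above instead of (or before)
blocking on the missing page.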