One of the main challenges of using userfaultfd is its performance. I
have some ideas for how userfaultfd could be made scalable for post-copy
live migration. I'm not sending a series for now; I want to make sure
the general approach here is something upstream would be interested in.

== Background ==

The main scalability bottleneck comes from queueing faults with
userfaultfd (i.e., interacting with fault_wqh/fault_pending_wqh). Doing
so requires us to take those wait_queues' locks exclusively. I think we
have these options to deal with this:

1. Avoid queueing faults (if possible).
2. Reduce contention (have lots of VMAs, 1 userfaultfd per VMA).
3. Allow multiple userfaultfds on a VMA.
4. Remove contention in the wait_queues (i.e., implement a "lockless
   wait_queue", whatever that might be).

#2 can help a little bit, but we have two problems: we don't want TONS
of VMAs, and it doesn't help the case where we have lots of faults on
the same VMA. #3 could be possible, but it would be complicated and
wouldn't completely fix the problem. #4, which I doubt is even feasible,
would introduce a lot of complexity.

#1, however, is quite doable. The main codepath for post-copy, the path
that is taken when a vCPU attempts to access unmapped memory, is (for
x86, but similar for other architectures):

  handle_ept_violation -> hva_to_pfn -> GUP -> handle_userfault

I'll call this the "EPT violation path" or "mem fault path." Other
post-copy paths include at least:

(i) KVM attempts to access guest memory via
    copy_{to,from}_user -> #PF -> handle_mm_fault -> handle_userfault, and
(ii) other callers of gfn_to_pfn* or hva_to_pfn* outside of the EPT
     violation path (e.g., instruction emulation).

We want the EPT violation path to be fast, as it is taken the vast
majority of the time. Note that this case is run by the vCPU thread
itself / the thread that called KVM_RUN, and, if GUP "fails", in most
cases KVM_RUN will exit with -EFAULT. We can use this to our advantage.
If we can get KVM_RUN to exit with information about which page we need
to fetch, we can do post-copy, and we never have to queue a page fault
with userfaultfd!

== Getting the faulting GPA to userspace ==

KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged), and
it provides the main functionality we need. We can extend it easily to
support our use case here, and I think we have at least two options:

- Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
  KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
  otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
  -EFAULT).
- We're already introducing a new CAP, so just tie the above behavior to
  whether or not one of the CAPs (below) is being used.

== Potential Solutions ==

We need the solution to handle both the EPT violation case and other
cases properly. Today, we can easily handle the EPT violation case if we
just use UFFD_FEATURE_SIGBUS, but that doesn't fix the other cases
(e.g., instruction emulation might fail, and we won't know what to do to
resolve it).

In both of the following solutions, hva_to_pfn needs to know if the
caller is the EPT violation/mem fault path. To do that, we'll probably
need to add a parameter to __gfn_to_pfn_memslot, gfn_to_pfn_prot, and
maybe some other functions. I'm not sure what the cleanest way to do
this is. It's possible that the new parameter here could be more general
than "if we came from a mem fault": whether the caller wants GUP to fail
quickly or not.

Now that hva_to_pfn knows if it is being called from memfault, we can
talk about how we can make it fail quickly in the userfaultfd case.

-- Introduce KVM_CAP_USERFAULT_NOWAIT

In hva_to_pfn_slow, if we came from a mem_fault, we can include a new
flag in our call to GUP: FOLL_USERFAULT_NOWAIT. Then, in GUP, it can
pass a new fault flag if it must call into a page fault routine:
FAULT_USERFAULT_NOWAIT.
That will make its way to handle_userfault(), and we can exit quickly
(say, with VM_FAULT_SIGBUS, but any VM_FAULT_ERROR would do).

Userspace can then take appropriate action: if they registered for
MISSING faults, we can UFFDIO_COPY and, if they registered for MINOR
faults, we can UFFDIO_CONTINUE. However, userspace no longer knows which
kind of fault it was if they registered for both kinds. I don't see this
as a problem.

-- Introduce KVM_CAP_MEM_FAULT_NOWAIT

In KVM, if this CAP is specified, never call hva_to_pfn_slow from the
mem fault path, and always return KVM_PFN_ERR_FAULT if fast GUP fails.
Fast GUP can fail for all sorts of reasons, so the actions userspace can
take to resolve these are more complicated:

1) If userspace knows that we never UFFDIO_COPY'd or UFFDIO_CONTINUE'd
   the page, we can do that now and restart the vCPU.
2) If userspace has previously UFFDIO_COPY/CONTINUE'd, we need to get
   the kernel to make the page ready again. We could read from the
   faulting address, but that might set up a read-only mapping, so
   instead, we can use MADV_POPULATE_WRITE to set up a RW mapping, and
   fast GUP should succeed. If MADV_POPULATE_WRITE races with a
   different thread that is UFFDIO_COPY/CONTINUEing the same page and
   happens to win, it will drop into handle_userfault and be woken up
   with a UFFDIO_WAKE later via the same path that handles the
   non-mem-fault case.

This solution might seem bizarre, but it makes it so that the mem fault
path never needs to grab the mmap lock for reading. (We still have to
grab it for reading in UFFDIO_COPY/CONTINUE.)

== Problems ==

The major problem here is that this only solves the scalability problem
for the KVM demand paging case. Other userfaultfd users, if they have
scalability problems, will need to find another approach.

- James Houghton

[1]: https://lore.kernel.org/all/20221025151344.3784230-4-chao.p.peng@xxxxxxxxxxxxxxx/
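
As a rough illustration of cases 1) and 2) under
KVM_CAP_MEM_FAULT_NOWAIT, userspace could keep a per-page bitmap of what
it has already installed and pick the resolution from that. This is only
a sketch: the tracker, the enum, and the function names below are
hypothetical, it assumes 4 KiB pages, a tracked region starting at GPA 0,
and a single thread updating the bitmap, and the actual ioctl/madvise
calls are left as comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB guest pages */

/* Hypothetical: what userspace should do for a reported faulting GPA. */
enum resolve_action {
	RESOLVE_INSTALL_PAGE,   /* fetch the page, then UFFDIO_COPY or UFFDIO_CONTINUE */
	RESOLVE_POPULATE_WRITE, /* madvise(hva, len, MADV_POPULATE_WRITE) */
};

/* Hypothetical tracker: one bit per guest page, set once installed. */
struct page_tracker {
	uint64_t *installed;
	size_t nr_pages;
};

size_t gpa_to_index(uint64_t gpa)
{
	return (size_t)(gpa >> SKETCH_PAGE_SHIFT);
}

bool page_installed(const struct page_tracker *t, uint64_t gpa)
{
	size_t idx = gpa_to_index(gpa);

	assert(idx < t->nr_pages);
	return (t->installed[idx / 64] >> (idx % 64)) & 1;
}

/* Call only after UFFDIO_COPY/UFFDIO_CONTINUE has succeeded for the page. */
void mark_installed(struct page_tracker *t, uint64_t gpa)
{
	size_t idx = gpa_to_index(gpa);

	assert(idx < t->nr_pages);
	t->installed[idx / 64] |= UINT64_C(1) << (idx % 64);
}

/* Case 1) maps to INSTALL_PAGE, case 2) maps to POPULATE_WRITE. */
enum resolve_action choose_action(const struct page_tracker *t, uint64_t gpa)
{
	return page_installed(t, gpa) ? RESOLVE_POPULATE_WRITE
				      : RESOLVE_INSTALL_PAGE;
}
```

After the vCPU exits, userspace would call choose_action() with the
reported GPA, perform the chosen step, call mark_installed() once a
UFFDIO_COPY/CONTINUE has succeeded, and then re-enter KVM_RUN. A real
VMM with multiple vCPU threads would need atomic bitmap updates.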