Re: [RFC] Improving userfaultfd scalability for live migration

Sean Christopherson <seanjc@xxxxxxxxxx> · Sat, 3 Dec 2022 01:03:38 +0000

On Thu, Dec 01, 2022, James Houghton wrote:
> #1, however, is quite doable. The main codepath for post-copy, the
> path that is taken when a vCPU attempts to access unmapped memory, is
> (for x86, but similar for other architectures): handle_ept_violation
> -> hva_to_pfn -> GUP -> handle_userfault. I'll call this the "EPT
> violation path" or "mem fault path." Other post-copy paths include at
> least: (i) KVM attempts to access guest memory via.
> copy_{to,from}_user -> #pf -> handle_mm_fault -> handle_userfault, and
> (ii) other callers of gfn_to_pfn* or hva_to_pfn* outside of the EPT
> violation path (e.g., instruction emulation).
> 
> We want the EPT violation path to be fast, as it is taken the vast
> majority of the time.

...

> == Getting the faulting GPA to userspace ==
> KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> and it provides the main functionality we need. We can extend it
> easily to support our use case here, and I think we have at least two
> options:
> - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> -EFAULT).
> - We're already introducing a new CAP, so just tie the above behavior
> to whether or not one of the CAPs (below) is being used.

We might even be able to get away with a third option: unconditionally return
KVM_EXIT_MEMORY_FAULT instead of -EFAULT when the error occurs when accessing
guest memory.

> == Problems ==
> The major problem here is that this only solves the scalability
> problem for the KVM demand paging case. Other userfaultfd users, if
> they have scalability problems, will need to find another approach.

It may not fully solve KVM's problem either.  E.g. if the VM is running nested
VMs, many (most?) of the user faults could be triggered by FNAME(walk_addr_generic)
via __get_user() when walking L1's EPT tables.

Disclaimer: I know _very_ little about UFFD.

Rather than add yet another flag to gup(), what about flag to say the task doesn't
want to wait for UFFD faults?  If desired/necessary, KVM could even toggle the flag
in KVM_RUN so that faults that occur outside of KVM ultimately don't send an actual
SIGBUGS.

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 07c81ab3fd4d..7f66b56dd6e7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
         * shmem_vm_ops->fault method is invoked even during
         * coredumping without mmap_lock and it ends up here.
         */
-       if (current->flags & (PF_EXITING|PF_DUMPCORE))
+       if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
                goto out;
 
        /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffb6eb55cd13..4c6c53ac6531 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1729,7 +1729,7 @@ extern struct pid *cad_pid;
 #define PF_MEMALLOC            0x00000800      /* Allocating memory */
 #define PF_NPROC_EXCEEDED      0x00001000      /* set_user() noticed that RLIMIT_NPROC was exceeded */
 #define PF_USED_MATH           0x00002000      /* If unset the fpu must be initialized before use */
-#define PF__HOLE__00004000     0x00004000
+#define PF_NO_UFFD_WAIT                0x00004000
 #define PF_NOFREEZE            0x00008000      /* This thread should not be frozen */
 #define PF__HOLE__00010000     0x00010000
 #define PF_KSWAPD              0x00020000      /* I am kswapd */