Re: [RFC] Improving userfaultfd scalability for live migration

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, Dec 03, 2022 at 01:03:38AM +0000, Sean Christopherson wrote:
> On Thu, Dec 01, 2022, James Houghton wrote:
> > #1, however, is quite doable. The main codepath for post-copy, the
> > path that is taken when a vCPU attempts to access unmapped memory, is
> > (for x86, but similar for other architectures): handle_ept_violation
> > -> hva_to_pfn -> GUP -> handle_userfault. I'll call this the "EPT
> > violation path" or "mem fault path." Other post-copy paths include at
> > least: (i) KVM attempts to access guest memory via.
> > copy_{to,from}_user -> #pf -> handle_mm_fault -> handle_userfault, and
> > (ii) other callers of gfn_to_pfn* or hva_to_pfn* outside of the EPT
> > violation path (e.g., instruction emulation).
> > 
> > We want the EPT violation path to be fast, as it is taken the vast
> > majority of the time.
> 
> ...
> 
> > == Getting the faulting GPA to userspace ==
> > KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> > and it provides the main functionality we need. We can extend it
> > easily to support our use case here, and I think we have at least two
> > options:
> > - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> > KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> > otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> > -EFAULT).
> > - We're already introducing a new CAP, so just tie the above behavior
> > to whether or not one of the CAPs (below) is being used.
> 
> We might even be able to get away with a third option: unconditionally return
> KVM_EXIT_MEMORY_FAULT instead of -EFAULT when the error occurs when accessing
> guest memory.
> 
> > == Problems ==
> > The major problem here is that this only solves the scalability
> > problem for the KVM demand paging case. Other userfaultfd users, if
> > they have scalability problems, will need to find another approach.
> 
> It may not fully solve KVM's problem either.  E.g. if the VM is running nested
> VMs, many (most?) of the user faults could be triggered by FNAME(walk_addr_generic)
> via __get_user() when walking L1's EPT tables.
> 
> Disclaimer: I know _very_ little about UFFD.
> 
> Rather than add yet another flag to gup(), what about flag to say the task doesn't
> want to wait for UFFD faults?  If desired/necessary, KVM could even toggle the flag
> in KVM_RUN so that faults that occur outside of KVM ultimately don't send an actual
> SIGBUGS.
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 07c81ab3fd4d..7f66b56dd6e7 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>          * shmem_vm_ops->fault method is invoked even during
>          * coredumping without mmap_lock and it ends up here.
>          */
> -       if (current->flags & (PF_EXITING|PF_DUMPCORE))
> +       if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
>                 goto out;

I'll have a closer read on the nested part, but note that this path already
has the mmap lock then it invalidates the goal if we want to avoid taking
it from the first place, or maybe we don't care?

If we want to avoid taking the mmap lock at all (hence the fast-gup
approach), I'd also suggest we don't make it related to uffd at all but
instead an interface to say "let's check whether the page tables are there
(walk pgtable by fast-gup only), if not return to userspace".

Because IIUC fast-gup has nothing to do with uffd, so it can also be a more
generic interface.  It's just that if the userspace knows what it's doing
(postcopy-ing), it knows then the faults can potentially be resolved by
userfaultfd at this stage.

>  
>         /*
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ffb6eb55cd13..4c6c53ac6531 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1729,7 +1729,7 @@ extern struct pid *cad_pid;
>  #define PF_MEMALLOC            0x00000800      /* Allocating memory */
>  #define PF_NPROC_EXCEEDED      0x00001000      /* set_user() noticed that RLIMIT_NPROC was exceeded */
>  #define PF_USED_MATH           0x00002000      /* If unset the fpu must be initialized before use */
> -#define PF__HOLE__00004000     0x00004000
> +#define PF_NO_UFFD_WAIT                0x00004000
>  #define PF_NOFREEZE            0x00008000      /* This thread should not be frozen */
>  #define PF__HOLE__00010000     0x00010000
>  #define PF_KSWAPD              0x00020000      /* I am kswapd */
> 

-- 
Peter Xu





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux