On Mon, Dec 5, 2022 at 9:31 AM David Matlack <dmatlack@xxxxxxxxxx> wrote:
>
> On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Sat, Dec 03, 2022 at 01:03:38AM +0000, Sean Christopherson wrote:
> > > On Thu, Dec 01, 2022, James Houghton wrote:
> > > > #1, however, is quite doable. The main codepath for post-copy, the
> > > > path that is taken when a vCPU attempts to access unmapped memory, is
> > > > (for x86, but similar for other architectures): handle_ept_violation
> > > > -> hva_to_pfn -> GUP -> handle_userfault. I'll call this the "EPT
> > > > violation path" or "mem fault path." Other post-copy paths include at
> > > > least: (i) KVM attempts to access guest memory via
> > > > copy_{to,from}_user -> #pf -> handle_mm_fault -> handle_userfault, and
> > > > (ii) other callers of gfn_to_pfn* or hva_to_pfn* outside of the EPT
> > > > violation path (e.g., instruction emulation).
> > > >
> > > > We want the EPT violation path to be fast, as it is taken the vast
> > > > majority of the time.
> > >
> > > ...
> > >
> > > > == Getting the faulting GPA to userspace ==
> > > > KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> > > > and it provides the main functionality we need. We can extend it
> > > > easily to support our use case here, and I think we have at least two
> > > > options:
> > > > - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> > > > KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> > > > otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> > > > -EFAULT).
> > > > - We're already introducing a new CAP, so just tie the above behavior
> > > > to whether or not one of the CAPs (below) is being used.
> > >
> > > We might even be able to get away with a third option: unconditionally return
> > > KVM_EXIT_MEMORY_FAULT instead of -EFAULT when an error occurs while accessing
> > > guest memory.
> > >
> > > > == Problems ==
> > > > The major problem here is that this only solves the scalability
> > > > problem for the KVM demand paging case. Other userfaultfd users, if
> > > > they have scalability problems, will need to find another approach.
> > >
> > > It may not fully solve KVM's problem either. E.g. if the VM is running nested
> > > VMs, many (most?) of the user faults could be triggered by FNAME(walk_addr_generic)
> > > via __get_user() when walking L1's EPT tables.
>
> We could always modify FNAME(walk_addr_generic) to return out to user
> space in the same way if that is indeed another bottleneck.

Scratch that, walk_addr_generic ultimately has a ton of callers
throughout KVM, so it would be difficult to plumb the error handling
out to userspace.

That being said, I'm not sure this is really going to be a bottleneck.
Internally we are using the netlink socket for these faults without
issue and, from a theoretical perspective, only a small fraction of
guest memory (and thus a small fraction of uffd faults) is guest EPT
tables. And any guest-initiated access of a not-present guest EPT
table will still go through handle_ept_violation(), not
walk_addr_generic().

> > >
> > > Disclaimer: I know _very_ little about UFFD.
> > >
> > > Rather than add yet another flag to gup(), what about a flag to say the task doesn't
> > > want to wait for UFFD faults? If desired/necessary, KVM could even toggle the flag
> > > in KVM_RUN so that faults that occur outside of KVM ultimately don't send an actual
> > > SIGBUS.
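
To make the "wrap the specific call sites" idea quoted just below a bit
more concrete, here is roughly what that could look like. Sketch only:
the helper name is made up, PF_NO_UFFD_WAIT is the hypothetical flag
from the sched.h hunk further down, and KVM_EXIT_MEMORY_FAULT is the
not-yet-merged exit reason discussed above.

static int kvm_copy_from_guest_no_uffd_wait(struct kvm_vcpu *vcpu,
                                            void *data, unsigned long hva,
                                            unsigned long len)
{
        unsigned long left;

        /*
         * Skip the handle_userfault() sleep for this one access. Real
         * code would save/restore the flag rather than clear it blindly.
         */
        current->flags |= PF_NO_UFFD_WAIT;
        left = copy_from_user(data, (void __user *)hva, len);
        current->flags &= ~PF_NO_UFFD_WAIT;

        if (!left)
                return 0;

        /* Let userspace resolve the fault instead of eating -EFAULT. */
        vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
        return -EFAULT;
}

Toggling the flag only around individual accesses, rather than for the
whole of KVM_RUN, keeps faults taken everywhere else behaving exactly
as they do today.
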
>
> There are some copy_to/from_user() calls in KVM that cannot easily
> exit out to KVM_RUN (for example, in the guts of the emulator IIRC).
> But we could use your approach just to wrap the specific call sites
> that can return from KVM_RUN.
>
> >
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index 07c81ab3fd4d..7f66b56dd6e7 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > >           * shmem_vm_ops->fault method is invoked even during
> > >           * coredumping without mmap_lock and it ends up here.
> > >           */
> > > -        if (current->flags & (PF_EXITING|PF_DUMPCORE))
> > > +        if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
> > >                  goto out;
> >
> > I'll have a closer read on the nested part, but note that this path already
> > holds the mmap lock, which defeats the goal if we want to avoid taking
> > it in the first place, or maybe we don't care?
> >
> > If we want to avoid taking the mmap lock at all (hence the fast-gup
> > approach), I'd also suggest we don't make it related to uffd at all but
> > instead an interface to say "let's check whether the page tables are there
> > (walk pgtable by fast-gup only), if not return to userspace".
> >
> > Because IIUC fast-gup has nothing to do with uffd, so it can also be a more
> > generic interface. It's just that if userspace knows what it's doing
> > (postcopying), it knows the faults can potentially be resolved by
> > userfaultfd at this stage.
>
> Are there any cases where fast-gup can fail while uffd is enabled but
> it's not due to uffd? e.g. if a page is swapped out? I don't know what
> userspace would do in those situations to make forward progress.
>
> >
> > >          /*
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index ffb6eb55cd13..4c6c53ac6531 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -1729,7 +1729,7 @@ extern struct pid *cad_pid;
> > >  #define PF_MEMALLOC          0x00000800      /* Allocating memory */
> > >  #define PF_NPROC_EXCEEDED    0x00001000      /* set_user() noticed that RLIMIT_NPROC was exceeded */
> > >  #define PF_USED_MATH         0x00002000      /* If unset the fpu must be initialized before use */
> > > -#define PF__HOLE__00004000   0x00004000
> > > +#define PF_NO_UFFD_WAIT      0x00004000
> > >  #define PF_NOFREEZE          0x00008000      /* This thread should not be frozen */
> > >  #define PF__HOLE__00010000   0x00010000
> > >  #define PF_KSWAPD            0x00020000      /* I am kswapd */
> >
> > --
> > Peter Xu
> >
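
To make the "walk pgtable by fast-gup only, if not return to userspace"
interface a bit more concrete on the KVM side, the EPT violation path
could do something like the sketch below. Again purely illustrative:
the function is made up and the memory_fault layout is whatever the
KVM_EXIT_MEMORY_FAULT series ends up defining.

static int kvm_faultin_pfn_fast_only(struct kvm_vcpu *vcpu,
                                     struct kvm_page_fault *fault)
{
        unsigned int flags = fault->write ? FOLL_WRITE : 0;
        struct page *page;

        /* get_user_page_fast_only() neither takes the mmap lock nor faults. */
        if (get_user_page_fast_only(fault->hva, flags, &page)) {
                fault->pfn = page_to_pfn(page);
                return 0;
        }

        /* Not present (uffd-registered, swapped out, ...): punt to userspace. */
        vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
        vcpu->run->memory_fault.gpa = gfn_to_gpa(fault->gfn);
        vcpu->run->memory_fault.size = PAGE_SIZE;
        return -EFAULT;
}

Which is really the question above: a failure here could just as well
be a swapped-out page as a uffd-registered one, so userspace needs a
way to make forward progress in those cases too.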