Sean,

On Fri, Mar 17, 2023 at 01:17:22PM -0700, Sean Christopherson wrote:
> On Fri, Mar 17, 2023, Oliver Upton wrote:
> > On Wed, Mar 15, 2023 at 02:17:33AM +0000, Anish Moorthy wrote:
> > > Add documentation, memslot flags, useful helper functions, and the
> > > actual new capability itself.
> > >
> > > Memory fault exits on absent mappings are particularly useful for
> > > userfaultfd-based live migration postcopy. When many vCPUs fault upon a
> > > single userfaultfd the faults can take a while to surface to userspace
> > > due to having to contend for uffd wait queue locks. Bypassing the uffd
> > > entirely by triggering a vCPU exit avoids this contention and can improve
> > > the fault rate by as much as 10x.
> > > ---
> > >  Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++++++---
> > >  include/linux/kvm_host.h       |  6 ++++++
> > >  include/uapi/linux/kvm.h       |  3 +++
> > >  tools/include/uapi/linux/kvm.h |  2 ++
> > >  virt/kvm/kvm_main.c            |  7 ++++++-
> > >  5 files changed, 51 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > index f9ca18bbec879..4932c0f62eb3d 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
> > >    /* for kvm_userspace_memory_region::flags */
> > >    #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > >    #define KVM_MEM_READONLY (1UL << 1)
> > > +  #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
> >
> > call it KVM_MEM_EXIT_ABSENT_MAPPING
>
> Ooh, look, a bikeshed! :-)

Couldn't help myself :)

> I don't think it should have "EXIT" in the name. The exit to userspace is a side
> effect, e.g. KVM already exits to userspace on unresolved userfaults. The only
> thing this knob _directly_ controls is whether or not KVM attempts the slow path.
> If we give the flag a name like "exit on absent userspace mappings", then KVM will
> appear to do the wrong thing when KVM exits on a truly absent userspace mapping.
>
> And as I argued in the last version[*], I am _strongly_ opposed to KVM speculating
> on why KVM is exiting to userspace. I.e. KVM should not set a special flag if
> the memslot has "fast only" behavior. The only thing the flag should do is control
> whether or not KVM tries slow paths, what KVM does in response to an unresolved
> fault should be an orthogonal thing.
>
> E.g. If KVM encounters an unmapped page while prefetching SPTEs, KVM will (correctly)
> not exit to userspace and instead simply terminate the prefetch. Obviously we
> could solve that through documentation, but I don't see any benefit in making this
> more complex than it needs to be.

I couldn't care less about what the user-facing portion of this thing is
called, TBH. We could just refer to it as KVM_MEM_BIT_2 /s

The only bit I wanted to avoid is having a collision in the kernel between
literal faults arising from hardware and exits to userspace that we are also
calling 'faults'.

> [*] https://lkml.kernel.org/r/Y%2B0RYMfw6pHrSLX4%40google.com
>
> > > +7.35 KVM_CAP_MEMORY_FAULT_NOWAIT
> > > +--------------------------------
> > > +
> > > +:Architectures: x86, arm64
> > > +:Returns: -EINVAL.
> > > +
> > > +The presence of this capability indicates that userspace may pass the
> > > +KVM_MEM_ABSENT_MAPPING_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
> > > +to exit to populate 'kvm_run.memory_fault' and exit to userspace (*) in response
> > > +to page faults for which the userspace page tables do not contain present
> > > +mappings. Attempting to enable the capability directly will fail.
> > > +
> > > +The 'gpa' and 'len' fields of kvm_run.memory_fault will be set to the starting
> > > +address and length (in bytes) of the faulting page. 'flags' will be set to
> > > +KVM_MEMFAULT_REASON_ABSENT_MAPPING.
> > > +
> > > +Userspace should determine how best to make the mapping present, then take
> > > +appropriate action. For instance, in the case of absent mappings this might
> > > +involve establishing the mapping for the first time via UFFDIO_COPY/CONTINUE or
> > > +faulting the mapping in using MADV_POPULATE_READ/WRITE. After establishing the
> > > +mapping, userspace can return to KVM to retry the previous memory access.
> > > +
> > > +(*) NOTE: On x86, KVM_CAP_X86_MEMORY_FAULT_EXIT must be enabled for the
> > > +KVM_MEMFAULT_REASON_ABSENT_MAPPING_reason: otherwise userspace will only receive
> > > +a -EFAULT from KVM_RUN without any useful information.
> >
> > I'm not a fan of this architecture-specific dependency. Userspace is already
> > explicitly opting in to this behavior by way of the memslot flag. These sort
> > of exits are entirely orthogonal to the -EFAULT conversion earlier in the
> > series.
>
> Ya, yet another reason not to speculate on why KVM wasn't able to resolve a fault.

Regardless of what we name this memslot flag, we're already getting explicit
opt-in from userspace for new behavior. There seems to be zero value in
supporting memslot_flag && !MEMORY_FAULT_EXIT (i.e. returning EFAULT), so why
even bother? Requiring two levels of opt-in to have the intended outcome for a
single architecture seems nauseating from a userspace perspective.

-- 
Thanks,
Oliver
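
For anyone following along, the userspace side described in the quoted
documentation (take the 'gpa' and 'len' from the exit, find the backing host
address, make the mapping present, re-enter KVM_RUN) might look roughly like
the sketch below. It is only an illustration and not code from the series:
struct memory_fault_info and struct slot_info are hypothetical stand-ins for
the proposed kvm_run.memory_fault fields and for a VMM's own memslot
bookkeeping, and it uses the MADV_POPULATE_WRITE route rather than
UFFDIO_COPY/CONTINUE. The fallback value for MADV_POPULATE_WRITE and the
page-touching path for older kernels are assumptions as well.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23 /* value used by Linux 5.14+; older headers lack it */
#endif

/* Hypothetical mirror of the proposed kvm_run.memory_fault fields. */
struct memory_fault_info {
	uint64_t flags; /* reason bits; not inspected in this sketch */
	uint64_t gpa;   /* guest physical address of the faulting page */
	uint64_t len;   /* length of the faulting range in bytes */
};

/* Hypothetical VMM-side record of one memslot. */
struct slot_info {
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	void *userspace_addr;
};

/* Translate the faulting range to a host address, or NULL if it is not
 * fully contained in the slot. */
static char *fault_to_hva(const struct slot_info *slot,
			  const struct memory_fault_info *fault)
{
	uint64_t off;

	if (fault->gpa < slot->guest_phys_addr)
		return NULL;
	off = fault->gpa - slot->guest_phys_addr;
	if (off >= slot->memory_size || fault->len > slot->memory_size - off)
		return NULL;
	return (char *)slot->userspace_addr + off;
}

/* Make the mapping present so the vCPU can retry the access.
 * Returns 0 on success, -1 on failure. */
static int resolve_absent_mapping(const struct slot_info *slot,
				  const struct memory_fault_info *fault)
{
	char *hva = fault_to_hva(slot, fault);
	volatile char *p;
	uint64_t off;
	long page;

	if (hva == NULL)
		return -1;
	if (madvise(hva, fault->len, MADV_POPULATE_WRITE) == 0)
		return 0;
	if (errno != EINVAL)
		return -1;

	/* EINVAL may mean the kernel predates MADV_POPULATE_WRITE: fall back
	 * to write-touching each page without changing its contents. */
	page = sysconf(_SC_PAGESIZE);
	if (page <= 0)
		return -1;
	p = hva;
	for (off = 0; off < fault->len; off += (uint64_t)page)
		p[off] = p[off];
	return 0;
}
```

A VMM would call resolve_absent_mapping() from its run loop when KVM_RUN
returns with the memory fault exit, then call KVM_RUN again on success.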