On Wed, Feb 15, 2023 at 12:59 AM Oliver Upton <oliver.upton@xxxxxxxxx> wrote: > > Why not use KVM_ENABLE_CAP instead for this? Its a preexisting UAPI for > toggling KVM behaviors. Oh I wrote it this way because, even with the "args" field in "struct kvm_enable_cap" to express the enable/disable intent, it felt weird to allow disabling the feature through KVM_ENABLE_CAP. But it seems like that's the convention, so I'll make the change. > > 5. The kvm_run structure > > ======================== > > > > @@ -6544,6 +6556,21 @@ array field represents return values. The userspace should update the return > > values of SBI call before resuming the VCPU. For more details on RISC-V SBI > > spec refer, https://github.com/riscv/riscv-sbi-doc. > > + > > + /* KVM_EXIT_MEMORY_FAULT */ > > + struct { > > + __u64 gpa; > > + __u64 size; > > + } memory_fault; > > + > > How is userspace expected to differentiate the gup_fast() failed exit > from the guest-private memory exit? I don't think flags are a good idea > for this, as it comes with the illusion that both events can happen on a > single exit. In reality, these are mutually exclusive. > > A fault type/code would be better here, with the option to add flags at > a later date that could be used to further describe the exit (if > needed). Agreed. Something like this, then? + struct { + __u32 fault_code; + __u64 reserved; + __u64 gpa; + __u64 size; + } memory_fault; The "reserved" field is meant to be the placeholder for a future "flags" field. Let me know if there's a better/more conventional way to achieve this. On Wed, Feb 15, 2023 at 9:07 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > On Wed, Feb 15, 2023, Marc Zyngier wrote: > > On Wed, 15 Feb 2023 01:16:11 +0000, Anish Moorthy <amoorthy@xxxxxxxxxx> wrote: > > > 8. Other capabilities. > > > ====================== > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > index 109b18e2789c4..9352e7f8480fb 100644 > > > --- a/include/linux/kvm_host.h > > > +++ b/include/linux/kvm_host.h > > > @@ -801,6 +801,9 @@ struct kvm { > > > bool vm_bugged; > > > bool vm_dead; > > > > > > + rwlock_t mem_fault_nowait_lock; > > > + bool mem_fault_nowait; > > > > A full-fat rwlock to protect a single bool? What benefits do you > > expect from a rwlock? Why is it preferable to an atomic access, or a > > simple bitop? > > There's no need to have any kind off dedicated atomicity. The only readers are > in vCPU context, just disallow KVM_CAP_MEM_FAULT_NOWAIT after vCPUs are created. I think we do need atomicity here. When KVM_CAP_MEM_FAULT_NOWAIT is enabled async page faults are essentially disabled: so userspace will likely want to disable the cap at some point (such as the end of live migration post-copy). Since we want to support this without having to pause vCPUs, there's an atomicity requirement. Marc, I used an rwlock simply because it seemed like the easiest correct thing to do. I had a hunch that I'd be asked to change this to an atomic, so I can go ahead and do that barring any other suggestions. On Wed, Feb 15, 2023 at 12:41 AM Marc Zyngier <maz@xxxxxxxxxx> wrote: > > > +:Returns: 0 on success, or -1 if KVM_CAP_MEM_FAULT_NOWAIT is not present. > > Please pick a meaningful error code instead of -1. And if you intended > this as -EPERM, please explain the rationale (-EINVAL would seem more > appropriate). As I mention earlier, I'll be switching to toggling the capability via KVM_ENABLE_CAP so this KVM_SET_MEM_FAULT_NOWAIT ioctl is going to go away. I will make sure to set an appropriate error code if the user passes nonsensical "args" to KVM_ENABLE_CAP though, whatever that ends up meaning. > > + > > +Enables (state=true) or disables (state=false) waitless memory faults. For more > > +information, see the documentation of KVM_CAP_MEM_FAULT_NOWAIT. > > + > > 5. The kvm_run structure > > ======================== > > > > @@ -6544,6 +6556,21 @@ array field represents return values. The userspace should update the return > > values of SBI call before resuming the VCPU. For more details on RISC-V SBI > > spec refer, https://github.com/riscv/riscv-sbi-doc. > > > > +:: > > + > > + /* KVM_EXIT_MEMORY_FAULT */ > > + struct { > > + __u64 gpa; > > + __u64 size; > > + } memory_fault; > > + > > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has > > +encountered a memory error which is not handled by KVM kernel module and > > +which userspace may choose to handle. > > No, the vcpu hasn't "encountered a memory error". The hypervisor has > taken a page fault, which is very different. And it isn't that KVM > couldn't handle it (or we wouldn't have a hypervisor at all). From > what I understand of this series (possibly very little), userspace has > to *ask* for these, and they are delivered in specific circumstances. > Which are? Thanks for the correction: "encountered a memory error" was incorrect phrasing. What is your opinion on the following, hopefully more accurate, documentation? "An exit reason of KVM_EXIT_MEMORY_FAULT indicates that the vCPU encountered a memory fault for which the userspace page tables did not contain a present mapping. This exit is only generated when KVM_CAP_MEM_FAULT_NOWAIT is enabled: otherwise KVM will attempt to resolve the fault without exiting to userspace, returning -1 with errno=EFAULT when it cannot." It might also help to note that the vCPUs exit in exactly the same cases where userspace would get a page fault when trying to access the memory through the VMA/mapping provided to the memslot. Please let me know if you feel that this information should be included in the documentation, or if you see anything else I've gotten wrong. As for handling these faults: userspace should consider its own knowledge of the guest and determine how best to resolve each fault. For instance if userspace hasn't UFFDIO_COPY/CONTINUEd a faulting page yet, then it must do so to establish the mapping. If it already did, then the next step would be to fault in the page via MADV_POPULATE_READ|WRITE. A naive userspace could just always MADV_POPULATE_READ|WRITE in response to these faults: that would basically yield KVM's behavior *without* the capability enabled, just with extra steps, worse performance, and no chance for async page faults. A final note: this is just how the fault applies in the context of userfaultfd-based post-copy live migration. Other uses might come up in the future, in which case userspace would want to take some other sort of action. > > +'gpa' and 'size' indicate the memory range the error occurs at. Userspace > > +may handle the error and return to KVM to retry the previous memory access. > > What are these *exactly*? In what unit? Sorry, I'll add some descriptive comments here. "gpa" is the guest physical address of the faulting page (rounded down to be aligned with "size"), and "size" is just the size in bytes of the faulting page. > ... What guarantees that the > process eventually converges? How is userspace supposed to handle the > fault (which is *not* an error)? Don't you need to communicate other > information, such as the type of fault (read, write, permission or > translation fault...)? Ok, two parts to the response here. 1. As Oliver touches on earlier, we'll probably want to use this same field for different classes of memory fault in the future (such as the ones which Chao is introducing in [1]): so it does make sense to add "code" and "flags" fields which can be used to communicate more information to the user (and which can just be set to MEM_FAULT_NOWAIT/0 in this series). 2. As to why a code/flags of MEM_FAULT_NOWAIT/0 should be enough information to convey to the user here: the assumption behind this series is that MADV_POPULATE_READ|WRITE will always be enough to ensure that the vCPU can run again. This goes back to the earlier claim about vCPUs exiting in the exact same cases as when userspace would get a page fault trying to access the same memory. MADV_POPULATE_READ|WRITE is intended to resolve exactly these cases, so userspace shouldn't need anything more. Again, a VMM enabling the new exit to make post-copy faster must determine what action to take depending on whether it has UFFDIO_COPY|CONTINUEd the faulting page (at least for now, perhaps there will be more uses in the future): we can keep discussing if this seems fragile. Just keeping things limited to userfaultfd, having KVM communicate to userspace which action is required could get difficult. For UFFDIO_CONTINUE we'd have to check if the VMA had VM_UFFD_MINOR, and for UFFDIO_COPY we'd have to at least check the page cache. > > > > +7.34 KVM_CAP_MEM_FAULT_NOWAIT > > +----------------------------- > > + > > +:Architectures: x86, arm64 > > +:Target: VM > > +:Parameters: None > > +:Returns: 0 on success, or -EINVAL if capability is not supported. > > + > > +The presence of this capability indicates that userspace can enable/disable > > +waitless memory faults through the KVM_SET_MEM_FAULT_NOWAIT ioctl. > > + > > +When waitless memory faults are enabled, fast get_user_pages failures when > > +handling EPT/Shadow Page Table violations will cause a vCPU exit > > +(KVM_EXIT_MEMORY_FAULT) instead of a fallback to slow get_user_pages. > > Do you really expect a random VMM hacker to understand what GUP is? > Also, the x86 verbiage makes zero sense to me. Please write this in a > way that is useful for userspace people, because they are the ones > consuming this documentation. I really can't imagine anyone being able > to write something useful based on this. My bad: obviously I shouldn't be exposing the kernel's implementation details here. As for what the documentation should actually say, I'm thinking I can adapt the revised description of the KVM exit from earlier ("An exit reason of KVM_EXIT_MEMORY_FAULT indicates..."). Again, please let me know if you see anything which should be changed there. > > + /* KVM_EXIT_MEMORY_FAULT */ > > + struct { > > + __u64 flags; > > + __u64 gpa; > > + __u64 size; > > + } memory_fault; > > Sigh... This doesn't even match the documentation. Ack! Ack. > > -/* Available with KVM_CAP_S390_VCPU_RESETS */ > > +/* Available with KVM_CAP_S390_VCPU_RESETS */ > > Unrelated change? Yes, I'll drop it from the next revision. > > + r = -EFAULT; > > + if (copy_from_user(&state, argp, sizeof(state))) > > A copy_from_user for... a bool? Why don't you directly treat argp as > the boolean itself? Of even better, make it a more interesting > quantity so that the API can evolve in the future. As it stands, it is > not extensible, which means it is already dead, if Linux APIs are > anything to go by. I'll keep that in mind for the future, but this is now moot since I'll be removing KVM_SET_MEM_FAULT_NOWAIT. > On a related subject: is there a set of patches for any of the main > VMMs making any use of this? No. For what it's worth, the demand_paging_selftest demonstrates the basics of what userspace can do with the new exit. Would it be helpful to see how QEMU could use it as well? [1] https://lore.kernel.org/all/CA+EHjTyzZ2n8kQxH_Qx72aRq1k+dETJXTsoOM3tggPZAZkYbCA@xxxxxxxxxxxxxx/