On Mon, Aug 12, 2024, Aneesh Kumar K.V wrote:
> Anish Moorthy <amoorthy@xxxxxxxxxx> writes:
>
> > Right now userspace just gets a bare EFAULT when the stage-2 fault
> > handler fails to fault in the relevant page. Set up a
> > KVM_EXIT_MEMORY_FAULT whenever this happens, which at the very least
> > eases debugging and might also let userspace decide on/take some
> > specific action other than crashing the VM.
> >
> > In some cases, user_mem_abort() EFAULTs before the size of the fault is
> > calculated: return 0 in these cases to indicate that the fault is of
> > unknown size.
> >
>
> VMMs are now converting private memory to shared or vice-versa on vcpu
> exit due to memory fault. This change will require VMM track each page's
> private/shared state so that they can now handle an exit fault on a
> shared memory where the fault happened due to reasons other than
> conversion.

I don't see how filling kvm_run.memory_fault in more locations changes
anything.  The userspace exits are inherently racy, e.g. userspace may have
already converted the page to the appropriate state, thus making KVM's exit
spurious.  So either the VMM already tracks state, or the VMM blindly converts
to shared/private.

> Should we make it easy by adding additional flag bits to
> indicate the fault was due to attribute and access type mismatch?

Like above, describing _why_ an exit occurred is problematic when an exit races
with a "fix" from userspace.  It's also problematic when there are multiple
possible faults, e.g. if the guest attempts to write to private memory, but
userspace has the memory mapped as read-only, shared (contrived, but possible).
Describing only the fault that KVM sees means the vCPU will encounter multiple
faults, and userspace will end up getting multiple exits.

Instead, KVM should describe the access that led to the fault, as planned in
the original series[1][2].
Userspace can then get the page into the correct state straightaway, or take
punitive action if the guest is misbehaving.

	if (is_write)
		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_WRITE;
	else if (is_exec)
		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_EXEC;
	else
		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_READ;

That said, I'm a little hesitant to capture RWX information without a use case,
mainly because it will require a new capability for userspace to be able to
rely on the information.  In hindsight, it probably would have been better to
capture RWX information in the initial implementation.  Doh.

[1] https://lore.kernel.org/all/ZIn6VQSebTRN1jtX@xxxxxxxxxx
[2] https://lore.kernel.org/all/ZR4N8cwzTMDanPUY@xxxxxxxxxx
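
For illustration only, a rough sketch of how a VMM might consume access flags
like the ones above.  Everything here is an assumption made for the example:
the SKETCH_* names and bit values, struct sketch_page_state and
sketch_handle_fault() are hypothetical and are not part of the KVM uAPI or of
the referenced series.  The handler looks only at the access that faulted and
at the VMM's own current state, so a spurious (raced) exit simply resumes the
vCPU, a permitted access is fixed up in one step, and anything else is left
to policy.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the values are invented for this sketch. */
#define SKETCH_FAULT_FLAG_READ	(1ULL << 0)
#define SKETCH_FAULT_FLAG_WRITE	(1ULL << 1)
#define SKETCH_FAULT_FLAG_EXEC	(1ULL << 2)

/* Hypothetical per-range permissions tracked by the VMM. */
struct sketch_page_state {
	bool readable;
	bool writable;
	bool executable;
};

enum sketch_action {
	SKETCH_RESUME,		/* state already allows the access; exit was spurious */
	SKETCH_FIX_STATE,	/* state was updated to allow the access */
	SKETCH_PUNISH,		/* policy forbids the access; guest is misbehaving */
};

/*
 * Decide what to do based on the access that faulted, not on why KVM
 * exited.  'cur' is what the VMM currently grants, 'allowed' is the most
 * the VMM's policy is willing to grant.
 */
enum sketch_action sketch_handle_fault(uint64_t flags,
				       struct sketch_page_state *cur,
				       const struct sketch_page_state *allowed)
{
	bool *cur_perm;
	bool allowed_perm;

	/* Mirror the precedence of the snippet above: write, exec, read. */
	if (flags & SKETCH_FAULT_FLAG_WRITE) {
		cur_perm = &cur->writable;
		allowed_perm = allowed->writable;
	} else if (flags & SKETCH_FAULT_FLAG_EXEC) {
		cur_perm = &cur->executable;
		allowed_perm = allowed->executable;
	} else {
		cur_perm = &cur->readable;
		allowed_perm = allowed->readable;
	}

	/* Another thread may already have fixed things up: just re-enter. */
	if (*cur_perm)
		return SKETCH_RESUME;

	if (!allowed_perm)
		return SKETCH_PUNISH;

	*cur_perm = true;
	return SKETCH_FIX_STATE;
}
```

In a real VMM the "fix" step would presumably be an mprotect(), memslot or
attribute update rather than flipping a bool, but the decision structure is
the point of the sketch.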