On Tue, Mar 21, 2023 at 12:43 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Tue, Mar 21, 2023, Anish Moorthy wrote:
> > On Tue, Mar 21, 2023 at 8:21 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Mar 20, 2023, Anish Moorthy wrote:
> > > > On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > > Filling kvm_run::memory_fault but not exiting to userspace is ok because
> > > > > userspace never sees the data, i.e. userspace is completely unaware. This
> > > > > behavior is not ideal from a KVM perspective as allowing KVM to fill the
> > > > > kvm_run union without exiting to userspace can lead to other bugs, e.g.
> > > > > effective corruption of the kvm_run union, but at least from a uABI
> > > > > perspective, the behavior is acceptable.
> > > >
> > > > Actually, I don't think the idea of filling in kvm_run.memory_fault for
> > > > -EFAULTs which don't make it to userspace works at all. Consider the
> > > > direct_map function, which bubbles its -EFAULT up to kvm_mmu_do_page_fault.
> > > > kvm_mmu_do_page_fault is called both by kvm_arch_async_page_ready (which
> > > > ignores the return value) and by kvm_mmu_page_fault (where the return
> > > > value does make it to userspace). Populating kvm_run.memory_fault anywhere
> > > > in or under kvm_mmu_do_page_fault seems an immediate no-go, because a
> > > > wayward kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
> > > > already-set kvm_run.memory_fault or some other kvm_run field.
> > >
> > > This particular case is a non-issue. kvm_check_async_pf_completion() is
> > > called only when the current task has control of the vCPU, i.e. is the
> > > current "running" vCPU. That's not a coincidence either: invoking
> > > kvm_mmu_do_page_fault() without having control of the vCPU would be
> > > fraught with races, e.g. the entire KVM MMU context would be unstable.
> > >
> > > That will hold true for all cases. Using a vCPU that is not loaded (not
> > > the current "running" vCPU, in KVM's misleading terminology) to access
> > > guest memory is simply not safe, as the vCPU state is non-deterministic.
> > > There are paths where KVM accesses, and even modifies, vCPU state
> > > asynchronously, e.g. for IRQ delivery and making requests, but those are
> > > very controlled flows with dedicated machinery to make them SMP safe.
> > >
> > > That said, I agree that there's a risk that KVM could clobber vcpu->run by
> > > hitting an -EFAULT without the vCPU loaded, but that's a solvable problem,
> > > e.g. the helper that fills KVM_EXIT_MEMORY_FAULT could be hardened to yell
> > > if called without the target vCPU being loaded:
> > >
> > >     int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> > >     {
> > >             preempt_disable();
> > >             if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> > >                     goto out;
> > >
> > >             vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > >             ...
> > >     out:
> > >             preempt_enable();
> > >             return -EFAULT;
> > >     }
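
That hardening looks good to me. Just to check my understanding, the filled-in
version would be something like the below? (Rough sketch only: the memory_fault
fields are whichever ones this series ends up with, and the extra parameters
are invented to match them.)

    int kvm_handle_efault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 len, u64 flags)
    {
            preempt_disable();
            /*
             * Bail, but still return -EFAULT, if this task doesn't have the
             * target vCPU loaded: touching vcpu->run from anywhere else is
             * exactly the clobbering risk described above.
             */
            if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
                    goto out;

            vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
            vcpu->run->memory_fault.gpa = gpa;
            vcpu->run->memory_fault.len = len;
            vcpu->run->memory_fault.flags = flags;
    out:
            preempt_enable();
            return -EFAULT;
    }

One nice property: since the helper returns -EFAULT either way, callers can
just do "return kvm_handle_efault(vcpu, ...);" and existing error paths keep
their current behavior even when the WARN fires.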
> > > FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without
> > > guaranteeing that KVM "immediately" exits to userspace isn't ideal, but
> > > given the amount of historical code that we need to deal with, it seems
> > > like the lesser of all evils. Unless I'm misunderstanding the use cases,
> > > unnecessarily filling kvm_run is a far better failure mode than KVM not
> > > filling kvm_run when it should, i.e. false positives are ok, false
> > > negatives are fatal.
> >
> > Don't you have this in reverse?
>
> No, I don't think so.
>
> > False negatives will just result in userspace not having useful extra
> > information for the -EFAULT it receives from KVM_RUN, in which case
> > userspace can do what you mentioned all VMMs do today and just terminate
> > the VM.
>
> And that is _really_ bad behavior if we have any hope of userspace actually
> being able to rely on this functionality. E.g. any false negative when
> userspace is trying to do postcopy demand paging will be fatal to the VM.

But -EFAULTs from KVM_RUN are already fatal today, so there's no new failure
introduced by an -EFAULT w/o a populated memory_fault field, right? Obviously
that's of no real use to userspace, but that seems like part of the point of
starting with a partial conversion: to allow for filling holes in the
implementation in the future.

It seems like what you're really concerned about here is the interaction with
the memslot fast-gup-only flag. Obviously, failing to populate
kvm_run.memory_fault for new userspace-visible -EFAULTs caused by that flag
would cause new fatal failures for the guest, which would make the feature
actually harmful. But as far as I know (and please lmk if I'm wrong), the
memslot flag only needs to be used by the kvm_handle_error_pfn (x86) and
user_mem_abort (arm64) functions, meaning that those are the only places where
we need to check/populate kvm_run.memory_fault for new userspace-visible
-EFAULTs, as sketched below.
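
To illustrate, I'd expect the x86 side to amount to a small check at that one
site. (Sketch only, based on my reading of the current code:
kvm_slot_fast_gup_only() is a stand-in accessor for the new memslot flag, and
kvm_handle_efault() is the helper from above.)

    static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
                                    struct kvm_page_fault *fault)
    {
            ...
            /*
             * On a fast-gup-only memslot, a failed fast gup surfaces here as
             * an error pfn: report the faulting range to userspace instead
             * of returning a bare -EFAULT.
             */
            if (fault->slot && kvm_slot_fast_gup_only(fault->slot))
                    return kvm_handle_efault(vcpu, gfn_to_gpa(fault->gfn),
                                             PAGE_SIZE, 0);

            return -EFAULT;
    }

user_mem_abort() on arm64 would grow the equivalent check at the point where
it currently turns a failed gup into -EFAULT.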
> > Whereas a false positive might cause a double-write to the KVM_RUN struct,
> > either putting incorrect information in kvm_run.memory_fault or
>
> Recording unused information on -EFAULT in kvm_run doesn't make the
> information incorrect.
>
> > corrupting another member of the union.
>
> Only if KVM accesses guest memory after initiating an exit to userspace,
> which would be a KVM bug irrespective of kvm_run.memory_fault.

Ah good: I was concerned that this was a valid set of code paths in KVM.
Although I'm assuming that "initiating an exit to userspace" includes the
"returning -EFAULT from KVM_RUN" cases, because we wouldn't want -EFAULTs to
stomp on each other as well (the kvm_mmu_do_page_fault usages were supposed to
be one such example, though I'm glad to know that they're not a problem).

> And if we're really concerned about clobbering state, we could add
> hardening/auditing code to ensure that KVM actually exits when
> kvm_run.exit_reason is set (though there are a non-zero number of
> exceptions, e.g. the aforementioned MMIO mess, nested SVM/VMX pages, and
> probably a few others).
>
> Prior to cleanups a few years back[2], emulation failures had issues similar
> to what we are discussing, where KVM would fail to exit to userspace, not
> fill kvm_run, etc. Those are the types of bugs I want to avoid here.
>
> [1] https://lkml.kernel.org/r/ZBNrWZQhMX8AHzWM%40google.com
> [2] https://lore.kernel.org/kvm/20190823010709.24879-1-sean.j.christopherson@xxxxxxxxx
>
> > > > That in turn looks problematic for the
> > > > memory-fault-exit-on-fast-gup-failure part of this series, because
> > > > there are at least a couple of cases for which kvm_mmu_do_page_fault
> > > > will -EFAULT. One is the early-efault-on-fast-gup-failure case which
> > > > was the original purpose of this series. Another is a -EFAULT from
> > > > FNAME(fetch) (passed up through FNAME(page_fault)). There might be
> > > > other cases as well. But unless userspace can/should resolve *all*
> > > > such -EFAULTs in the same manner, a kvm_run.memory_fault populated in
> > > > "kvm_mmu_page_fault" wouldn't be actionable.
> > >
> > > Killing the VM, which is what all VMMs do today in response to -EFAULT,
> > > is an action. As I've pointed out elsewhere in this thread, userspace
> > > needs to be able to identify "faults" that it (userspace) can resolve
> > > without a hint from KVM.
> > >
> > > In other words, KVM is still returning -EFAULT (or a variant thereof);
> > > the _only_ difference, for all intents and purposes, is that userspace
> > > is given a bit more information about the source of the -EFAULT.
> > >
> > > > At least, not without a whole lot of plumbing code to make it so.
> > >
> > > Plumbing where?
> >
> > In this example, I meant plumbing code to get a kvm_run.memory_fault.flags
> > which is more specific than (e.g.) MEMFAULT_REASON_PAGE_FAULT_FAILURE from
> > the -EFAULT paths under kvm_mmu_page_fault. My idea for how userspace
> > would distinguish fast-gup failures was that kvm_faultin_pfn would set a
> > special bit in kvm_run.memory_fault.flags to indicate its failure. But
> > (still assuming that we shouldn't have false-positive
> > kvm_run.memory_fault fills) if the memory_fault can only be populated
> > from kvm_mmu_page_fault then either failures from FNAME(page_fault) and
> > kvm_faultin_pfn will be indistinguishable to userspace, or those
> > functions will need to plumb more specific exit reasons all the way up
> > to kvm_mmu_page_fault.
>
> Setting a flag that essentially says "failure when handling a guest page
> fault" is problematic on multiple fronts. Tying the ABI to KVM's internal
> implementation is not an option, i.e. the ABI would need to be defined as
> "on page faults from the guest". And then the resulting behavior would be
> non-deterministic, e.g. userspace would see different behavior if KVM
> accessed a "bad" gfn via emulation instead of in response to a guest page
> fault. And because of hardware TLBs, it would even be possible for the
> behavior to be non-deterministic on the same platform running the same
> guest code (though this would be extremely unlikely in practice).
>
> And even if userspace is ok with only handling guest page faults _today_, I
> highly doubt that will hold forever. I.e. at some point there will be a use
> case that wants to react to uaccess failures on fast-only memslots.
>
> Ignoring all of those issues, simply flagging "this -EFAULT occurred when
> handling a guest page fault" isn't precise enough for userspace to blindly
> resolve the failure. Even if KVM went through the trouble of setting
> information if and only if get_user_page_fast_only() failed while handling
> a guest page fault, userspace would still need/want a way to verify that
> the failure was expected and can be resolved, e.g. to guard against
> userspace bugs due to wrongly unmapping or mprotecting a page.
>
> > But, since you've made this point elsewhere, my guess is that your answer
> > is that it's actually userspace's job to detect the "specific" reason for
> > the fault and resolve it.
>
> Yes, it's userspace's responsibility. I simply don't see how KVM can provide
> information that userspace doesn't already have without creating an
> unmaintainable uABI, at least not without some deep, deep plumbing into
> gup(). I.e. unless gup() were changed to explicitly communicate that it
> failed because of a uffd equivalent, at best a flag in kvm_run would be a
> hint that userspace _might_ be able to resolve the fault. And even if we
> modified gup(), we'd still have all the open questions about what to do
> when KVM encounters a fault on a uaccess.
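
That all makes sense, and it matches the userspace model I had in mind: on a
memory-fault exit the VMM checks the reported range against its own
bookkeeping of not-yet-fetched pages, and either resolves the fault and
re-enters the vCPU, or treats it as fatal exactly as it would today. Roughly
(userspace sketch; postcopy_range_is_pending() and fetch_and_install_pages()
are invented stand-ins for the VMM's demand-paging machinery, and the field
names again assume this series' layout):

    struct kvm_run *run = ...;  /* mmap()ed from the vCPU fd */

    for (;;) {
            int ret = ioctl(vcpu_fd, KVM_RUN, 0);

            if (ret < 0 && errno == EFAULT &&
                run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
                    uint64_t gpa = run->memory_fault.gpa;
                    uint64_t len = run->memory_fault.len;

                    /*
                     * Userspace, not KVM, decides whether the fault is
                     * resolvable: anything we aren't expecting is a bug,
                     * and fatal to the VM just like a bare -EFAULT today.
                     */
                    if (!postcopy_range_is_pending(gpa, len))
                            die("unexpected -EFAULT at gpa 0x%llx",
                                (unsigned long long)gpa);

                    /* e.g. UFFDIO_COPY, or a plain write to the mapping */
                    fetch_and_install_pages(gpa, len);
                    continue;  /* re-enter the guest and retry */
            }
            /* ... all other exits handled as before ... */
    }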