On Tue, Mar 21, 2023 at 12:43 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Tue, Mar 21, 2023, Anish Moorthy wrote:
> > On Tue, Mar 21, 2023 at 8:21 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Mar 20, 2023, Anish Moorthy wrote:
> > > > On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > > Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
> > > > > never sees the data, i.e. userspace is completely unaware. This behavior is not
> > > > > ideal from a KVM perspective as allowing KVM to fill the kvm_run union without
> > > > > exiting to userspace can lead to other bugs, e.g. effective corruption of the
> > > > > kvm_run union, but at least from a uABI perspective, the behavior is acceptable.
> > > >
> > > > Actually, I don't think the idea of filling in kvm_run.memory_fault
> > > > for -EFAULTs which don't make it to userspace works at all. Consider
> > > > the direct_map function, which bubbles its -EFAULT to
> > > > kvm_mmu_do_page_fault. kvm_mmu_do_page_fault is called from both
> > > > kvm_arch_async_page_ready (which ignores the return value), and by
> > > > kvm_mmu_page_fault (where the return value does make it to userspace).
> > > > Populating kvm_run.memory_fault anywhere in or under
> > > > kvm_mmu_do_page_fault seems an immediate no-go, because a wayward
> > > > kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
> > > > already-set kvm_run.memory_fault / other kvm_run field.
> > >
> > > This particular case is a non-issue. kvm_check_async_pf_completion() is called
> > > only when the current task has control of the vCPU, i.e. is the current "running"
> > > vCPU. That's not a coincidence either, invoking kvm_mmu_do_page_fault() without
> > > having control of the vCPU would be fraught with races, e.g. the entire KVM MMU
> > > context would be unstable.
> > >
> > > That will hold true for all cases. Using a vCPU that is not loaded (not the
> > > current "running" vCPU in KVM's misleading terminology) to access guest memory is
> > > simply not safe, as the vCPU state is non-deterministic. There are paths where
> > > KVM accesses, and even modifies, vCPU state asynchronously, e.g. for IRQ delivery
> > > and making requests, but those are very controlled flows with dedicated machinery
> > > to make them SMP safe.
> > >
> > > That said, I agree that there's a risk that KVM could clobber vcpu->run_run by
> > > hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> > > the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> > > without the target vCPU being loaded:
> > >
> > > int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> > > {
> > >         preempt_disable();
> > >         if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> > >                 goto out;
> > >
> > >         vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > >         ...
> > > out:
> > >         preempt_enable();
> > >         return -EFAULT;
> > > }
> > >
> > > FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without guaranteeing
> > > that KVM "immediately" exits to userspace isn't ideal, but given the amount of
> > > historical code that we need to deal with, it seems like the lesser of all evils.
> > > Unless I'm misunderstanding the use cases, unnecessarily filling kvm_run is a far
> > > better failure mode than KVM not filling kvm_run when it should, i.e. false
> > > positives are ok, false negatives are fatal.
> >
> > Don't you have this in reverse?
>
> No, I don't think so.
>
> > False negatives will just result in userspace not having useful extra
> > information for the -EFAULT it receives from KVM_RUN, in which case userspace
> > can do what you mentioned all VMMs do today and just terminate the VM.
>
> And that is _really_ bad behavior if we have any hope of userspace actually being
> able to rely on this functionality. E.g. any false negative when userspace is
> trying to do postcopy demand paging will be fatal to the VM.
>
> > Whereas a false positive might cause a double-write to the KVM_RUN struct,
> > either putting incorrect information in kvm_run.memory_fault or
>
> Recording unused information on -EFAULT in kvm_run doesn't make the information
> incorrect.

Let's say that some function (converted to annotate its -EFAULTs) fills in
kvm_run.memory_fault, but the -EFAULT is suppressed and never returned from
KVM_RUN. What if, later within the same KVM_RUN call, some other function
(which we've completely overlooked) returns -EFAULT, and that return value
actually does make it out to userspace? Userspace would get stale
information, which could be catastrophic.

Actually, even performing the annotations only in functions that currently
always bubble -EFAULTs to userspace still seems brittle: if new callers are
ever added which don't bubble the -EFAULTs, then we end up in the same
situation.
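
To make that concrete, here is a rough sketch of the ordering I'm worried
about (the helper names are made up purely for illustration, they're not
actual KVM functions):

/* Annotated path: fills kvm_run, but a caller swallows the -EFAULT. */
static int annotated_helper(struct kvm_vcpu *vcpu)
{
        vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
        /* ... fill in vcpu->run->memory_fault ... */
        return -EFAULT;
}

/* Overlooked path: never converted, returns a bare -EFAULT. */
static int overlooked_helper(struct kvm_vcpu *vcpu)
{
        return -EFAULT;
}

static int some_vcpu_run_path(struct kvm_vcpu *vcpu)
{
        int r;

        r = annotated_helper(vcpu);
        if (r == -EFAULT)
                r = 0;  /* swallowed, but memory_fault is left populated */

        r = overlooked_helper(vcpu);
        return r;       /* reaches userspace with the stale memory_fault */
}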