On Tue, Mar 21, 2023 at 8:21 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Mar 20, 2023, Anish Moorthy wrote:
> > On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Mar 17, 2023, Anish Moorthy wrote:
> > > > On Fri, Mar 17, 2023 at 2:50 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > > I wonder if we can get away with returning -EFAULT, but still filling vcpu->run
> > > > > with KVM_EXIT_MEMORY_FAULT and all the other metadata. That would likely simplify
> > > > > the implementation greatly, and would let KVM fill vcpu->run unconditionally. KVM
> > > > > would still need a capability to advertise support to userspace, but userspace
> > > > > wouldn't need to opt in. I think this may have been my very original thought, and
> > > > > I just never actually wrote it down...
> > > >
> > > > Oh, good to know that's actually an option. I thought of that too, but
> > > > assumed that returning a negative error code was a no-go for a proper
> > > > vCPU exit. But if that's not true then I think it's the obvious
> > > > solution, because it precludes any uncaught behavior-change bugs.
> > > >
> > > > A couple of notes:
> > > > 1. Since we'll likely miss some -EFAULT returns, we'll need to make
> > > > sure that the user can check for / doesn't see a stale
> > > > kvm_run::memory_fault field when a missed -EFAULT makes it to
> > > > userspace. It's a small and easy-to-fix detail, but I thought I'd
> > > > point it out.
> > >
> > > Ya, this is the main concern for me as well. I'm not as confident that it's
> > > easy to fix/avoid, though.
> > >
> > > > 2. I don't think this would simplify the series that much, since we
> > > > still need to find the call sites returning -EFAULT to userspace and
> > > > populate memory_fault only in those spots, to avoid populating it for
> > > > -EFAULTs which don't make it to userspace.
> > >
> > > Filling kvm_run::memory_fault even if KVM never exits to userspace is perfectly
> > > ok. It's not ideal, but it's ok.
> > >
> > > > We *could* relax that condition and just document that memory_fault should be
> > > > ignored when KVM_RUN does not return -EFAULT... but I don't think that's a
> > > > good solution from a coder/maintainer perspective.
> > >
> > > You've got things backward. memory_fault _must_ be ignored if KVM doesn't return
> > > the associated "magic combo", where the magic value is either "0+KVM_EXIT_MEMORY_FAULT"
> > > or "-EFAULT+KVM_EXIT_MEMORY_FAULT".
> > >
> > > Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
> > > never sees the data, i.e. userspace is completely unaware. This behavior is not
> > > ideal from a KVM perspective, as allowing KVM to fill the kvm_run union without
> > > exiting to userspace can lead to other bugs, e.g. effective corruption of the
> > > kvm_run union, but at least from a uABI perspective the behavior is acceptable.
> >
> > Actually, I don't think the idea of filling in kvm_run.memory_fault
> > for -EFAULTs which don't make it to userspace works at all. Consider
> > the direct_map function, which bubbles its -EFAULT up to
> > kvm_mmu_do_page_fault. kvm_mmu_do_page_fault is called both from
> > kvm_arch_async_page_ready (which ignores the return value) and from
> > kvm_mmu_page_fault (where the return value does make it to userspace).
> >
> > Populating kvm_run.memory_fault anywhere in or under
> > kvm_mmu_do_page_fault seems an immediate no-go, because a wayward
> > kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
> > already-set kvm_run.memory_fault / other kvm_run field.
>
> This particular case is a non-issue. kvm_check_async_pf_completion() is called
> only when the current task has control of the vCPU, i.e. is the current "running"
> vCPU. That's not a coincidence either: invoking kvm_mmu_do_page_fault() without
> having control of the vCPU would be fraught with races, e.g. the entire KVM MMU
> context would be unstable.
>
> That will hold true for all cases. Using a vCPU that is not loaded (not the
> current "running" vCPU in KVM's misleading terminology) to access guest memory is
> simply not safe, as the vCPU state is non-deterministic. There are paths where
> KVM accesses, and even modifies, vCPU state asynchronously, e.g. for IRQ delivery
> and making requests, but those are very controlled flows with dedicated machinery
> to make them SMP safe.
>
> That said, I agree that there's a risk that KVM could clobber vcpu->run by
> hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> without the target vCPU being loaded:
>
> int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> {
> 	preempt_disable();
> 	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> 		goto out;
>
> 	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> 	...
> out:
> 	preempt_enable();
> 	return -EFAULT;
> }
>
> FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without guaranteeing
> that KVM "immediately" exits to userspace isn't ideal, but given the amount of
> historical code that we need to deal with, it seems like the lesser of all evils.
> Unless I'm misunderstanding the use cases, unnecessarily filling kvm_run is a far
> better failure mode than KVM not filling kvm_run when it should, i.e. false
> positives are ok, false negatives are fatal.

Don't you have this in reverse? False negatives will just result in userspace
not having useful extra information for the -EFAULT it receives from KVM_RUN,
in which case userspace can do what you mentioned all VMMs do today and just
terminate the VM. Whereas a false positive might cause a double-write to the
kvm_run struct, either putting incorrect information in kvm_run.memory_fault
or corrupting another member of the union.

> > That in turn looks problematic for the
> > memory-fault-exit-on-fast-gup-failure part of this series, because
> > there are at least a couple of cases for which kvm_mmu_do_page_fault
> > will return -EFAULT. One is the early-efault-on-fast-gup-failure case which
> > was the original purpose of this series. Another is a -EFAULT from
> > FNAME(fetch) (passed up through FNAME(page_fault)). There might be
> > other cases as well. But unless userspace can/should resolve *all*
> > such -EFAULTs in the same manner, a kvm_run.memory_fault populated in
> > "kvm_mmu_page_fault" wouldn't be actionable.
>
> Killing the VM, which is what all VMMs do today in response to -EFAULT, is an
> action. As I've pointed out elsewhere in this thread, userspace needs to be able
> to identify "faults" that it (userspace) can resolve without a hint from KVM.
>
> In other words, KVM is still returning -EFAULT (or a variant thereof); the _only_
> difference, for all intents and purposes, is that userspace is given a bit more
> information about the source of the -EFAULT.
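FWIW, the way I've been picturing the userspace side of that
"-EFAULT + KVM_EXIT_MEMORY_FAULT" combo is roughly the sketch below. It
assumes the new exit reason and kvm_run.memory_fault fields from this series
are visible in the headers, and handle_annotated_fault() plus the gpa/flags
field names are just placeholders for illustration, not the final uAPI:

#include <errno.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Placeholder: resolve the annotated fault, e.g. by populating the
 * corresponding memslot range; returns 0 if the vCPU can be re-entered. */
static int handle_annotated_fault(__u64 gpa, __u64 flags)
{
	/* ... VMM-specific ... */
	return 0;
}

/* Run the vCPU once; returns 0 if the caller should simply re-enter. */
static int run_vcpu_once(int vcpu_fd, struct kvm_run *run)
{
	int ret = ioctl(vcpu_fd, KVM_RUN, NULL);

	if (ret < 0 && errno == EFAULT &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		/* The "magic combo": -EFAULT plus an annotated exit. */
		return handle_annotated_fault(run->memory_fault.gpa,
					      run->memory_fault.flags);
	}

	if (ret < 0) {
		/* Un-annotated -EFAULT (or any other error): same as today,
		 * nothing better to do than terminate the VM. */
		abort();
	}

	/* ret == 0: dispatch on run->exit_reason as usual. */
	return 0;
}

That at least keeps the "only trust memory_fault when KVM_RUN returned the
magic combo" rule in one place on the userspace side.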
> > At least, not without a whole lot of plumbing code to make it so.
>
> Plumbing where?

In this example, I meant plumbing code to get a kvm_run.memory_fault.flags
which is more specific than (e.g.) MEMFAULT_REASON_PAGE_FAULT_FAILURE from the
-EFAULT paths under kvm_mmu_page_fault. My idea for how userspace would
distinguish fast-gup failures was that kvm_faultin_pfn would set a special bit
in kvm_run.memory_fault.flags to indicate its failure. But (still assuming
that we shouldn't have false-positive kvm_run.memory_fault fills) if
memory_fault can only be populated from kvm_mmu_page_fault, then failures from
FNAME(page_fault) and kvm_faultin_pfn will either be indistinguishable to
userspace, or those functions will need to plumb more specific exit reasons
all the way up to kvm_mmu_page_fault.

But, since you've made this point elsewhere, my guess is that your answer is
that it's actually userspace's job to detect the "specific" reason for the
fault and resolve it.
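To make that middle paragraph concrete, what I had in mind was something like
the following. The flag names are invented purely for illustration (nothing
defines them today), and getting KVM to report that distinction is exactly the
plumbing I was referring to:

#include <stdlib.h>
#include <linux/kvm.h>

/*
 * Hypothetical kvm_run.memory_fault.flags values: a generic "page fault
 * handling failed" reason used by most -EFAULT paths under
 * kvm_mmu_page_fault(), plus a dedicated bit that kvm_faultin_pfn() would
 * set when fast gup fails on an absent mapping.
 */
#define MEMFAULT_REASON_PAGE_FAULT_FAILURE	(1ULL << 0)
#define MEMFAULT_REASON_ABSENT_MAPPING		(1ULL << 1)

static void handle_memory_fault_exit(struct kvm_run *run)
{
	if (run->memory_fault.flags & MEMFAULT_REASON_ABSENT_MAPPING) {
		/*
		 * Fast gup failed because the mapping is absent: userspace
		 * can demand-fetch/populate the page and re-enter the vCPU.
		 */
		return;
	}

	/*
	 * Anything else is just "page fault handling failed" with no further
	 * detail, which userspace can't act on beyond what it already does
	 * for a bare -EFAULT: terminate the VM.
	 */
	abort();
}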