On Tue, Mar 21, 2023 at 12:43 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Tue, Mar 21, 2023, Anish Moorthy wrote:
> > On Tue, Mar 21, 2023 at 8:21 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Mar 20, 2023, Anish Moorthy wrote:
> > > > On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > > Filling kvm_run::memory_fault but not exiting to userspace is ok because
> > > > > userspace never sees the data, i.e. userspace is completely unaware. This
> > > > > behavior is not ideal from a KVM perspective as allowing KVM to fill the
> > > > > kvm_run union without exiting to userspace can lead to other bugs, e.g.
> > > > > effective corruption of the kvm_run union, but at least from a uABI
> > > > > perspective, the behavior is acceptable.
> > > >
> > > > Actually, I don't think the idea of filling in kvm_run.memory_fault for
> > > > -EFAULTs which don't make it to userspace works at all. Consider the
> > > > direct_map function, which bubbles its -EFAULT up to kvm_mmu_do_page_fault.
> > > > kvm_mmu_do_page_fault is called both by kvm_arch_async_page_ready (which
> > > > ignores the return value) and by kvm_mmu_page_fault (where the return
> > > > value does make it to userspace). Populating kvm_run.memory_fault anywhere
> > > > in or under kvm_mmu_do_page_fault seems an immediate no-go, because a
> > > > wayward kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
> > > > already-set kvm_run.memory_fault or some other kvm_run field.
> > >
> > > This particular case is a non-issue. kvm_check_async_pf_completion() is
> > > called only when the current task has control of the vCPU, i.e. is the
> > > current "running" vCPU. That's not a coincidence either: invoking
> > > kvm_mmu_do_page_fault() without having control of the vCPU would be
> > > fraught with races, e.g. the entire KVM MMU context would be unstable.
> > >
> > > That will hold true for all cases. Using a vCPU that is not loaded (not
> > > the current "running" vCPU, in KVM's misleading terminology) to access
> > > guest memory is simply not safe, as the vCPU state is non-deterministic.
> > > There are paths where KVM accesses, and even modifies, vCPU state
> > > asynchronously, e.g. for IRQ delivery and making requests, but those are
> > > very controlled flows with dedicated machinery to make them SMP safe.
> > >
> > > That said, I agree that there's a risk that KVM could clobber vcpu->run by
> > > hitting an -EFAULT without the vCPU loaded, but that's a solvable problem,
> > > e.g. the helper that fills KVM_EXIT_MEMORY_FAULT could be hardened to yell
> > > if called without the target vCPU being loaded:
> > >
> > >     int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> > >     {
> > >             preempt_disable();
> > >             if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> > >                     goto out;
> > >
> > >             vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > >             ...
> > >     out:
> > >             preempt_enable();
> > >             return -EFAULT;
> > >     }
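
That hardening looks good to me. Just to check my understanding, the filled-in
version would be something like the below? (Rough sketch only: the memory_fault
fields are whichever ones this series ends up with, and the extra parameters
are invented to match them.)

    int kvm_handle_efault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 len, u64 flags)
    {
            preempt_disable();
            /*
             * Bail, but still return -EFAULT, if this task doesn't have the
             * target vCPU loaded: touching vcpu->run from anywhere else is
             * exactly the clobbering risk described above.
             */
            if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
                    goto out;

            vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
            vcpu->run->memory_fault.gpa = gpa;
            vcpu->run->memory_fault.len = len;
            vcpu->run->memory_fault.flags = flags;
    out:
            preempt_enable();
            return -EFAULT;
    }

One nice property: since the helper returns -EFAULT either way, callers can
just do "return kvm_handle_efault(vcpu, ...);" and existing error paths keep
their current behavior even when the WARN fires.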
> > > FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without
> > > guaranteeing that KVM "immediately" exits to userspace isn't ideal, but
> > > given the amount of historical code that we need to deal with, it seems
> > > like the lesser of all evils. Unless I'm misunderstanding the use cases,
> > > unnecessarily filling kvm_run is a far better failure mode than KVM not
> > > filling kvm_run when it should, i.e. false positives are ok, false
> > > negatives are fatal.
> >
> > Don't you have this in reverse?
>
> No, I don't think so.
>
> > False negatives will just result in userspace not having useful extra
> > information for the -EFAULT it receives from KVM_RUN, in which case
> > userspace can do what you mentioned all VMMs do today and just terminate
> > the VM.
>
> And that is _really_ bad behavior if we have any hope of userspace actually
> being able to rely on this functionality. E.g. any false negative when
> userspace is trying to do postcopy demand paging will be fatal to the VM.

But -EFAULTs from KVM_RUN are already fatal today, so there's no new failure
introduced by an -EFAULT w/o a populated memory_fault field, right? Obviously
that's of no real use to userspace, but that seems like part of the point of
starting with a partial conversion: to allow for filling holes in the
implementation in the future.

It seems like what you're really concerned about here is the interaction with
the memslot fast-gup-only flag. Obviously, failing to populate
kvm_run.memory_fault for new userspace-visible -EFAULTs caused by that flag
would cause new fatal failures for the guest, which would make the feature
actually harmful. But as far as I know (and please lmk if I'm wrong), the
memslot flag only needs to be used by the kvm_handle_error_pfn (x86) and
user_mem_abort (arm64) functions, meaning that those are the only places where
we need to check/populate kvm_run.memory_fault for new userspace-visible
-EFAULTs, as sketched below.
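
To illustrate, I'd expect the x86 side to amount to a small check at that one
site. (Sketch only, based on my reading of the current code:
kvm_slot_fast_gup_only() is a stand-in accessor for the new memslot flag, and
kvm_handle_efault() is the helper from above.)

    static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
                                    struct kvm_page_fault *fault)
    {
            ...
            /*
             * On a fast-gup-only memslot, a failed fast gup surfaces here as
             * an error pfn: report the faulting range to userspace instead
             * of returning a bare -EFAULT.
             */
            if (fault->slot && kvm_slot_fast_gup_only(fault->slot))
                    return kvm_handle_efault(vcpu, gfn_to_gpa(fault->gfn),
                                             PAGE_SIZE, 0);

            return -EFAULT;
    }

user_mem_abort() on arm64 would grow the equivalent check at the point where
it currently turns a failed gup into -EFAULT.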
> > Whereas a false positive might cause a double-write to the KVM_RUN struct,
> > either putting incorrect information in kvm_run.memory_fault or
>
> Recording unused information on -EFAULT in kvm_run doesn't make the
> information incorrect.
>
> > corrupting another member of the union.
>
> Only if KVM accesses guest memory after initiating an exit to userspace,
> which would be a KVM bug irrespective of kvm_run.memory_fault.

Ah good: I was concerned that this was a valid set of code paths in KVM.
Although I'm assuming that "initiating an exit to userspace" includes the
"returning -EFAULT from KVM_RUN" cases, because we wouldn't want -EFAULTs to
stomp on each other as well (the kvm_mmu_do_page_fault usages were supposed to
be one such example, though I'm glad to know that they're not a problem).

> And if we're really concerned about clobbering state, we could add
> hardening/auditing code to ensure that KVM actually exits when
> kvm_run.exit_reason is set (though there are a non-zero number of
> exceptions, e.g. the aforementioned MMIO mess, nested SVM/VMX pages, and
> probably a few others).
>
> Prior to cleanups a few years back[2], emulation failures had issues similar
> to what we are discussing, where KVM would fail to exit to userspace, not
> fill kvm_run, etc. Those are the types of bugs I want to avoid here.
>
> [1] https://lkml.kernel.org/r/ZBNrWZQhMX8AHzWM%40google.com
> [2] https://lore.kernel.org/kvm/20190823010709.24879-1-sean.j.christopherson@xxxxxxxxx
>
> > > > That in turn looks problematic for the
> > > > memory-fault-exit-on-fast-gup-failure part of this series, because
> > > > there are at least a couple of cases for which kvm_mmu_do_page_fault
> > > > will -EFAULT. One is the early-efault-on-fast-gup-failure case which
> > > > was the original purpose of this series. Another is a -EFAULT from
> > > > FNAME(fetch) (passed up through FNAME(page_fault)). There might be
> > > > other cases as well. But unless userspace can/should resolve *all*
> > > > such -EFAULTs in the same manner, a kvm_run.memory_fault populated in
> > > > "kvm_mmu_page_fault" wouldn't be actionable.
> > >
> > > Killing the VM, which is what all VMMs do today in response to -EFAULT,
> > > is an action. As I've pointed out elsewhere in this thread, userspace
> > > needs to be able to identify "faults" that it (userspace) can resolve
> > > without a hint from KVM.
> > >
> > > In other words, KVM is still returning -EFAULT (or a variant thereof);
> > > the _only_ difference, for all intents and purposes, is that userspace
> > > is given a bit more information about the source of the -EFAULT.
> > >
> > > > At least, not without a whole lot of plumbing code to make it so.
> > >
> > > Plumbing where?
> >
> > In this example, I meant plumbing code to get a kvm_run.memory_fault.flags
> > which is more specific than (e.g.) MEMFAULT_REASON_PAGE_FAULT_FAILURE from
> > the -EFAULT paths under kvm_mmu_page_fault. My idea for how userspace
> > would distinguish fast-gup failures was that kvm_faultin_pfn would set a
> > special bit in kvm_run.memory_fault.flags to indicate its failure. But
> > (still assuming that we shouldn't have false-positive
> > kvm_run.memory_fault fills) if the memory_fault can only be populated
> > from kvm_mmu_page_fault then either failures from FNAME(page_fault) and
> > kvm_faultin_pfn will be indistinguishable to userspace, or those
> > functions will need to plumb more specific exit reasons all the way up
> > to kvm_mmu_page_fault.
>
> Setting a flag that essentially says "failure when handling a guest page
> fault" is problematic on multiple fronts. Tying the ABI to KVM's internal
> implementation is not an option, i.e. the ABI would need to be defined as
> "on page faults from the guest". And then the resulting behavior would be
> non-deterministic, e.g. userspace would see different behavior if KVM
> accessed a "bad" gfn via emulation instead of in response to a guest page
> fault. And because of hardware TLBs, it would even be possible for the
> behavior to be non-deterministic on the same platform running the same
> guest code (though this would be extremely unlikely in practice).
>
> And even if userspace is ok with only handling guest page faults _today_, I
> highly doubt that will hold forever. I.e. at some point there will be a use
> case that wants to react to uaccess failures on fast-only memslots.
>
> Ignoring all of those issues, simply flagging "this -EFAULT occurred when
> handling a guest page fault" isn't precise enough for userspace to blindly
> resolve the failure. Even if KVM went through the trouble of setting
> information if and only if get_user_page_fast_only() failed while handling
> a guest page fault, userspace would still need/want a way to verify that
> the failure was expected and can be resolved, e.g. to guard against
> userspace bugs due to wrongly unmapping or mprotecting a page.
>
> > But, since you've made this point elsewhere, my guess is that your answer
> > is that it's actually userspace's job to detect the "specific" reason for
> > the fault and resolve it.
>
> Yes, it's userspace's responsibility. I simply don't see how KVM can provide
> information that userspace doesn't already have without creating an
> unmaintainable uABI, at least not without some deep, deep plumbing into
> gup(). I.e. unless gup() were changed to explicitly communicate that it
> failed because of a uffd equivalent, at best a flag in kvm_run would be a
> hint that userspace _might_ be able to resolve the fault. And even if we
> modified gup(), we'd still have all the open questions about what to do
> when KVM encounters a fault on a uaccess.
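
That all makes sense, and it matches the userspace model I had in mind: on a
memory-fault exit the VMM checks the reported range against its own
bookkeeping of not-yet-fetched pages, and either resolves the fault and
re-enters the vCPU, or treats it as fatal exactly as it would today. Roughly
(userspace sketch; postcopy_range_is_pending() and fetch_and_install_pages()
are invented stand-ins for the VMM's demand-paging machinery, and the field
names again assume this series' layout):

    struct kvm_run *run = ...;  /* mmap()ed from the vCPU fd */

    for (;;) {
            int ret = ioctl(vcpu_fd, KVM_RUN, 0);

            if (ret < 0 && errno == EFAULT &&
                run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
                    uint64_t gpa = run->memory_fault.gpa;
                    uint64_t len = run->memory_fault.len;

                    /*
                     * Userspace, not KVM, decides whether the fault is
                     * resolvable: anything we aren't expecting is a bug,
                     * and fatal to the VM just like a bare -EFAULT today.
                     */
                    if (!postcopy_range_is_pending(gpa, len))
                            die("unexpected -EFAULT at gpa 0x%llx",
                                (unsigned long long)gpa);

                    /* e.g. UFFDIO_COPY, or a plain write to the mapping */
                    fetch_and_install_pages(gpa, len);
                    continue;  /* re-enter the guest and retry */
            }
            /* ... all other exits handled as before ... */
    }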