On Tue, Mar 21, 2023 at 8:21 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Mar 20, 2023, Anish Moorthy wrote:
> > On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Mar 17, 2023, Anish Moorthy wrote:
> > > > On Fri, Mar 17, 2023 at 2:50 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > > I wonder if we can get away with returning -EFAULT, but still filling vcpu->run
> > > > > with KVM_EXIT_MEMORY_FAULT and all the other metadata. That would likely simplify
> > > > > the implementation greatly, and would let KVM fill vcpu->run unconditionally. KVM
> > > > > would still need a capability to advertise support to userspace, but userspace
> > > > > wouldn't need to opt in. I think this may have been my very original thought, and
> > > > > I just never actually wrote it down...
> > > >
> > > > Oh, good to know that's actually an option. I thought of that too, but
> > > > assumed that returning a negative error code was a no-go for a proper
> > > > vCPU exit. But if that's not true then I think it's the obvious
> > > > solution, because it precludes any uncaught behavior-change bugs.
> > > >
> > > > A couple of notes:
> > > > 1. Since we'll likely miss some -EFAULT returns, we'll need to make
> > > > sure that the user can check for / doesn't see a stale
> > > > kvm_run::memory_fault field when a missed -EFAULT makes it to
> > > > userspace. It's a small and easy-to-fix detail, but I thought I'd
> > > > point it out.
> > >
> > > Ya, this is the main concern for me as well. I'm not as confident that it's
> > > easy to fix/avoid, though.
> > >
> > > > 2. I don't think this would simplify the series that much, since we
> > > > still need to find the call sites returning -EFAULT to userspace and
> > > > populate memory_fault only in those spots, to avoid populating it for
> > > > -EFAULTs which don't make it to userspace.
> > >
> > > Filling kvm_run::memory_fault even if KVM never exits to userspace is perfectly
> > > ok. It's not ideal, but it's ok.
> > >
> > > > We *could* relax that condition and just document that memory_fault should be
> > > > ignored when KVM_RUN does not return -EFAULT... but I don't think that's a
> > > > good solution from a coder/maintainer perspective.
> > >
> > > You've got things backward. memory_fault _must_ be ignored if KVM doesn't return
> > > the associated "magic combo", where the magic value is either "0+KVM_EXIT_MEMORY_FAULT"
> > > or "-EFAULT+KVM_EXIT_MEMORY_FAULT".
> > >
> > > Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
> > > never sees the data, i.e. userspace is completely unaware. This behavior is not
> > > ideal from a KVM perspective, as allowing KVM to fill the kvm_run union without
> > > exiting to userspace can lead to other bugs, e.g. effective corruption of the
> > > kvm_run union, but at least from a uABI perspective the behavior is acceptable.
> >
> > Actually, I don't think the idea of filling in kvm_run.memory_fault
> > for -EFAULTs which don't make it to userspace works at all. Consider
> > the direct_map function, which bubbles its -EFAULT up to
> > kvm_mmu_do_page_fault. kvm_mmu_do_page_fault is called both from
> > kvm_arch_async_page_ready (which ignores the return value) and from
> > kvm_mmu_page_fault (where the return value does make it to userspace).
> >
> > Populating kvm_run.memory_fault anywhere in or under
> > kvm_mmu_do_page_fault seems an immediate no-go, because a wayward
> > kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
> > already-set kvm_run.memory_fault / other kvm_run field.
>
> This particular case is a non-issue. kvm_check_async_pf_completion() is called
> only when the current task has control of the vCPU, i.e. is the current "running"
> vCPU. That's not a coincidence either: invoking kvm_mmu_do_page_fault() without
> having control of the vCPU would be fraught with races, e.g. the entire KVM MMU
> context would be unstable.
>
> That will hold true for all cases. Using a vCPU that is not loaded (not the
> current "running" vCPU in KVM's misleading terminology) to access guest memory is
> simply not safe, as the vCPU state is non-deterministic. There are paths where
> KVM accesses, and even modifies, vCPU state asynchronously, e.g. for IRQ delivery
> and making requests, but those are very controlled flows with dedicated machinery
> to make them SMP safe.
>
> That said, I agree that there's a risk that KVM could clobber vcpu->run by
> hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> without the target vCPU being loaded:
>
> int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> {
> 	preempt_disable();
> 	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> 		goto out;
>
> 	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> 	...
> out:
> 	preempt_enable();
> 	return -EFAULT;
> }
>
> FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without guaranteeing
> that KVM "immediately" exits to userspace isn't ideal, but given the amount of
> historical code that we need to deal with, it seems like the lesser of all evils.
> Unless I'm misunderstanding the use cases, unnecessarily filling kvm_run is a far
> better failure mode than KVM not filling kvm_run when it should, i.e. false
> positives are ok, false negatives are fatal.

Don't you have this in reverse? False negatives will just result in userspace
not having useful extra information for the -EFAULT it receives from KVM_RUN,
in which case userspace can do what you mentioned all VMMs do today and just
terminate the VM. Whereas a false positive might cause a double-write to the
kvm_run struct, either putting incorrect information in kvm_run.memory_fault
or corrupting another member of the union.

> > That in turn looks problematic for the
> > memory-fault-exit-on-fast-gup-failure part of this series, because
> > there are at least a couple of cases for which kvm_mmu_do_page_fault
> > will return -EFAULT. One is the early-efault-on-fast-gup-failure case which
> > was the original purpose of this series. Another is a -EFAULT from
> > FNAME(fetch) (passed up through FNAME(page_fault)). There might be
> > other cases as well. But unless userspace can/should resolve *all*
> > such -EFAULTs in the same manner, a kvm_run.memory_fault populated in
> > "kvm_mmu_page_fault" wouldn't be actionable.
>
> Killing the VM, which is what all VMMs do today in response to -EFAULT, is an
> action. As I've pointed out elsewhere in this thread, userspace needs to be able
> to identify "faults" that it (userspace) can resolve without a hint from KVM.
>
> In other words, KVM is still returning -EFAULT (or a variant thereof); the _only_
> difference, for all intents and purposes, is that userspace is given a bit more
> information about the source of the -EFAULT.
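FWIW, the way I've been picturing the userspace side of that
"-EFAULT + KVM_EXIT_MEMORY_FAULT" combo is roughly the sketch below. It
assumes the new exit reason and kvm_run.memory_fault fields from this series
are visible in the headers, and handle_annotated_fault() plus the gpa/flags
field names are just placeholders for illustration, not the final uAPI:

#include <errno.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Placeholder: resolve the annotated fault, e.g. by populating the
 * corresponding memslot range; returns 0 if the vCPU can be re-entered. */
static int handle_annotated_fault(__u64 gpa, __u64 flags)
{
	/* ... VMM-specific ... */
	return 0;
}

/* Run the vCPU once; returns 0 if the caller should simply re-enter. */
static int run_vcpu_once(int vcpu_fd, struct kvm_run *run)
{
	int ret = ioctl(vcpu_fd, KVM_RUN, NULL);

	if (ret < 0 && errno == EFAULT &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		/* The "magic combo": -EFAULT plus an annotated exit. */
		return handle_annotated_fault(run->memory_fault.gpa,
					      run->memory_fault.flags);
	}

	if (ret < 0) {
		/* Un-annotated -EFAULT (or any other error): same as today,
		 * nothing better to do than terminate the VM. */
		abort();
	}

	/* ret == 0: dispatch on run->exit_reason as usual. */
	return 0;
}

That at least keeps the "only trust memory_fault when KVM_RUN returned the
magic combo" rule in one place on the userspace side.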
> > At least, not without a whole lot of plumbing code to make it so.
>
> Plumbing where?

In this example, I meant plumbing code to get a kvm_run.memory_fault.flags
which is more specific than (e.g.) MEMFAULT_REASON_PAGE_FAULT_FAILURE from the
-EFAULT paths under kvm_mmu_page_fault. My idea for how userspace would
distinguish fast-gup failures was that kvm_faultin_pfn would set a special bit
in kvm_run.memory_fault.flags to indicate its failure. But (still assuming
that we shouldn't have false-positive kvm_run.memory_fault fills) if
memory_fault can only be populated from kvm_mmu_page_fault, then failures from
FNAME(page_fault) and kvm_faultin_pfn will either be indistinguishable to
userspace, or those functions will need to plumb more specific exit reasons
all the way up to kvm_mmu_page_fault.

But, since you've made this point elsewhere, my guess is that your answer is
that it's actually userspace's job to detect the "specific" reason for the
fault and resolve it.
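To make that middle paragraph concrete, what I had in mind was something like
the following. The flag names are invented purely for illustration (nothing
defines them today), and getting KVM to report that distinction is exactly the
plumbing I was referring to:

#include <stdlib.h>
#include <linux/kvm.h>

/*
 * Hypothetical kvm_run.memory_fault.flags values: a generic "page fault
 * handling failed" reason used by most -EFAULT paths under
 * kvm_mmu_page_fault(), plus a dedicated bit that kvm_faultin_pfn() would
 * set when fast gup fails on an absent mapping.
 */
#define MEMFAULT_REASON_PAGE_FAULT_FAILURE	(1ULL << 0)
#define MEMFAULT_REASON_ABSENT_MAPPING		(1ULL << 1)

static void handle_memory_fault_exit(struct kvm_run *run)
{
	if (run->memory_fault.flags & MEMFAULT_REASON_ABSENT_MAPPING) {
		/*
		 * Fast gup failed because the mapping is absent: userspace
		 * can demand-fetch/populate the page and re-enter the vCPU.
		 */
		return;
	}

	/*
	 * Anything else is just "page fault handling failed" with no further
	 * detail, which userspace can't act on beyond what it already does
	 * for a bare -EFAULT: terminate the VM.
	 */
	abort();
}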