On Tue, Mar 21, 2023 at 12:43 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Tue, Mar 21, 2023, Anish Moorthy wrote:
> > On Tue, Mar 21, 2023 at 8:21 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Mar 20, 2023, Anish Moorthy wrote:
> > > > On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > > Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
> > > > > never sees the data, i.e. userspace is completely unaware. This behavior is not
> > > > > ideal from a KVM perspective as allowing KVM to fill the kvm_run union without
> > > > > exiting to userspace can lead to other bugs, e.g. effective corruption of the
> > > > > kvm_run union, but at least from a uABI perspective, the behavior is acceptable.
> > > >
> > > > Actually, I don't think the idea of filling in kvm_run.memory_fault
> > > > for -EFAULTs which don't make it to userspace works at all. Consider
> > > > the direct_map function, which bubbles its -EFAULT to
> > > > kvm_mmu_do_page_fault. kvm_mmu_do_page_fault is called from both
> > > > kvm_arch_async_page_ready (which ignores the return value), and by
> > > > kvm_mmu_page_fault (where the return value does make it to userspace).
> > > > Populating kvm_run.memory_fault anywhere in or under
> > > > kvm_mmu_do_page_fault seems an immediate no-go, because a wayward
> > > > kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
> > > > already-set kvm_run.memory_fault / other kvm_run field.
> > >
> > > This particular case is a non-issue. kvm_check_async_pf_completion() is called
> > > only when the current task has control of the vCPU, i.e. is the current "running"
> > > vCPU. That's not a coincidence either, invoking kvm_mmu_do_page_fault() without
> > > having control of the vCPU would be fraught with races, e.g. the entire KVM MMU
> > > context would be unstable.
> > >
> > > That will hold true for all cases. Using a vCPU that is not loaded (not the
> > > current "running" vCPU in KVM's misleading terminology) to access guest memory is
> > > simply not safe, as the vCPU state is non-deterministic. There are paths where
> > > KVM accesses, and even modifies, vCPU state asynchronously, e.g. for IRQ delivery
> > > and making requests, but those are very controlled flows with dedicated machinery
> > > to make them SMP safe.
> > >
> > > That said, I agree that there's a risk that KVM could clobber vcpu->run_run by
> > > hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> > > the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> > > without the target vCPU being loaded:
> > >
> > > int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> > > {
> > >         preempt_disable();
> > >         if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> > >                 goto out;
> > >
> > >         vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > >         ...
> > > out:
> > >         preempt_enable();
> > >         return -EFAULT;
> > > }
> > >
> > > FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without guaranteeing
> > > that KVM "immediately" exits to userspace isn't ideal, but given the amount of
> > > historical code that we need to deal with, it seems like the lesser of all evils.
> > > Unless I'm misunderstanding the use cases, unnecessarily filling kvm_run is a far
> > > better failure mode than KVM not filling kvm_run when it should, i.e. false
> > > positives are ok, false negatives are fatal.
> >
> > Don't you have this in reverse?
>
> No, I don't think so.
>
> > False negatives will just result in userspace not having useful extra
> > information for the -EFAULT it receives from KVM_RUN, in which case userspace
> > can do what you mentioned all VMMs do today and just terminate the VM.
>
> And that is _really_ bad behavior if we have any hope of userspace actually being
> able to rely on this functionality. E.g. any false negative when userspace is
> trying to do postcopy demand paging will be fatal to the VM.
>
> > Whereas a false positive might cause a double-write to the KVM_RUN struct,
> > either putting incorrect information in kvm_run.memory_fault or
>
> Recording unused information on -EFAULT in kvm_run doesn't make the information
> incorrect.

Let's say that some function (converted to annotate its -EFAULTs) fills in
kvm_run.memory_fault, but the -EFAULT is suppressed and never returned from
KVM_RUN. What if, later within the same KVM_RUN call, some other function
(which we've completely overlooked) returns -EFAULT, and that return value
actually does make it out to userspace? Userspace would get stale
information, which could be catastrophic.

Actually, even performing the annotations only in functions that currently
always bubble -EFAULTs to userspace still seems brittle: if new callers are
ever added which don't bubble the -EFAULTs, then we end up in the same
situation.
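
To make that concrete, here is a rough sketch of the ordering I'm worried
about (the helper names are made up purely for illustration, they're not
actual KVM functions):

/* Annotated path: fills kvm_run, but a caller swallows the -EFAULT. */
static int annotated_helper(struct kvm_vcpu *vcpu)
{
        vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
        /* ... fill in vcpu->run->memory_fault ... */
        return -EFAULT;
}

/* Overlooked path: never converted, returns a bare -EFAULT. */
static int overlooked_helper(struct kvm_vcpu *vcpu)
{
        return -EFAULT;
}

static int some_vcpu_run_path(struct kvm_vcpu *vcpu)
{
        int r;

        r = annotated_helper(vcpu);
        if (r == -EFAULT)
                r = 0;  /* swallowed, but memory_fault is left populated */

        r = overlooked_helper(vcpu);
        return r;       /* reaches userspace with the stale memory_fault */
}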