Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field

On Tue, Mar 21, 2023, Anish Moorthy wrote:
> On Tue, Mar 21, 2023 at 8:21 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Mon, Mar 20, 2023, Anish Moorthy wrote:
> > > On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
> > > > never sees the data, i.e. userspace is completely unaware.  This behavior is not
> > > > ideal from a KVM perspective as allowing KVM to fill the kvm_run union without
> > > > exiting to userspace can lead to other bugs, e.g. effective corruption of the
> > > > kvm_run union, but at least from a uABI perspective, the behavior is acceptable.
> > >
> > > Actually, I don't think the idea of filling in kvm_run.memory_fault
> > > for -EFAULTs which don't make it to userspace works at all. Consider
> > > the direct_map function, which bubbles its -EFAULT to
> > > kvm_mmu_do_page_fault. kvm_mmu_do_page_fault is called from both
> > > kvm_arch_async_page_ready (which ignores the return value), and by
> > > kvm_mmu_page_fault (where the return value does make it to userspace).
> > > Populating kvm_run.memory_fault anywhere in or under
> > > kvm_mmu_do_page_fault seems an immediate no-go, because a wayward
> > > kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
> > > already-set kvm_run.memory_fault / other kvm_run field.
> >
> > This particular case is a non-issue.  kvm_check_async_pf_completion() is called
> > only when the current task has control of the vCPU, i.e. is the current "running"
> > vCPU.  That's not a coincidence either, invoking kvm_mmu_do_page_fault() without
> > having control of the vCPU would be fraught with races, e.g. the entire KVM MMU
> > context would be unstable.
> >
> > That will hold true for all cases.  Using a vCPU that is not loaded (not the
> > current "running" vCPU in KVM's misleading terminology) to access guest memory is
> > simply not safe, as the vCPU state is non-deterministic.  There are paths where
> > KVM accesses, and even modifies, vCPU state asynchronously, e.g. for IRQ delivery
> > and making requests, but those are very controlled flows with dedicated machinery
> > to make them SMP safe.
> >
> > That said, I agree that there's a risk that KVM could clobber vcpu->run by
> > hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> > the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> > without the target vCPU being loaded:
> >
> >         int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> >         {
> >                 preempt_disable();
> >                 if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> >                         goto out;
> >
> >                 vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> >                 ...
> >         out:
> >                 preempt_enable();
> >                 return -EFAULT;
> >         }
> >
> > FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without guaranteeing
> > that KVM "immediately" exits to userspace isn't ideal, but given the amount of
> > historical code that we need to deal with, it seems like the lesser of all evils.
> > Unless I'm misunderstanding the use cases, unnecessarily filling kvm_run is a far
> > better failure mode than KVM not filling kvm_run when it should, i.e. false
> > positives are ok, false negatives are fatal.
> 
> Don't you have this in reverse?

No, I don't think so.

> False negatives will just result in userspace not having useful extra
> information for the -EFAULT it receives from KVM_RUN, in which case userspace
> can do what you mentioned all VMMs do today and just terminate the VM.

And that is _really_ bad behavior if we have any hope of userspace actually being
able to rely on this functionality.  E.g. any false negative when userspace is
trying to do postcopy demand paging will be fatal to the VM.
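
To make that concrete, the VMM's run loop ends up looking something like the
below (purely illustrative: the memory_fault field names follow this series'
proposal, and fetch_and_install_page() is a stand-in for whatever the VMM does
to pull page contents from the migration source):

        #include <errno.h>
        #include <stdint.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Stand-in for the VMM's postcopy machinery. */
        extern int fetch_and_install_page(uint64_t gpa, uint64_t size);

        int vcpu_run_loop(int vcpu_fd, struct kvm_run *run)
        {
                for (;;) {
                        if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
                                if (errno == EFAULT &&
                                    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
                                        /* Resolve the fault, then re-enter the guest. */
                                        if (fetch_and_install_page(run->memory_fault.gpa,
                                                                   run->memory_fault.size))
                                                return -1;
                                        continue;
                                }
                                /* -EFAULT with no memory_fault info: no choice but to die. */
                                return -1;
                        }

                        /* Hand all other exits (MMIO, PIO, shutdown, ...) to the rest of the VMM. */
                        return 0;
                }
        }

A false negative lands in the bare "return -1" path, i.e. kills the VM.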

> Whereas a false positive might cause a double-write to the KVM_RUN struct,
> either putting incorrect information in kvm_run.memory_fault or

Recording unused information on -EFAULT in kvm_run doesn't make the information
incorrect.

> corrupting another member of the union.

Only if KVM accesses guest memory after initiating an exit to userspace, which
would be a KVM bug irrespective of kvm_run.memory_fault.  We actually have exactly
this type of bug today in the trainwreck that is KVM's MMIO emulation[1], but
KVM gets away with the shoddy behavior by virtue of the scenario simply not being
triggered by any real-world code.

And if we're really concerned about clobbering state, we could add hardening/auditing
code to ensure that KVM actually exits when kvm_run.exit_reason is set (though there
are a non-zero number of exceptions, e.g. the aforementioned MMIO mess, nested SVM/VMX
pages, and probably a few others).
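
E.g. a strawman for the auditing (completely untested, and "memory_fault_pending"
is a hypothetical field that exists only for this sketch):

        static void kvm_queue_memory_fault_exit(struct kvm_vcpu *vcpu, gpa_t gpa,
                                                u64 size, u64 flags)
        {
                vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
                vcpu->run->memory_fault.gpa = gpa;
                vcpu->run->memory_fault.size = size;
                vcpu->run->memory_fault.flags = flags;
                vcpu->memory_fault_pending = true;
        }

        /* In vcpu_enter_guest(), before entry: yell if a queued exit was dropped. */
        WARN_ON_ONCE(vcpu->memory_fault_pending);

        /* In kvm_arch_vcpu_ioctl_run(), once the exit actually reaches userspace: */
        vcpu->memory_fault_pending = false;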

Prior to cleanups a few years back[2], emulation failures had issues similar to
what we are discussing, where KVM would fail to exit to userspace, not fill kvm_run,
etc.  Those are the types of bugs I want to avoid here.

[1] https://lkml.kernel.org/r/ZBNrWZQhMX8AHzWM%40google.com
[2] https://lore.kernel.org/kvm/20190823010709.24879-1-sean.j.christopherson@xxxxxxxxx

> > > That in turn looks problematic for the
> > > memory-fault-exit-on-fast-gup-failure part of this series, because
> > > there are at least a couple of cases for which kvm_mmu_do_page_fault
> > > will -EFAULT. One is the early-efault-on-fast-gup-failure case which
> > > was the original purpose of this series. Another is a -EFAULT from
> > > FNAME(fetch) (passed up through FNAME(page_fault)). There might be
> > > other cases as well. But unless userspace can/should resolve *all*
> > > such -EFAULTs in the same manner, a kvm_run.memory_fault populated in
> > > "kvm_mmu_page_fault" wouldn't be actionable.
> >
> > Killing the VM, which is what all VMMs do today in response to -EFAULT, is an
> > action.  As I've pointed out elsewhere in this thread, userspace needs to be able
> > to identify "faults" that it (userspace) can resolve without a hint from KVM.
> >
> > In other words, KVM is still returning -EFAULT (or a variant thereof), the _only_
> > difference, for all intents and purposes, is that userspace is given a bit more
> > information about the source of the -EFAULT.
> >
> > > At least, not without a whole lot of plumbing code to make it so.
> >
> > Plumbing where?
> 
> In this example, I meant plumbing code to get a kvm_run.memory_fault.flags
> which is more specific than (eg) MEMFAULT_REASON_PAGE_FAULT_FAILURE from the
> -EFAULT paths under kvm_mmu_page_fault. My idea for how userspace would
> distinguish fast-gup failures was that kvm_faultin_pfn would set a special
> bit in kvm_run.memory_fault.flags to indicate its failure. But (still
> assuming that we shouldn't have false-positive kvm_run.memory_fault fills) if
> the memory_fault can only be populated from kvm_mmu_page_fault then either
> failures from FNAME(page_fault) and kvm_faultin_pfn will be indistinguishable
> to userspace, or those functions will need to plumb more specific exit
> reasons all the way up to kvm_mmu_page_fault.

Setting a flag that essentially says "failure when handling a guest page fault"
is problematic on multiple fronts.  Tying the ABI to KVM's internal implementation
is not an option, i.e. the ABI would need to be defined as "on page faults from
the guest".  And then the resulting behavior would be non-deterministic, e.g.
userspace would see different behavior if KVM accessed a "bad" gfn via emulation
instead of in response to a guest page fault.  And because of hardware TLBs, it
would even be possible for the behavior to be non-deterministic on the same
platform running the same guest code (though this would be extremely unlikely
in practice).

And even if userspace is ok with only handling guest page faults _today_, I highly
doubt that will hold forever.  I.e. at some point there will be a use case that
wants to react to uaccess failures on fast-only memslots.

Ignoring all of those issues, simply flagging "this -EFAULT occurred when
handling a guest page fault" isn't precise enough for userspace to blindly resolve
the failure.  Even if KVM went through the trouble of setting information if and
only if get_user_page_fast_only() failed while handling a guest page fault,
userspace would still need/want a way to verify that the failure was expected and
can be resolved, e.g. to guard against userspace bugs due to wrongly unmapping
or mprotecting a page.
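
E.g. something as simple as the below (illustrative only, the region tracking is
whatever state the VMM already keeps for its postcopy bookkeeping):

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        struct postcopy_region {
                uint64_t gpa_start;
                uint64_t gpa_end;
        };

        /*
         * Only resolve faults on GPAs the VMM deliberately left unpopulated;
         * anything else is a VMM bug, e.g. a stray munmap() or mprotect().
         */
        static bool fault_is_expected(const struct postcopy_region *regions,
                                      size_t nr_regions, uint64_t gpa)
        {
                for (size_t i = 0; i < nr_regions; i++) {
                        if (gpa >= regions[i].gpa_start && gpa < regions[i].gpa_end)
                                return true;
                }
                return false;
        }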

> But, since you've made this point elsewhere, my guess is that your answer is
> that it's actually userspace's job to detect the "specific" reason for the
> fault and resolve it.

Yes, it's userspace's responsibility.  I simply don't see how KVM can provide
information that userspace doesn't already have without creating an unmaintainable
uABI, at least not without some deep, deep plumbing into gup().  I.e. unless gup()
were changed to explicitly communicate that it failed because of a uffd equivalent,
at best a flag in kvm_run would be a hint that userspace _might_ be able to resolve
the fault.  And even if we modified gup(), we'd still have all the open questions
about what to do when KVM encounters a fault on a uaccess.



