Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field

Sean Christopherson <seanjc@xxxxxxxxxx> · Tue, 4 Apr 2023 15:07:02 -0700

On Tue, Apr 04, 2023, Anish Moorthy wrote:
> On Tue, Apr 4, 2023 at 12:35 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > Let's say that some function (converted to annotate its EFAULTs) fills
> > > in kvm_run.memory_fault, but the EFAULT is suppressed from being
> > > returned from kvm_run. What if, later within the same kvm_run call,
> > > some other function (which we've completely overlooked) EFAULTs and
> > > that return value actually does make it out to kvm_run? Userspace
> > > would get stale information, which could be catastrophic.
> >
> > "catastrophic" is a bit hyperbolic.  Yes, it would be bad, but at _worst_ userspace
> > will kill the VM, which is the status quo today.
> 
> Well what I'm saying is that in these cases userspace *wouldn't know*
> that kvm_run.memory_fault contains incorrect information for the
> -EFAULT it actually got (do you disagree?),

I disagree in the sense that if the stale information causes a problem, then by
definition userspace has to know.  It's the whole "if a tree falls in a forest"
thing.  If KVM reports stale information and literally nothing bad happens, ever,
then is the superfluous exit really a problem?  Not saying it wouldn't be treated
as a bug, just that it might not even warrant a stable backport if the worst case
scenario is a spurious exit to userspace (for example).

> which could presumably cause it to do bad things like "resolve" faults on
> incorrect pages and/or infinite-loop on KVM_RUN, etc.

Putting the vCPU into an infinite loop is _very_ visible, e.g. see the entire
mess surrounding commit 31c25585695a ("Revert "KVM: SVM: avoid infinite loop on
NPF from bad address"").

As above, fixing pages that don't need to be fixed isn't itself a major problem.
If the extra exits lead to a performance issue, then _that_ is a problem, but
again _something_ has to detect the problem and thus it becomes a known thing.

> Annotating the efault information as valid only from the call sites
> which return directly to userspace prevents this class of problem, at
> the cost of allowing un-annotated EFAULTs to make it to userspace. But
> to me, paying that cost to make sure the EFAULT information is always
> correct seems by far preferable to not paying it and allowing
> userspace to get silently incorrect information.

I don't think that's a maintainable approach.  Filling kvm_run if and only if the
-EFAULT has a direct path to userspace is (a) going to require a signficant amount
of code churn and (b) falls apart the instant code further up the stack changes.
E.g. the relatively straightforward page fault case requires bouncing through 7+
functions to get from kvm_handle_error_pfn() to kvm_arch_vcpu_ioctl_run(), and not
all of those are obviously "direct"

	if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
		r = kvm_tdp_page_fault(vcpu, &fault);
	else
		r = vcpu->arch.mmu->page_fault(vcpu, &fault);

	if (fault.write_fault_to_shadow_pgtable && emulation_type)
		*emulation_type |= EMULTYPE_WRITE_PF_TO_SP;

	/*
	 * Similar to above, prefetch faults aren't truly spurious, and the
	 * async #PF path doesn't do emulation.  Do count faults that are fixed
	 * by the async #PF handler though, otherwise they'll never be counted.
	 */
	if (r == RET_PF_FIXED)
		vcpu->stat.pf_fixed++;
	else if (prefetch)
		;
	else if (r == RET_PF_EMULATE)
		vcpu->stat.pf_emulate++;
	else if (r == RET_PF_SPURIOUS)
		vcpu->stat.pf_spurious++;
	return r;

...

	if (r == RET_PF_INVALID) {
		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
					  lower_32_bits(error_code), false,
					  &emulation_type);
		if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
			return -EIO;
	}

	if (r < 0)
		return r;
	if (r != RET_PF_EMULATE)
		return 1;

In other words, the "only if it's direct" rule requires visually auditing changes,
i.e. catching "violations" via code review, not only to code that adds a new -EFAULT
return, but to all code throughout rather large swaths of KVM.  The odds of us (or
whoever the future maintainers/reviewers are) remembering to enforce the "rule", let
alone actually having 100% accuracy, are basically nil.

On the flip side, if we add a helper to fill kvm_run and return -EFAULT, then we can
add rule that only time KVM is allowed to return a bare -EFAULT is immediately after
a uaccess, i.e. after copy_to/from_user() and the many variants.  And _that_ can be
enforced through static checkers, e.g. someone with more (read: any) awk/sed skills
than me could bang something out in a matter of minutes.  Such a static checker won't
catch everything, but there would be very, very few bare non-uaccess -EFAULTS left,
and those could be filtered out with an allowlist, e.g. similar to how the folks that
run smatch and whatnot deal with false positives.