Re: [PATCH v4 03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO

Sean Christopherson <seanjc@xxxxxxxxxx> · Wed, 14 Jun 2023 10:35:17 -0700

On Fri, Jun 02, 2023, Anish Moorthy wrote:
> +The 'gpa' and 'len' (in bytes) fields describe the range of guest
> +physical memory to which access failed, i.e. [gpa, gpa + len). 'flags' is a
> +bitfield indicating the nature of the access: valid masks are
> +
> +  - KVM_MEMORY_FAULT_FLAG_WRITE:     The failed access was a write.
> +  - KVM_MEMORY_FAULT_FLAG_EXEC:      The failed access was an exec.

We should also define a READ flag, even though it's not strictly necessary.  That
gives userspace another way to detect "bad" data (flags should never be zero), and
it will allow us to use the same RWX bits that KVM_SET_MEMORY_ATTRIBUTES uses,
e.g. R = BIT(0), W = BIT(1), X = BIT(2) (which just so happens to be the same
RWX definitions EPT uses).

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0e571e973bc2..69a221f71914 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2288,4 +2288,13 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>  
> +/*
> + * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
> + * populate the memory_fault field with the given information.
> + *
> + * WARNs and does nothing if the exit reason is not KVM_EXIT_UNKNOWN, or if
> + * 'vcpu' is not the current running vcpu.
> + */
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,

Tagging a globally visible, non-static function as "inline" is odd, to say the
least.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fd80a560378c..09d4d85691e1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4674,6 +4674,9 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  
>  		return r;
>  	}
> +	case KVM_CAP_MEMORY_FAULT_INFO: {

No need for curly braces.  But that's moot because there's no need for this at
all, just let it fall through to the default handling, which KVM already does for
a number of other informational capabilities.

> +		return -EINVAL;
> +	}
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> @@ -6173,3 +6176,35 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>  
>  	return init_context.err;
>  }
> +
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,

I strongly prefer to avoid "populate" and "efault".  Avoid "populate" because
that verb will become stale the instance we do anything else in the helper.
Avoid "efault" because it's imprecise, i.e. this isn't to be used for just any
old -EFAULT scenario.  Something like kvm_handle_guest_uaccess_fault()? Definitely
open to other names (especially less verbose names).

> +				     uint64_t gpa, uint64_t len, uint64_t flags)
> +{
> +	if (WARN_ON_ONCE(!vcpu))
> +		return;

Drop this and instead invoke the helper if and only if vCPU is guaranteed to be
valid, e.g. in a future patch, don't add a conditional call to __kvm_write_guest_page(),
just handle the -EFAULT in kvm_vcpu_write_guest_page().  If the concern is that
callers would need to manually check "r == -EFAULT", this helper could take in the
error code, same as we do for kvm_handle_memory_failure(), e.g. do 

	if (r != -EFAULT)
		return;

here.  This is also an argument for using a less prescriptive name for the helper.

> +
> +	preempt_disable();
> +	/*
> +	 * Ensure the this vCPU isn't modifying another vCPU's run struct, which
> +	 * would open the door for races between concurrent calls to this
> +	 * function.
> +	 */
> +	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> +		goto out;

Meh, this is overkill IMO.  The check in mark_page_dirty_in_slot() is an
abomination that I wish didn't exist, not a pattern that should be copied.  If
we do keep this sanity check, it can simply be

	if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
		return;

because as the comment for kvm_get_running_vcpu() explains, the returned vCPU
pointer won't change even if this task gets migrated to a different pCPU.  If
this code were doing something with vcpu->cpu then preemption would need to be
disabled throughout, but that's not the case.

> +	/*
> +	 * Try not to overwrite an already-populated run struct.
> +	 * This isn't a perfect solution, as there's no guarantee that the exit
> +	 * reason is set before the run struct is populated, but it should prevent
> +	 * at least some bugs.
> +	 */
> +	else if

Kernel style is to not use if-elif-elif if any of the preceding checks are terminal,
i.e. return or goto.  There's a good reason for that style/rule too, as it allows
avoiding weirdness like this where there's a big block comment in the middle of
an if-elif sequence.

> (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))

As I've stated multiple times, this can't WARN in "normal" builds because userspace
can modify kvm_run fields at will.  I do want a WARN as it will allow fuzzers to
find bugs for us, but it needs to be guarded with a Kconfig (or maybe a module
param).  One idea would be to make the proposed CONFIG_KVM_PROVE_MMU[*] a generic
Kconfig and use that.

And this should not be a terminal condition, i.e. KVM should WARN but continue on.
I am like 99% confident there are existing cases where KVM fills exit_reason
without actually exiting, i.e. bailing will immediately "break" KVM.  On the other
hand, clobbering what came before *might* break KVM, but it might work too.  More
thoughts below.

[*] https://lkml.kernel.org/r/20230511235917.639770-8-seanjc%40google.com

> +		goto out;

Folding in your other reply, as I wanted the full original context.

> What I intended this check to guard against was the first problematic
> case (A) I called out in the cover letter
>
> > The implementation strategy for KVM_CAP_MEMORY_FAULT_INFO has risks: for
> > example, if there are any existing paths in KVM_RUN which cause a vCPU
> > to (1) populate the kvm_run struct then (2) fail a vCPU guest memory
> > access but ignore the failure and then (3) complete the exit to
> > userspace set up in (1), then the contents of the kvm_run struct written
> > in (1) will be corrupted.
>
> What I wrote was actually incorrect, as you may see: if in (1) the
> exit reason != KVM_EXIT_UNKNOWN then the exit-reason-unset check will
> prevent writing to the run struct. Now, if for some reason this flow
> involved populating most of the run struct in (1) but only setting the
> exit reason in (3) then we'd still have a problem: but it's not
> feasible to anticipate everything after all :)
>
> I also mentioned a different error case (B)
>
> > Another example: if KVM_RUN fails a guest memory access for which the
> > EFAULT is annotated but does not return the EFAULT to userspace, then
> > later returns an *un*annotated EFAULT to userspace, then userspace will
> > receive incorrect information.
>
> When the second EFAULT is un-annotated the presence/absence of the
> exit-reason-unset check is irrelevant: userspace will observe an
> annotated EFAULT in place of an un-annotated one either way.
>
> There's also a third interesting case (C) which I didn't mention: an
> annotated EFAULT which is ignored/suppressed followed by one which is
> propagated to userspace. Here the exit-reason-unset check will prevent
> the second annotation from being written, so userspace sees an
> annotation with bad contents, If we believe that case (A) is a weird
> sequence of events that shouldn't be happening in the first place,
> then case (C) seems more important to ensure correctn`ess in. But I
> don't know anything about how often (A) happens in KVM, which is why I
> want others' opinions.
>
> So, should we drop the exit-reason-unset check (and the accompanying
> patch 4) and treat existing occurrences of case (A) as bugs, or should
> we maintain the check at the cost of incorrect behavior in case (C)?
> Or is there another option here entirely?
>
> Sean, I remember you calling out that some of the emulated mmio code
> follows the pattern in (A), but it's been a while and my memory is
> fuzzy. What's your opinion here?

I got a bit (ok, way more than a bit) lost in all of the (A) (B) (C) madness.  I
think this what you intended for each case?

  (A) if there are any existing paths in KVM_RUN which cause a vCPU
      to (1) populate the kvm_run struct then (2) fail a vCPU guest memory
      access but ignore the failure and then (3) complete the exit to
      userspace set up in (1), then the contents of the kvm_run struct written
      in (1) will be corrupted.

  (B) if KVM_RUN fails a guest memory access for which the EFAULT is annotated
      but does not return the EFAULT to userspace, then later returns an *un*annotated
      EFAULT to userspace, then userspace will receive incorrect information.

  (C) an annotated EFAULT which is ignored/suppressed followed by one which is
      propagated to userspace. Here the exit-reason-unset check will prevent the
      second annotation from being written, so userspace sees an annotation with
      bad contents, If we believe that case (A) is a weird sequence of events
      that shouldn't be happening in the first place, then case (C) seems more
      important to ensure correctness in. But I don't know anything about how often
      (A) happens in KVM, which is why I want others' opinions.

(A) does sadly happen.  I wouldn't call it a "pattern" though, it's an unfortunate
side effect of deficiencies in KVM's uAPI.

(B) is the trickiest to defend against in the kernel, but as I mentioned in earlier
versions of this series, userspace needs to guard against a vCPU getting stuck in
an infinite fault anyways, so I'm not _that_ concerned with figuring out a way to
address this in the kernel.  KVM's documentation should strongly encourage userspace
to take action if KVM repeatedly exits with the same info over and over, but beyond
that I think anything else is nice to have, not mandatory.

(C) should simply not be possible.  (A) is very much a "this shouldn't happen,
but it does" case.  KVM provides no meaningful guarantees if (A) does happen, so
yes, prioritizing correctness for (C) is far more important.

That said, prioritizing (C) doesn't mean we can't also do our best to play nice
with (A).  None of the existing exits use anywhere near the exit info union's 256
bytes, i.e. there is tons of space to play with.  So rather than put memory_fault
in with all the others, what if we split the union in two, and place memory_fault
in the high half (doesn't have to literally be half, but you get the idea).  It'd
kinda be similar to x86's contributory vs. benign faults; exits that can't be
"nested" or "speculative" go in the low half, and things like memory_fault go in
the high half.

That way, if (A) does occur, the original information will be preserved when KVM
fills memory_fault.  And my suggestion to WARN-and-continue limits the problematic
scenarios to just fields in the second union, i.e. just memory_fault for now.
At the very least, not clobbering would likely make it easier for us to debug when
things go sideways.

And rather than use kvm_run.exit_reason as the canary, we should carve out a
kernel-only, i.e. non-ABI, field from the union.  That would allow setting the
canary in common KVM code, which can't be done for kvm_run.exit_reason because
some architectures, e.g. s390 (and x86 IIRC), consume the exit_reason early on
in KVM_RUN.

E.g. something like this (the #ifdefs are heinous, it might be better to let
userspace see the exit_canary, but make it abundantly clear that it's not ABI).

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 143abb334f56..233702124e0a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -511,16 +511,43 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID     (1 << 0)
                        __u32 flags;
                } notify;
+               /* Fix the size of the union. */
+               char padding[128];
+       };
+
+       /*
+        * This second KVM_EXIT_* union holds structures for exits that may be
+        * triggered after KVM has already initiated a different exit, and/or
+        * may be filled speculatively by KVM.  E.g. because of limitations in
+        * KVM's uAPI, a memory fault can be encountered after an MMIO exit is
+        * initiated and kvm_run.mmio is filled.  Isolating these structures
+        * from the primary KVM_EXIT_* union ensures that KVM won't clobber
+        * information for the original exit.
+        */
+       union {
                /* KVM_EXIT_MEMORY_FAULT */
                struct {
                        __u64 flags;
                        __u64 gpa;
                        __u64 len;
                } memory_fault;
-               /* Fix the size of the union. */
-               char padding[256];
+               /* Fix the size of this union too. */
+#ifndef __KERNEL__
+               char padding2[128];
+#else
+               char padding2[120];
+#endif
        };
 
+#ifdef __KERNEL__
+       /*
+        * Non-ABI, kernel-only field that KVM uses to detect bugs related to
+        * filling exit_reason and the exit unions, e.g. to guard against
+        * clobbering a previous exit.
+        */
+       __u64 exit_canary;
+#endif
+
        /* 2048 is the size of the char array used to bound/pad the size
         * of the union that holds sync regs.
         */