On Thu, Mar 07, 2024, David Matlack wrote:
> Mark a vCPU as preempted/ready if-and-only-if it's scheduled out while
> running. i.e. Do not mark a vCPU preempted/ready if it's scheduled out
> during a non-KVM_RUN ioctl() or when userspace is doing KVM_RUN with
> immediate_exit.
> 
> Commit 54aa83c90198 ("KVM: x86: do not set st->preempted when going back
> to user space") stopped marking a vCPU as preempted when returning to
> userspace, but if userspace then invokes a KVM vCPU ioctl() that gets
> preempted, the vCPU will be marked preempted/ready. This is arguably
> incorrect behavior since the vCPU was not actually preempted while the
> guest was running, it was preempted while doing something on behalf of
> userspace.
> 
> This commit also avoids KVM dirtying guest memory after userspace has
> paused vCPUs, e.g. for Live Migration, which allows userspace to collect
> the final dirty bitmap before or in parallel with saving vCPU state
> without having to worry about saving vCPU state triggering writes to
> guest memory.
> 
> Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> Signed-off-by: David Matlack <dmatlack@xxxxxxxxxx>
> ---
> v2:
>  - Drop Google-specific "PRODKERNEL: " shortlog prefix
> 
> v1: https://lore.kernel.org/kvm/20231218185850.1659570-1-dmatlack@xxxxxxxxxx/
> 
>  include/linux/kvm_host.h | 1 +
>  virt/kvm/kvm_main.c      | 5 ++++-
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7e7fd25b09b3..5b2300614d22 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -378,6 +378,7 @@ struct kvm_vcpu {
>  		bool dy_eligible;
>  	} spin_loop;
>  #endif
> +	bool wants_to_run;
>  	bool preempted;
>  	bool ready;
>  	struct kvm_vcpu_arch arch;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ff588677beb7..3da1b2e3785d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4438,7 +4438,10 @@ static long kvm_vcpu_ioctl(struct file *filp,
>  			synchronize_rcu();
>  			put_pid(oldpid);
>  		}
> +		vcpu->wants_to_run = !vcpu->run->immediate_exit;
>  		r = kvm_arch_vcpu_ioctl_run(vcpu);
> +		vcpu->wants_to_run = false;
> +
>  		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
>  		break;
>  	}
> @@ -6312,7 +6315,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
>  {
>  	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
>  
> -	if (current->on_rq) {
> +	if (current->on_rq && vcpu->wants_to_run) {
>  		WRITE_ONCE(vcpu->preempted, true);
>  		WRITE_ONCE(vcpu->ready, true);
>  	}
> 
> base-commit: 687d8f4c3dea0758afd748968d91288220bbe7e3

Long story short, I was playing around with wants_to_run for a few
harebrained ideas, and realized that there's a TOCTOU bug here.
Userspace can toggle run->immediate_exit at will, e.g. can clear it
after the kernel loads it to compute vcpu->wants_to_run.

That's not fatal for this use case, since userspace would only be
shooting itself in the foot, but it leaves a very dangerous landmine,
e.g. if something else in KVM keys off of vcpu->wants_to_run to detect
that KVM is in its core run loop, i.e. relies on wants_to_run being set
whenever KVM is actually running the vCPU.

To address that, I think we should have all architectures check
wants_to_run, not immediate_exit. And loading immediate_exit needs to
be done with READ_ONCE().
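To make the race concrete, here's a purely illustrative userspace
sketch (the vcpu_fd/run setup and the flipper thread are hypothetical,
not from the patch). The kernel must act on a single snapshot of the
flag, because userspace can rewrite it at any instant:

/*
 * Hypothetical reproducer sketch; assumes vcpu_fd is a vCPU fd and run
 * is its mmap()'d struct kvm_run. Thread B rewrites immediate_exit
 * while thread A is inside KVM_RUN, so any kernel-side re-read of
 * run->immediate_exit can disagree with the value that was sampled
 * into vcpu->wants_to_run at ioctl() entry.
 */
#include <pthread.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int vcpu_fd;		/* vCPU fd, created elsewhere */
static struct kvm_run *run;	/* mmap(NULL, ..., vcpu_fd, 0) */

static void *flip_immediate_exit(void *arg)
{
	for (;;)		/* Thread B: toggle the flag forever. */
		__atomic_fetch_xor(&run->immediate_exit, 1, __ATOMIC_RELAXED);
	return NULL;
}

static void thread_a(void)
{
	pthread_t tb;

	pthread_create(&tb, NULL, flip_immediate_exit, NULL);

	run->immediate_exit = 1;	/* Sampled once at KVM_RUN entry... */
	ioctl(vcpu_fd, KVM_RUN, 0);	/* ...but may read 0 if re-checked. */
}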
E.g. the fix for x86 (every other arch has similar code):

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e9ef1fa4b90b..1a2f6bf14fb2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11396,7 +11396,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 
 	kvm_vcpu_srcu_read_lock(vcpu);
 	if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
-		if (kvm_run->immediate_exit) {
+		if (!vcpu->wants_to_run) {
 			r = -EINTR;
 			goto out;
 		}
@@ -11474,7 +11474,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		WARN_ON_ONCE(vcpu->mmio_needed);
 	}
 
-	if (kvm_run->immediate_exit) {
+	if (!vcpu->wants_to_run) {
 		r = -EINTR;
 		goto out;
 	}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f9b9ce0c3cd9..0c0aae224000 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1497,9 +1497,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 					struct kvm_guest_debug *dbg);
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu);
 
-void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
-
-void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool sched_in);
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id);
 int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9501fbd5dfd2..4384bbdba65c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4410,7 +4410,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
 			synchronize_rcu();
 			put_pid(oldpid);
 		}
-		vcpu->wants_to_run = !vcpu->run->immediate_exit;
+		vcpu->wants_to_run = !READ_ONCE(vcpu->run->immediate_exit);
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
 		vcpu->wants_to_run = false;
 
---

Hmm, and we should probably go a step further and actively prevent
using immediate_exit from the kernel, e.g. rename it to something scary
like:

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..9c5fe1dae744 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -196,7 +196,11 @@ struct kvm_xen_exit {
 struct kvm_run {
 	/* in */
 	__u8 request_interrupt_window;
+#ifndef __KERNEL__
 	__u8 immediate_exit;
+#else
+	__u8 hidden_do_not_touch;
+#endif
 	__u8 padding1[6];
 
 	/* out */
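Userspace builds would be unaffected since they never define
__KERNEL__, and the ABI layout is untouched; only in-kernel references
to run->immediate_exit would stop compiling. If we're paranoid, a
layout assertion (hypothetical, e.g. in kvm_main.c, not part of the
diff above) could also keep the two sides of the #ifdef from drifting:

/*
 * Hypothetical guard: the kernel-only alias must occupy the exact byte
 * userspace knows as immediate_exit, i.e. the one-byte field at offset
 * 1, immediately after request_interrupt_window.
 */
static_assert(offsetof(struct kvm_run, hidden_do_not_touch) == 1);
static_assert(sizeof_field(struct kvm_run, hidden_do_not_touch) == 1);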