On Sun, Aug 04, 2024, Michal Luczaj wrote:
> On 8/1/24 14:41, Will Deacon wrote:
> > On Wed, Jul 31, 2024 at 09:18:56AM -0700, Sean Christopherson wrote:
> >> [...]
> >> Ya, the basic problem is that we have two ways of publishing the vCPU, fd and
> >> vcpu_array, with no way of setting both atomically.  Given that xa_store() should
> >> never fail, I vote we do the simple thing and deliberately leak the memory.
> > 
> > I'm inclined to agree. This conversation did momentarily get me worried
> > about the window between the successful create_vcpu_fd() and the
> > xa_store(), but it looks like 'kvm->online_vcpus' protects that.
> 
> I'll spin a v2 leaking the vCPU, then.
> 
> But perhaps you're right. The window you've described may be an issue.
> For example:
> 
> static u64 get_time_ref_counter(struct kvm *kvm)
> {
> 	...
> 	vcpu = kvm_get_vcpu(kvm, 0);	// may still be NULL
> 	tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> 	return mul_u64_u64_shr(tsc, hv->tsc_ref.tsc_scale, 64)
> 		+ hv->tsc_ref.tsc_offset;
> }
> 
> u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
> {
> 	return vcpu->arch.l1_tsc_offset +
> 		kvm_scale_tsc(host_tsc, vcpu->arch.l1_tsc_scaling_ratio);
> }
> 
> After stuffing msleep() between fd install and vcpu_array store:
> 
> [  125.296110] BUG: kernel NULL pointer dereference, address: 0000000000000b38
> [  125.296203] #PF: supervisor read access in kernel mode
> [  125.296266] #PF: error_code(0x0000) - not-present page
> [  125.296327] PGD 12539e067 P4D 12539e067 PUD 12539d067 PMD 0
> [  125.296392] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> [  125.296454] CPU: 12 UID: 1000 PID: 1179 Comm: a.out Not tainted 6.11.0-rc1nokasan+ #19
> [  125.296521] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> [  125.296585] RIP: 0010:kvm_read_l1_tsc+0x6/0x50 [kvm]
> [  125.297376] Call Trace:
> [  125.297430]  <TASK>
> [  125.297919]  get_time_ref_counter+0x70/0x90 [kvm]
> [  125.298039]  kvm_hv_get_msr_common+0xc1/0x7d0 [kvm]
> [  125.298150]  __kvm_get_msr+0x72/0xf0 [kvm]
> [  125.298421]  do_get_msr+0x16/0x50 [kvm]
> [  125.298531]  msr_io+0x9d/0x110 [kvm]
> [  125.298626]  kvm_arch_vcpu_ioctl+0xdc5/0x19c0 [kvm]
> [  125.299345]  kvm_vcpu_ioctl+0x6cc/0x920 [kvm]
> [  125.299540]  __x64_sys_ioctl+0x90/0xd0
> [  125.299582]  do_syscall_64+0x93/0x180
> [  125.300206]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  125.300243] RIP: 0033:0x7f2d64aded2d
> 
> So, is get_time_ref_counter() broken (with a trivial fix) or should it be
> considered a regression after commit afb2acb2e3a3
> ("KVM: Fix vcpu_array[0] races")?

The latter, though arguably afb2acb2e3a3 isn't really a regression since it
essentially just reverts to the pre-XArray code, i.e. the bug was always
there, it was just temporarily masked by a worse bug.

I don't think we want to go down the path of declaring get_time_ref_counter()
broken, because that is going to result in an impossible programming model.

Ha!  We can kill two birds with one stone.  If we take vcpu->mutex before
installing the file descriptor, and hold it until online_vcpus is bumped,
userspace

Argh, so close, kvm_arch_vcpu_async_ioctl() throws a wrench in that idea.

Double argh, whether or not an ioctl is async is buried in arch code.

I still think it makes sense to grab vcpu->mutex for synchronous ioctls.  That
way there's no visible change to userspace, and we can lean on that code to
reject the async ioctls, as I can't imagine there's a practical use case for
emitting an async ioctl without first doing a synchronous ioctl.  E.g. in
addition to the below patch, plus changes to add kvm_arch_is_async_vcpu_ioctl():

	/*
	 * Some architectures have vcpu ioctls that are asynchronous to vcpu
	 * execution; mutex_lock() would break them.  Disallow asynchronous
	 * ioctls until the vCPU is fully online.  This can only happen if
	 * userspace has *never* done a synchronous ioctl, as acquiring the
	 * vCPU's mutex ensures the vCPU is online, i.e. this isn't a
	 * restriction for any practical use case.
	 */
	if (kvm_arch_is_async_vcpu_ioctl(ioctl)) {
		if (vcpu->vcpu_idx >= atomic_read(&kvm->online_vcpus))
			return -EINVAL;

		return kvm_vcpu_async_ioctl(filp, ioctl, arg);
	}
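To be clear, kvm_arch_is_async_vcpu_ioctl() doesn't exist today; the name and
the idea of a dedicated per-arch predicate are assumptions for the sake of the
sketch above.  On an architecture whose only asynchronous vCPU ioctl is
KVM_INTERRUPT, the predicate could plausibly be as dumb as:

	/*
	 * Illustrative only: mirror whatever ioctls the architecture's
	 * kvm_arch_vcpu_async_ioctl() actually handles (several architectures
	 * only handle KVM_INTERRUPT there).
	 */
	static inline bool kvm_arch_is_async_vcpu_ioctl(unsigned int ioctl)
	{
		return ioctl == KVM_INTERRUPT;
	}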
Alternatively, we could go for the super simple change and cross our fingers
that no "real" VMM emits vCPU ioctls before KVM_CREATE_VCPU returns.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..9ae9022a015f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4450,6 +4450,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
 	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
 		return -EINVAL;
 
+	if (unlikely(vcpu->vcpu_idx >= atomic_read(&kvm->online_vcpus)))
+		return -EINVAL;
+
 	/*
 	 * Some architectures have vcpu ioctls that are asynchronous to vcpu
 	 * execution; mutex_lock() would break them.

The mutex approach, sans async ioctl support:

---
 virt/kvm/kvm_main.c | 28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..0a9c390b18a3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4269,12 +4269,6 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
 
 	mutex_lock(&kvm->lock);
 
-#ifdef CONFIG_LOCKDEP
-	/* Ensure that lockdep knows vcpu->mutex is taken *inside* kvm->lock */
-	mutex_lock(&vcpu->mutex);
-	mutex_unlock(&vcpu->mutex);
-#endif
-
 	if (kvm_get_vcpu_by_id(kvm, id)) {
 		r = -EEXIST;
 		goto unlock_vcpu_destroy;
@@ -4285,15 +4279,29 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
 	if (r)
 		goto unlock_vcpu_destroy;
 
-	/* Now it's all set up, let userspace reach it */
+	/*
+	 * Now it's all set up, let userspace reach it.  Grab the vCPU's mutex
+	 * so that userspace can't invoke vCPU ioctl()s until the vCPU is fully
+	 * visible (per online_vcpus), e.g. so that KVM doesn't get tricked
+	 * into a NULL-pointer dereference because KVM thinks the _current_
+	 * vCPU doesn't exist.  As a bonus, taking vcpu->mutex ensures lockdep
+	 * knows it's taken *inside* kvm->lock.
+	 */
+	mutex_lock(&vcpu->mutex);
 	kvm_get_kvm(kvm);
 	r = create_vcpu_fd(vcpu);
 	if (r < 0)
 		goto kvm_put_xa_release;
 
+	/*
+	 * xa_store() should never fail, see xa_reserve() above.  Leak the vCPU
+	 * if the impossible happens, as userspace already has access to the
+	 * vCPU, i.e. freeing the vCPU before userspace puts its file reference
+	 * would trigger a use-after-free.
+	 */
 	if (KVM_BUG_ON(xa_store(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, 0), kvm)) {
-		r = -EINVAL;
-		goto kvm_put_xa_release;
+		mutex_unlock(&vcpu->mutex);
+		return -EINVAL;
 	}
 
 	/*
@@ -4302,6 +4310,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
 	 */
 	smp_wmb();
 	atomic_inc(&kvm->online_vcpus);
+	mutex_unlock(&vcpu->mutex);
 	mutex_unlock(&kvm->lock);
 
 	kvm_arch_vcpu_postcreate(vcpu);
@@ -4309,6 +4318,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
 	return r;
 
 kvm_put_xa_release:
+	mutex_unlock(&vcpu->mutex);
 	kvm_put_kvm_no_destroy(kvm);
 	xa_release(&kvm->vcpu_array, vcpu->vcpu_idx);
 unlock_vcpu_destroy:

base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f
-- 
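For illustration only (not part of either patch above): a hypothetical
userspace sketch of the window being discussed.  The fd-number guessing, the
choice of HV_X64_MSR_TIME_REF_COUNT, and the lack of error handling are all
assumptions made for brevity; this is not the reporter's actual a.out.

	/*
	 * Hypothetical reproducer sketch: thread B guesses the fd number that
	 * KVM_CREATE_VCPU will install and hammers KVM_GET_MSRS for
	 * HV_X64_MSR_TIME_REF_COUNT on it.  If an ioctl lands in the window
	 * between create_vcpu_fd() and xa_store(), get_time_ref_counter()
	 * sees kvm_get_vcpu(kvm, 0) == NULL.
	 */
	#include <fcntl.h>
	#include <pthread.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	#define HV_X64_MSR_TIME_REF_COUNT 0x40000020	/* not in uapi headers */

	static int vcpu_fd_guess;

	static void *hammer(void *arg)
	{
		struct {
			struct kvm_msrs hdr;
			struct kvm_msr_entry entry;
		} msrs = {
			.hdr.nmsrs = 1,
			.entry.index = HV_X64_MSR_TIME_REF_COUNT,
		};

		for (;;)	/* EBADF until the fd is installed, then it races */
			ioctl(vcpu_fd_guess, KVM_GET_MSRS, &msrs);
		return NULL;
	}

	int main(void)
	{
		int kvm_fd = open("/dev/kvm", O_RDWR);
		int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
		pthread_t t;

		vcpu_fd_guess = vm_fd + 1;	/* assumes no other fds get opened */
		pthread_create(&t, NULL, hammer, NULL);
		ioctl(vm_fd, KVM_CREATE_VCPU, 0);
		pthread_join(t, NULL);		/* never returns; run under a timeout */
		return 0;
	}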