On Sun, Aug 04, 2024, Michal Luczaj wrote:
> On 8/1/24 14:41, Will Deacon wrote:
> > On Wed, Jul 31, 2024 at 09:18:56AM -0700, Sean Christopherson wrote:
> >> [...]
> >> Ya, the basic problem is that we have two ways of publishing the vCPU, fd and
> >> vcpu_array, with no way of setting both atomically.  Given that xa_store() should
> >> never fail, I vote we do the simple thing and deliberately leak the memory.
> > 
> > I'm inclined to agree. This conversation did momentarily get me worried
> > about the window between the successful create_vcpu_fd() and the
> > xa_store(), but it looks like 'kvm->online_vcpus' protects that.
> 
> I'll spin a v2 leaking the vCPU, then.
> 
> But perhaps you're right. The window you've described may be an issue.
> For example:
> 
> static u64 get_time_ref_counter(struct kvm *kvm)
> {
> 	...
> 	vcpu = kvm_get_vcpu(kvm, 0);	// may still be NULL
> 	tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> 	return mul_u64_u64_shr(tsc, hv->tsc_ref.tsc_scale, 64)
> 		+ hv->tsc_ref.tsc_offset;
> }
> 
> u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
> {
> 	return vcpu->arch.l1_tsc_offset +
> 		kvm_scale_tsc(host_tsc, vcpu->arch.l1_tsc_scaling_ratio);
> }
> 
> After stuffing msleep() between fd install and vcpu_array store:
> 
> [  125.296110] BUG: kernel NULL pointer dereference, address: 0000000000000b38
> [  125.296203] #PF: supervisor read access in kernel mode
> [  125.296266] #PF: error_code(0x0000) - not-present page
> [  125.296327] PGD 12539e067 P4D 12539e067 PUD 12539d067 PMD 0
> [  125.296392] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> [  125.296454] CPU: 12 UID: 1000 PID: 1179 Comm: a.out Not tainted 6.11.0-rc1nokasan+ #19
> [  125.296521] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> [  125.296585] RIP: 0010:kvm_read_l1_tsc+0x6/0x50 [kvm]
> [  125.297376] Call Trace:
> [  125.297430]  <TASK>
> [  125.297919]  get_time_ref_counter+0x70/0x90 [kvm]
> [  125.298039]  kvm_hv_get_msr_common+0xc1/0x7d0 [kvm]
> [  125.298150]  __kvm_get_msr+0x72/0xf0 [kvm]
> [  125.298421]  do_get_msr+0x16/0x50 [kvm]
> [  125.298531]  msr_io+0x9d/0x110 [kvm]
> [  125.298626]  kvm_arch_vcpu_ioctl+0xdc5/0x19c0 [kvm]
> [  125.299345]  kvm_vcpu_ioctl+0x6cc/0x920 [kvm]
> [  125.299540]  __x64_sys_ioctl+0x90/0xd0
> [  125.299582]  do_syscall_64+0x93/0x180
> [  125.300206]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  125.300243] RIP: 0033:0x7f2d64aded2d
> 
> So, is get_time_ref_counter() broken (with a trivial fix) or should it be
> considered a regression after commit afb2acb2e3a3
> ("KVM: Fix vcpu_array[0] races")?

The latter, though arguably afb2acb2e3a3 isn't really a regression since it
essentially just reverts to the pre-XArray code, i.e. the bug was always
there, it was just temporarily masked by a worse bug.

I don't think we want to go down the path of declaring get_time_ref_counter()
broken, because that is going to result in an impossible programming model.

Ha!  We can kill two birds with one stone.  If we take vcpu->mutex before
installing the file descriptor, and hold it until online_vcpus is bumped,
userspace

Argh, so close, kvm_arch_vcpu_async_ioctl() throws a wrench in that idea.

Double argh, whether or not an ioctl is async is buried in arch code.

I still think it makes sense to grab vcpu->mutex for synchronous ioctls.  That
way there's no visible change to userspace, and we can lean on that code to
reject the async ioctls, as I can't imagine there's a practical use case for
emitting an async ioctl without first doing a synchronous ioctl.  E.g. in
addition to the below patch, plus changes to add kvm_arch_is_async_vcpu_ioctl():

	/*
	 * Some architectures have vcpu ioctls that are asynchronous to vcpu
	 * execution; mutex_lock() would break them.  Disallow asynchronous
	 * ioctls until the vCPU is fully online.  This can only happen if
	 * userspace has *never* done a synchronous ioctl, as acquiring the
	 * vCPU's mutex ensures the vCPU is online, i.e. this isn't a
	 * restriction for any practical use case.
	 */
	if (kvm_arch_is_async_vcpu_ioctl(ioctl)) {
		if (vcpu->vcpu_idx >= atomic_read(&kvm->online_vcpus))
			return -EINVAL;

		return kvm_vcpu_async_ioctl(filp, ioctl, arg);
	}
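To be clear, kvm_arch_is_async_vcpu_ioctl() doesn't exist today; the name and
the idea of a dedicated per-arch predicate are assumptions for the sake of the
sketch above.  On an architecture whose only asynchronous vCPU ioctl is
KVM_INTERRUPT, the predicate could plausibly be as dumb as:

	/*
	 * Illustrative only: mirror whatever ioctls the architecture's
	 * kvm_arch_vcpu_async_ioctl() actually handles (several architectures
	 * only handle KVM_INTERRUPT there).
	 */
	static inline bool kvm_arch_is_async_vcpu_ioctl(unsigned int ioctl)
	{
		return ioctl == KVM_INTERRUPT;
	}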
Alternatively, we could go for the super simple change and cross our fingers
that no "real" VMM emits vCPU ioctls before KVM_CREATE_VCPU returns.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..9ae9022a015f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4450,6 +4450,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
 	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
 		return -EINVAL;
 
+	if (unlikely(vcpu->vcpu_idx >= atomic_read(&kvm->online_vcpus)))
+		return -EINVAL;
+
 	/*
 	 * Some architectures have vcpu ioctls that are asynchronous to vcpu
 	 * execution; mutex_lock() would break them.

The mutex approach, sans async ioctl support:

---
 virt/kvm/kvm_main.c | 28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..0a9c390b18a3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4269,12 +4269,6 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
 
 	mutex_lock(&kvm->lock);
 
-#ifdef CONFIG_LOCKDEP
-	/* Ensure that lockdep knows vcpu->mutex is taken *inside* kvm->lock */
-	mutex_lock(&vcpu->mutex);
-	mutex_unlock(&vcpu->mutex);
-#endif
-
 	if (kvm_get_vcpu_by_id(kvm, id)) {
 		r = -EEXIST;
 		goto unlock_vcpu_destroy;
@@ -4285,15 +4279,29 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
 	if (r)
 		goto unlock_vcpu_destroy;
 
-	/* Now it's all set up, let userspace reach it */
+	/*
+	 * Now it's all set up, let userspace reach it.  Grab the vCPU's mutex
+	 * so that userspace can't invoke vCPU ioctl()s until the vCPU is fully
+	 * visible (per online_vcpus), e.g. so that KVM doesn't get tricked
+	 * into a NULL-pointer dereference because KVM thinks the _current_
+	 * vCPU doesn't exist.  As a bonus, taking vcpu->mutex ensures lockdep
+	 * knows it's taken *inside* kvm->lock.
+	 */
+	mutex_lock(&vcpu->mutex);
 	kvm_get_kvm(kvm);
 	r = create_vcpu_fd(vcpu);
 	if (r < 0)
 		goto kvm_put_xa_release;
 
+	/*
+	 * xa_store() should never fail, see xa_reserve() above.  Leak the vCPU
+	 * if the impossible happens, as userspace already has access to the
+	 * vCPU, i.e. freeing the vCPU before userspace puts its file reference
+	 * would trigger a use-after-free.
+	 */
 	if (KVM_BUG_ON(xa_store(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, 0), kvm)) {
-		r = -EINVAL;
-		goto kvm_put_xa_release;
+		mutex_unlock(&vcpu->mutex);
+		return -EINVAL;
 	}
 
 	/*
@@ -4302,6 +4310,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
 	 */
 	smp_wmb();
 	atomic_inc(&kvm->online_vcpus);
+	mutex_unlock(&vcpu->mutex);
 	mutex_unlock(&kvm->lock);
 
 	kvm_arch_vcpu_postcreate(vcpu);
@@ -4309,6 +4318,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
 	return r;
 
 kvm_put_xa_release:
+	mutex_unlock(&vcpu->mutex);
 	kvm_put_kvm_no_destroy(kvm);
 	xa_release(&kvm->vcpu_array, vcpu->vcpu_idx);
 unlock_vcpu_destroy:

base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f
-- 
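For illustration only (not part of either patch above): a hypothetical
userspace sketch of the window being discussed.  The fd-number guessing, the
choice of HV_X64_MSR_TIME_REF_COUNT, and the lack of error handling are all
assumptions made for brevity; this is not the reporter's actual a.out.

	/*
	 * Hypothetical reproducer sketch: thread B guesses the fd number that
	 * KVM_CREATE_VCPU will install and hammers KVM_GET_MSRS for
	 * HV_X64_MSR_TIME_REF_COUNT on it.  If an ioctl lands in the window
	 * between create_vcpu_fd() and xa_store(), get_time_ref_counter()
	 * sees kvm_get_vcpu(kvm, 0) == NULL.
	 */
	#include <fcntl.h>
	#include <pthread.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	#define HV_X64_MSR_TIME_REF_COUNT 0x40000020	/* not in uapi headers */

	static int vcpu_fd_guess;

	static void *hammer(void *arg)
	{
		struct {
			struct kvm_msrs hdr;
			struct kvm_msr_entry entry;
		} msrs = {
			.hdr.nmsrs = 1,
			.entry.index = HV_X64_MSR_TIME_REF_COUNT,
		};

		for (;;)	/* EBADF until the fd is installed, then it races */
			ioctl(vcpu_fd_guess, KVM_GET_MSRS, &msrs);
		return NULL;
	}

	int main(void)
	{
		int kvm_fd = open("/dev/kvm", O_RDWR);
		int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
		pthread_t t;

		vcpu_fd_guess = vm_fd + 1;	/* assumes no other fds get opened */
		pthread_create(&t, NULL, hammer, NULL);
		ioctl(vm_fd, KVM_CREATE_VCPU, 0);
		pthread_join(t, NULL);		/* never returns; run under a timeout */
		return 0;
	}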