Re: [PATCH] KVM: arm64: Mark the VM as dead for failed initializations

Marc Zyngier <maz@xxxxxxxxxx> · Sat, 26 Oct 2024 08:43:21 +0100

Hi both,

On Sat, 26 Oct 2024 06:34:23 +0100,
Oliver Upton <oliver.upton@xxxxxxxxx> wrote:
> 
> Hi Raghu,
> 
> Thanks for posting this fix.
> 
> On Fri, Oct 25, 2024 at 10:12:20PM +0000, Raghavendra Rao Ananta wrote:
> > Syzbot hit the following WARN_ON() in kvm_timer_update_irq():
> > 
> > WARNING: CPU: 0 PID: 3281 at arch/arm64/kvm/arch_timer.c:459
> > kvm_timer_update_irq+0x21c/0x394
> > Call trace:
> >   kvm_timer_update_irq+0x21c/0x394 arch/arm64/kvm/arch_timer.c:459
> >   kvm_timer_vcpu_reset+0x158/0x684 arch/arm64/kvm/arch_timer.c:968
> >   kvm_reset_vcpu+0x3b4/0x560 arch/arm64/kvm/reset.c:264
> >   kvm_vcpu_set_target arch/arm64/kvm/arm.c:1553 [inline]
> >   kvm_arch_vcpu_ioctl_vcpu_init arch/arm64/kvm/arm.c:1573 [inline]
> >   kvm_arch_vcpu_ioctl+0x112c/0x1b3c arch/arm64/kvm/arm.c:1695
> >   kvm_vcpu_ioctl+0x4ec/0xf74 virt/kvm/kvm_main.c:4658
> >   vfs_ioctl fs/ioctl.c:51 [inline]
> >   __do_sys_ioctl fs/ioctl.c:907 [inline]
> >   __se_sys_ioctl fs/ioctl.c:893 [inline]
> >   __arm64_sys_ioctl+0x108/0x184 fs/ioctl.c:893
> >   __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
> >   invoke_syscall+0x78/0x1b8 arch/arm64/kernel/syscall.c:49
> >   el0_svc_common+0xe8/0x1b0 arch/arm64/kernel/syscall.c:132
> >   do_el0_svc+0x40/0x50 arch/arm64/kernel/syscall.c:151
> >   el0_svc+0x54/0x14c arch/arm64/kernel/entry-common.c:712
> >   el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
> >   el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
> > 
> > The sequence that led to the report is when KVM_ARM_VCPU_INIT ioctl is
> > invoked after a failed first KVM_RUN. In a general sense though, since
> > kvm_arch_vcpu_run_pid_change() doesn't tear down any of the past
> > initiatializations, it's possible that the VM's state could be left
> 
> typo: initializations
> 
> > half-baked. Any upcoming ioctls could behave erroneously because of
> > this.
> 
> You may want to highlight a bit more strongly that, despite the name,
> we do a lot of late *VM* state initialization in kvm_arch_vcpu_run_pid_change().
> 
> When that goes sideways we're left with few choices besides bugging the
> VM or gracefully tearing down state, potentially w/ concurrent users.
> 
> > Since these late vCPU initializations are past the point of attributing
> > the failures to any ioctl, instead of tearing down each of the previous
> > setups, simply mark the VM as dead, giving an opportunity for the
> > userspace to close and try again.
> > 
> > Cc: <stable@xxxxxxxxxxxxxxx>
> > Reported-by: syzbot <syzkaller@xxxxxxxxxxxxxxxx>
> > Suggested-by: Oliver Upton <oliver.upton@xxxxxxxxx>
> 
> I definitely recommended this to you, so blame *me* for imposing some
> toil on you with the following.
> 
> > @@ -836,16 +836,16 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
> >  
> >  	ret = kvm_timer_enable(vcpu);
> >  	if (ret)
> > -		return ret;
> > +		goto out_err;
> >  
> >  	ret = kvm_arm_pmu_v3_enable(vcpu);
> >  	if (ret)
> > -		return ret;
> > +		goto out_err;
> >  
> >  	if (is_protected_kvm_enabled()) {
> >  		ret = pkvm_create_hyp_vm(kvm);
> >  		if (ret)
> > -			return ret;
> > +			goto out_err;
> >  	}
> >  
> >  	if (!irqchip_in_kernel(kvm)) {
> > @@ -869,6 +869,10 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
> >  	mutex_unlock(&kvm->arch.config_lock);
> >  
> >  	return ret;
> > +
> > +out_err:
> > +	kvm_vm_dead(kvm);
> > +	return ret;
> >  }
> 
> After rereading, I think we could benefit from a more distinct separation
> of late VM vs. vCPU state initialization.
> 
> Bugging the VM is a big hammer, we should probably only resort to that
> when the VM state is screwed up badly.
>
> Otherwise, for screwed up vCPU state we could uninitialize the vCPU and
> let userspace try again. An example of this is how we deal with VMs that
> run 32 bit userspace when KVM tries to hide the feature.

I tend to agree. We shouldn't make a misconfiguration fatal unless we
know for sure that the state is not recoverable (such as a screwed up
memory map).

When it comes to this particular issue, I wonder why the (maybe overly
simplistic) hack below isn't enough. The userspace_irqchip_in_use
static key brings more problems than it is worth, and the feature
is mostly nonsense anyway.

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index bf64fed9820e..c315bc1a4e9a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -74,8 +74,6 @@ enum kvm_mode kvm_get_mode(void);
 static inline enum kvm_mode kvm_get_mode(void) { return KVM_MODE_NONE; };
 #endif
 
-DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
-
 extern unsigned int __ro_after_init kvm_sve_max_vl;
 extern unsigned int __ro_after_init kvm_host_sve_max_vl;
 int __init kvm_arm_init_sve(void);
diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index 879982b1cc73..1215df590418 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -206,8 +206,7 @@ void get_timer_map(struct kvm_vcpu *vcpu, struct timer_map *map)
 
 static inline bool userspace_irqchip(struct kvm *kvm)
 {
-	return static_branch_unlikely(&userspace_irqchip_in_use) &&
-		unlikely(!irqchip_in_kernel(kvm));
+	return unlikely(!irqchip_in_kernel(kvm));
 }
 
 static void soft_timer_start(struct hrtimer *hrt, u64 ns)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 48cafb65d6ac..70ff9a20ef3a 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -69,7 +69,6 @@ DECLARE_KVM_NVHE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt);
 static bool vgic_present, kvm_arm_initialised;
 
 static DEFINE_PER_CPU(unsigned char, kvm_hyp_initialized);
-DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
 
 bool is_kvm_arm_initialised(void)
 {
@@ -503,9 +502,6 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
-	if (vcpu_has_run_once(vcpu) && unlikely(!irqchip_in_kernel(vcpu->kvm)))
-		static_branch_dec(&userspace_irqchip_in_use);
-
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
 	kvm_timer_vcpu_terminate(vcpu);
 	kvm_pmu_vcpu_destroy(vcpu);
@@ -848,14 +844,6 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
 			return ret;
 	}
 
-	if (!irqchip_in_kernel(kvm)) {
-		/*
-		 * Tell the rest of the code that there are userspace irqchip
-		 * VMs in the wild.
-		 */
-		static_branch_inc(&userspace_irqchip_in_use);
-	}
-
 	/*
 	 * Initialize traps for protected VMs.
 	 * NOTE: Move to run in EL2 directly, rather than via a hypercall, once
@@ -1077,7 +1065,7 @@ static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret)
 	 * state gets updated in kvm_timer_update_run and
 	 * kvm_pmu_update_run below).
 	 */
-	if (static_branch_unlikely(&userspace_irqchip_in_use)) {
+	if (unlikely(!irqchip_in_kernel(vcpu->kvm))) {
 		if (kvm_timer_should_notify_user(vcpu) ||
 		    kvm_pmu_should_notify_user(vcpu)) {
 			*ret = -EINTR;
@@ -1199,7 +1187,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 			vcpu->mode = OUTSIDE_GUEST_MODE;
 			isb(); /* Ensure work in x_flush_hwstate is committed */
 			kvm_pmu_sync_hwstate(vcpu);
-			if (static_branch_unlikely(&userspace_irqchip_in_use))
+			if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
 				kvm_timer_sync_user(vcpu);
 			kvm_vgic_sync_hwstate(vcpu);
 			local_irq_enable();
@@ -1245,7 +1233,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		 * we don't want vtimer interrupts to race with syncing the
 		 * timer virtual interrupt state.
 		 */
-		if (static_branch_unlikely(&userspace_irqchip_in_use))
+		if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
 			kvm_timer_sync_user(vcpu);
 
 		kvm_arch_vcpu_ctxsync_fp(vcpu);

I think this would fix the problem you're seeing without changing the
userspace view of an erroneous configuration. It would also pave the
way for the complete removal of the interrupt notification to
userspace, which I claim has no user and is just a shit idea.
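
To illustrate the reasoning, here is a minimal userspace C model (not
kernel code) of the ordering problem visible in the diff above: the old
userspace_irqchip() check also depends on a global count that is only
bumped at the end of late init, so an early failure leaves the check
false for a VM that has no in-kernel irqchip. Everything prefixed with
toy_ is a hypothetical stand-in, the plain int only approximates the
static key, and the link to the reported WARN is an assumption rather
than something stated in the report.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-VM state; not the kernel's struct kvm. */
struct toy_vm {
	bool irqchip_in_kernel;
};

/* Models the reference count behind the userspace_irqchip_in_use key. */
static int toy_userspace_irqchip_in_use;

/* Old check: global key AND per-VM state must both agree. */
static bool toy_userspace_irqchip_old(const struct toy_vm *vm)
{
	return toy_userspace_irqchip_in_use > 0 && !vm->irqchip_in_kernel;
}

/* Check after the hack: per-VM state only. */
static bool toy_userspace_irqchip_new(const struct toy_vm *vm)
{
	return !vm->irqchip_in_kernel;
}

/*
 * Models the late init path: an early failure (e.g. the timer enable
 * step) returns before the global count is incremented.
 */
static int toy_run_pid_change(struct toy_vm *vm, bool fail_early)
{
	if (fail_early)
		return -1;

	if (!vm->irqchip_in_kernel)
		toy_userspace_irqchip_in_use++;

	return 0;
}

/* Walks through the failed-first-run case described in the report. */
static void toy_demo(void)
{
	struct toy_vm vm = { .irqchip_in_kernel = false };

	assert(toy_run_pid_change(&vm, true) != 0);
	/* Old check disagrees with the VM's real configuration... */
	assert(!toy_userspace_irqchip_old(&vm));
	/* ...while the per-VM check cannot get out of sync. */
	assert(toy_userspace_irqchip_new(&vm));
}
```

In this model the per-VM check gives the right answer regardless of how
far late init got, which is why dropping the global key sidesteps the
inconsistency instead of requiring the VM to be marked dead.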

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.