On Tue, 2024-05-07 at 09:31 -0700, Sean Christopherson wrote:
> On Thu, Apr 25, 2024, Sean Christopherson wrote:
> > Register KVM's cpuhp and syscore callbacks when enabling virtualization
> > in hardware instead of registering the callbacks during initialization,
> > and let the CPU up/down framework invoke the inner enable/disable
> > functions.  Registering the callbacks during initialization makes things
> > more complex than they need to be, as KVM needs to be very careful about
> > handling races between CPUs being onlined/offlined and hardware being
> > enabled/disabled.
> >
> > Intel TDX support will require KVM to enable virtualization during KVM
> > initialization, i.e. will add another wrinkle to things, at which point
> > sorting out the potential races with kvm_usage_count would become even
> > more complex.
> >
> > +static int hardware_enable_all(void)
> > +{
> > +	int r;
> > +
> > +	guard(mutex)(&kvm_lock);
> > +
> > +	if (kvm_usage_count++)
> > +		return 0;
> > +
> > +	r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
> > +			      kvm_online_cpu, kvm_offline_cpu);
> > +	if (r)
> > +		return r;
>
> There's a lock ordering issue here.  KVM currently takes kvm_lock inside
> cpu_hotplug_lock, but this code does the opposite.  I need to take a closer
> look at the locking, as I'm not entirely certain that the existing ordering
> is correct or ideal.

Do you mean that currently (upstream) hardware_enable_all() takes
cpus_read_lock() first and then kvm_lock?

For that code I think cpus_read_lock() must indeed be taken outside of
kvm_lock, because kvm_online_cpu() also takes kvm_lock.  Switching the order
in hardware_enable_all() can result in deadlock.  For example, when CPU 0 is
doing hardware_enable_all(), CPU 1 tries to bring up CPU 2 between CPU 0's
mutex_lock(&kvm_lock) and cpus_read_lock():

        cpu 0                          cpu 1                      cpu 2
(hardware_enable_all())            (online cpu 2)           (kvm_online_cpu())

1) mutex_lock(&kvm_lock);
                                2) cpus_write_lock();
                                   bringup cpu 2
                                                          4) mutex_lock(&kvm_lock);
3) cpus_read_lock();               ...
   mutex_unlock(&kvm_lock);     5) cpus_write_unlock();      ...
                                                          6) mutex_unlock(&kvm_lock);

In this case, the cpus_read_lock() in step 3) waits for the
cpus_write_unlock() in step 5) to complete, which in turn waits for CPU 2 to
finish kvm_online_cpu().  But kvm_online_cpu() on CPU 2 waits at step 4) for
CPU 0 to release kvm_lock, so we end up with a deadlock.

But with the code change in this patch, kvm_online_cpu() doesn't take
kvm_lock anymore, so to me it looks OK to take cpus_read_lock() inside
kvm_lock.

Btw, even in the current upstream code, IIUC the cpus_read_lock() isn't
absolutely necessary.  It was introduced to prevent hardware_enable_nolock()
from being run on the new cpu via the on_each_cpu() IPI before
kvm_online_cpu() is invoked.  But because both hardware_enable_all() and
kvm_online_cpu() grab kvm_lock, the hardware_enable_nolock() inside
kvm_online_cpu() will always wait for hardware_enable_all() to complete, so
the worst case is that hardware_enable_nolock() is called twice.  And that is
fine, because the second call basically does nothing thanks to the
@hardware_enabled per-cpu variable, as in the sketch below.
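Something like the below, a rough illustrative sketch rather than the actual
kvm_main.c code, is what I mean by the second call being a no-op:

/*
 * Rough sketch only (not the actual kvm_main.c code): the per-cpu flag
 * makes a duplicate enable on the same CPU harmless.
 */
static DEFINE_PER_CPU(bool, hardware_enabled);

static int hardware_enable_nolock(void)
{
        int ret;

        /* A second call on the same CPU bails out immediately. */
        if (__this_cpu_read(hardware_enabled))
                return 0;

        ret = kvm_arch_hardware_enable();
        if (!ret)
                __this_cpu_write(hardware_enabled, true);

        return ret;
}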
> E.g. cpu_hotplug_lock is taken when updating static keys, static calls,
> etc., which makes taking cpu_hotplug_lock outside kvm_lock dicey, as flows
> that take kvm_lock then need to be very careful to never trigger seemingly
> innocuous updates.
>
> And this lockdep splat that I've now hit twice with the current
> implementation suggests that cpu_hotplug_lock => kvm_lock is already
> unsafe/broken (I need to re-decipher the splat; I _think_ I mostly figured
> it out last week, but then forgot over the weekend).

I think if we remove the kvm_lock from kvm_online_cpu(), it's OK to take
cpus_read_lock() (i.e. call cpuhp_setup_state()) inside kvm_lock.

If so, maybe we can just have a rule that cpus_read_lock() cannot be held
outside of kvm_lock, i.e. kvm_lock must always be the outer lock, as sketched
below.
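To make that rule concrete, here is a rough sketch of the nesting after this
patch (illustrative only, assuming the hardware_enable_nolock() sketch above,
and with the kvm_usage_count error unwinding omitted):

/*
 * Illustrative sketch of the proposed nesting: kvm_lock is always the
 * outer lock, and cpus_read_lock() only nests inside it (taken internally
 * by cpuhp_setup_state()).  Error unwinding of kvm_usage_count is omitted.
 */
static int hardware_enable_all(void)
{
        guard(mutex)(&kvm_lock);                /* outer lock */

        if (kvm_usage_count++)
                return 0;

        /* cpuhp_setup_state() takes cpus_read_lock() internally. */
        return cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
                                 kvm_online_cpu, kvm_offline_cpu);
}

static int kvm_online_cpu(unsigned int cpu)
{
        /*
         * No mutex_lock(&kvm_lock) here, so step 4) in the scenario above
         * can no longer block behind CPU 0 holding kvm_lock.
         */
        return hardware_enable_nolock();
}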