On Tue, 2024-05-07 at 09:31 -0700, Sean Christopherson wrote:
> On Thu, Apr 25, 2024, Sean Christopherson wrote:
> > Register KVM's cpuhp and syscore callbacks when enabling virtualization
> > in hardware instead of registering the callbacks during initialization,
> > and let the CPU up/down framework invoke the inner enable/disable
> > functions.  Registering the callbacks during initialization makes things
> > more complex than they need to be, as KVM needs to be very careful about
> > handling races between CPUs being onlined/offlined and hardware being
> > enabled/disabled.
> >
> > Intel TDX support will require KVM to enable virtualization during KVM
> > initialization, i.e. will add another wrinkle to things, at which point
> > sorting out the potential races with kvm_usage_count would become even
> > more complex.
> >
> > +static int hardware_enable_all(void)
> > +{
> > +	int r;
> > +
> > +	guard(mutex)(&kvm_lock);
> > +
> > +	if (kvm_usage_count++)
> > +		return 0;
> > +
> > +	r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
> > +			      kvm_online_cpu, kvm_offline_cpu);
> > +	if (r)
> > +		return r;
>
> There's a lock ordering issue here.  KVM currently takes kvm_lock inside
> cpu_hotplug_lock, but this code does the opposite.  I need to take a closer
> look at the locking, as I'm not entirely certain that the existing ordering
> is correct or ideal.

Do you mean that currently (upstream) hardware_enable_all() takes
cpus_read_lock() first and then kvm_lock?

For that code I think cpus_read_lock() must indeed be taken outside of
kvm_lock, because kvm_online_cpu() also takes kvm_lock.  Switching the order
in hardware_enable_all() can result in deadlock.  For example, when CPU 0 is
doing hardware_enable_all(), CPU 1 tries to bring up CPU 2 between CPU 0's
mutex_lock(&kvm_lock) and cpus_read_lock():

        cpu 0                          cpu 1                      cpu 2
(hardware_enable_all())            (online cpu 2)           (kvm_online_cpu())

1) mutex_lock(&kvm_lock);
                                2) cpus_write_lock();
                                   bringup cpu 2
                                                          4) mutex_lock(&kvm_lock);
3) cpus_read_lock();               ...
   mutex_unlock(&kvm_lock);     5) cpus_write_unlock();      ...
                                                          6) mutex_unlock(&kvm_lock);

In this case, the cpus_read_lock() in step 3) waits for the
cpus_write_unlock() in step 5) to complete, which in turn waits for CPU 2 to
finish kvm_online_cpu().  But kvm_online_cpu() on CPU 2 waits at step 4) for
CPU 0 to release kvm_lock, so we end up with a deadlock.

But with the code change in this patch, kvm_online_cpu() doesn't take
kvm_lock anymore, so to me it looks OK to take cpus_read_lock() inside
kvm_lock.

Btw, even in the current upstream code, IIUC the cpus_read_lock() isn't
absolutely necessary.  It was introduced to prevent hardware_enable_nolock()
from being run on the new cpu via the on_each_cpu() IPI before
kvm_online_cpu() is invoked.  But because both hardware_enable_all() and
kvm_online_cpu() grab kvm_lock, the hardware_enable_nolock() inside
kvm_online_cpu() will always wait for hardware_enable_all() to complete, so
the worst case is that hardware_enable_nolock() is called twice.  And that is
fine, because the second call basically does nothing thanks to the
@hardware_enabled per-cpu variable, as in the sketch below.
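Something like the below, a rough illustrative sketch rather than the actual
kvm_main.c code, is what I mean by the second call being a no-op:

/*
 * Rough sketch only (not the actual kvm_main.c code): the per-cpu flag
 * makes a duplicate enable on the same CPU harmless.
 */
static DEFINE_PER_CPU(bool, hardware_enabled);

static int hardware_enable_nolock(void)
{
        int ret;

        /* A second call on the same CPU bails out immediately. */
        if (__this_cpu_read(hardware_enabled))
                return 0;

        ret = kvm_arch_hardware_enable();
        if (!ret)
                __this_cpu_write(hardware_enabled, true);

        return ret;
}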
> E.g. cpu_hotplug_lock is taken when updating static keys, static calls,
> etc., which makes taking cpu_hotplug_lock outside kvm_lock dicey, as flows
> that take kvm_lock then need to be very careful to never trigger seemingly
> innocuous updates.
>
> And this lockdep splat that I've now hit twice with the current
> implementation suggests that cpu_hotplug_lock => kvm_lock is already
> unsafe/broken (I need to re-decipher the splat; I _think_ I mostly figured
> it out last week, but then forgot over the weekend).

I think if we remove the kvm_lock from kvm_online_cpu(), it's OK to take
cpus_read_lock() (i.e. call cpuhp_setup_state()) inside kvm_lock.

If so, maybe we can just have a rule that cpus_read_lock() cannot be held
outside of kvm_lock, i.e. kvm_lock must always be the outer lock, as sketched
below.
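To make that rule concrete, here is a rough sketch of the nesting after this
patch (illustrative only, assuming the hardware_enable_nolock() sketch above,
and with the kvm_usage_count error unwinding omitted):

/*
 * Illustrative sketch of the proposed nesting: kvm_lock is always the
 * outer lock, and cpus_read_lock() only nests inside it (taken internally
 * by cpuhp_setup_state()).  Error unwinding of kvm_usage_count is omitted.
 */
static int hardware_enable_all(void)
{
        guard(mutex)(&kvm_lock);                /* outer lock */

        if (kvm_usage_count++)
                return 0;

        /* cpuhp_setup_state() takes cpus_read_lock() internally. */
        return cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
                                 kvm_online_cpu, kvm_offline_cpu);
}

static int kvm_online_cpu(unsigned int cpu)
{
        /*
         * No mutex_lock(&kvm_lock) here, so step 4) in the scenario above
         * can no longer block behind CPU 0 holding kvm_lock.
         */
        return hardware_enable_nolock();
}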