On Fri, Apr 19, 2024, Kai Huang wrote: > On 19/04/2024 2:30 am, Sean Christopherson wrote: > > No, that will deadlock as cpuhp_setup_state() does cpus_read_lock(). > > Right, but it takes cpus_read_lock()/unlock() internally. I was talking > about: > > if (enable_tdx) { > kvm_x86_virtualization_enable(); > > /* > * Unfortunately currently tdx_enable() internally has > * lockdep_assert_cpus_held(). > */ > cpus_read_lock(); > tdx_enable(); > cpus_read_unlock(); > } Ah. Just have tdx_enable() do cpus_read_lock(), I suspect/assume the current implemention was purely done in anticipation of KVM "needing" to do tdx_enable() while holding cpu_hotplug_lock. And tdx_enable() should also do its best to verify that the caller is post-VMXON: if (WARN_ON_ONCE(!(__read_cr4() & X86_CR4_VMXE))) return -EINVAL; > > > Btw, why couldn't we do the 'system_state' check at the very beginning of > > > this function? > > > > We could, but we'd still need to check after, and adding a small bit of extra > > complexity just to try to catch a very rare situation isn't worth it. > > > > To prevent races, system_state needs to be check after register_syscore_ops(), > > because only once kvm_syscore_ops is registered is KVM guaranteed to get notified > > of a shutdown. > > > And because the kvm_syscore_ops hooks disable virtualization, they should be called > > after cpuhp_setup_state(). That's not strictly required, as the per-CPU > > hardware_enabled flag will prevent true problems if the system enter shutdown > > state before KVM reaches cpuhp_setup_state(). > > > > Hmm, but the same edge cases exists in the above flow. If the system enters > > shutdown _just_ after register_syscore_ops(), KVM would see that in system_state > > and do cpuhp_remove_state(), i.e. invoke kvm_offline_cpu() and thus do a double > > disable (which again is benign because of hardware_enabled). > > > > Ah, but registering syscore ops before doing cpuhp_setup_state() has another race, > > and one that could be fatal. If the system does suspend+resume before the cpuhup > > hooks are registered, kvm_resume() would enable virtualization. And then if > > cpuhp_setup_state() failed, virtualization would be left enabled. > > > > So cpuhp_setup_state() *must* come before register_syscore_ops(), and > > register_syscore_ops() *must* come before the system_state check. > > OK. I guess I have to double check here to completely understand the races. > :-) > > So I think we have consensus to go with the approach that shows in your > second diff -- that is to always enable virtualization during module loading > for all other ARCHs other than x86, for which we only always enables > virtualization during module loading for TDX. Assuming the other arch maintainers are ok with that approach. If waiting until a VM is created is desirable for other architectures, then we'll need to figure out a plan b. E.g. KVM arm64 doesn't support being built as a module, so enabling hardware during initialization would mean virtualization is enabled for any kernel that is built with CONFIG_KVM=y. Actually, duh. There's absolutely no reason to force other architectures to choose when to enable virtualization. As evidenced by the massaging to have x86 keep enabling virtualization on-demand for !TDX, the cleanups don't come from enabling virtualization during module load, they come from registering cpuup and syscore ops when virtualization is enabled. I.e. we can keep kvm_usage_count in common code, and just do exactly what I proposed for kvm_x86_enable_virtualization(). I have patches to do this, and initial testing suggests they aren't wildly broken. I'll post them soon-ish, assuming nothing pops up in testing. They are clean enough that they can land in advance of TDX, e.g. in kvm-coco-queue even before other architectures verify I didn't break them. > Then how about "do kvm_x86_virtualization_enable() within > late_hardware_setup() in kvm_x86_vendor_init()" vs "do > kvm_x86_virtualization_enable() in TDX-specific code after > kvm_x86_vendor_init()"? > > Which do you prefer? The latter, assuming it doesn't make the TDX code more complex than it needs to be. The fewer kvm_x86_ops hooks, the better.