On Fri, 2024-04-19 at 10:23 -0700, Sean Christopherson wrote:
> On Fri, Apr 19, 2024, Kai Huang wrote:
> > On 19/04/2024 2:30 am, Sean Christopherson wrote:
> > > No, that will deadlock as cpuhp_setup_state() does cpus_read_lock().
> > 
> > Right, but it takes cpus_read_lock()/unlock() internally.  I was talking
> > about:
> > 
> > 	if (enable_tdx) {
> > 		kvm_x86_virtualization_enable();
> > 
> > 		/*
> > 		 * Unfortunately currently tdx_enable() internally has
> > 		 * lockdep_assert_cpus_held().
> > 		 */
> > 		cpus_read_lock();
> > 		tdx_enable();
> > 		cpus_read_unlock();
> > 	}
> 
> Ah.  Just have tdx_enable() do cpus_read_lock(), I suspect/assume the current
> implementation was purely done in anticipation of KVM "needing" to do
> tdx_enable() while holding cpu_hotplug_lock.

Yeah.  It was implemented based on the assumption that KVM would do the below
sequence in hardware_setup():

	cpus_read_lock();
	on_each_cpu(vmxon_and_tdx_cpu_enable, NULL, 1);
	tdx_enable();
	on_each_cpu(vmxoff, NULL, 1);
	cpus_read_unlock();

> And tdx_enable() should also do its best to verify that the caller is
> post-VMXON:
> 
> 	if (WARN_ON_ONCE(!(__read_cr4() & X86_CR4_VMXE)))
> 		return -EINVAL;

This won't be helpful, or at least isn't sufficient.  tdx_enable() does
SEAMCALLs on all online CPUs, so checking that "the caller is post-VMXON"
isn't enough.  It needs to check that all online CPUs are post-VMXON with
tdx_cpu_enable() having been done.

I didn't add such a check because it's not mandatory, i.e., a later SEAMCALL
will catch such a violation.

Btw, I noticed there's another problem: currently tdx_cpu_enable() actually
requires IRQs to be disabled.  Again, it was implemented based on the
assumption that it would be invoked via both on_each_cpu() and
kvm_online_cpu().

It was also implemented with the consideration that it could be called by
multiple in-kernel TDX users in parallel, via both SMP call and in normal
context, so it simply requires the caller to make sure it is called with
IRQs disabled so that it is IRQ safe (it uses a percpu variable to track
whether TDH.SYS.LP.INIT has been done for the local cpu, similar to the
hardware_enabled percpu variable).
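The pattern is roughly like below.  This is only a minimal sketch to
illustrate the IRQ-safety requirement; tdx_lp_initialized and
tdx_seamcall_lp_init() are made-up names for illustration, not the real
implementation:

	static DEFINE_PER_CPU(bool, tdx_lp_initialized);

	int tdx_cpu_enable(void)
	{
		int ret;

		/*
		 * The caller must disable IRQs so that checking and setting
		 * the percpu flag cannot race with a concurrent SMP-call
		 * invocation on this CPU, i.e. the function is IRQ safe.
		 */
		lockdep_assert_irqs_disabled();

		if (__this_cpu_read(tdx_lp_initialized))
			return 0;

		/* Do TDH.SYS.LP.INIT on this CPU (hypothetical helper). */
		ret = tdx_seamcall_lp_init();
		if (ret)
			return ret;

		__this_cpu_write(tdx_lp_initialized, true);
		return 0;
	}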
> > > > Btw, why couldn't we do the 'system_state' check at the very beginning
> > > > of this function?
> > > 
> > > We could, but we'd still need to check after, and adding a small bit of
> > > extra complexity just to try to catch a very rare situation isn't worth
> > > it.
> > > 
> > > To prevent races, system_state needs to be checked after
> > > register_syscore_ops(), because only once kvm_syscore_ops is registered
> > > is KVM guaranteed to get notified of a shutdown.
> > > 
> > > And because the kvm_syscore_ops hooks disable virtualization, they should
> > > be called after cpuhp_setup_state().  That's not strictly required, as
> > > the per-CPU hardware_enabled flag will prevent true problems if the
> > > system enters shutdown state before KVM reaches cpuhp_setup_state().
> > > 
> > > Hmm, but the same edge case exists in the above flow.  If the system
> > > enters shutdown _just_ after register_syscore_ops(), KVM would see that
> > > in system_state and do cpuhp_remove_state(), i.e. invoke
> > > kvm_offline_cpu() and thus do a double disable (which again is benign
> > > because of hardware_enabled).
> > > 
> > > Ah, but registering syscore ops before doing cpuhp_setup_state() has
> > > another race, and one that could be fatal.  If the system does
> > > suspend+resume before the cpuhp hooks are registered, kvm_resume() would
> > > enable virtualization.  And then if cpuhp_setup_state() failed,
> > > virtualization would be left enabled.
> > > 
> > > So cpuhp_setup_state() *must* come before register_syscore_ops(), and
> > > register_syscore_ops() *must* come before the system_state check.
> > 
> > OK.  I guess I have to double check here to completely understand the
> > races. :-)
> > 
> > So I think we have consensus to go with the approach shown in your second
> > diff -- that is, to always enable virtualization during module loading for
> > all ARCHs other than x86, for which we always enable virtualization during
> > module loading only when TDX is enabled.
> 
> Assuming the other arch maintainers are ok with that approach.  If waiting
> until a VM is created is desirable for other architectures, then we'll need
> to figure out a plan B.  E.g. KVM arm64 doesn't support being built as a
> module, so enabling hardware during initialization would mean virtualization
> is enabled for any kernel that is built with CONFIG_KVM=y.
> 
> Actually, duh.  There's absolutely no reason to force other architectures to
> choose when to enable virtualization.  As evidenced by the massaging to have
> x86 keep enabling virtualization on-demand for !TDX, the cleanups don't come
> from enabling virtualization during module load, they come from registering
> cpuhp and syscore ops when virtualization is enabled.
> 
> I.e. we can keep kvm_usage_count in common code, and just do exactly what I
> proposed for kvm_x86_enable_virtualization().

If so, then it looks like this is basically changing
"cpuhp_setup_state_nocalls() + on_each_cpu()" to "cpuhp_setup_state()", and
moving it, along with register_syscore_ops(), to hardware_enable_all()?

> I have patches to do this, and initial testing suggests they aren't wildly
> broken.  I'll post them soon-ish, assuming nothing pops up in testing.  They
> are clean enough that they can land in advance of TDX, e.g. in kvm-coco-queue
> even before other architectures verify I didn't break them.

Good to know.  I'll do more TDX testing with them after you send them out.

> > Then how about "do kvm_x86_virtualization_enable() within
> > late_hardware_setup() in kvm_x86_vendor_init()" vs "do
> > kvm_x86_virtualization_enable() in TDX-specific code after
> > kvm_x86_vendor_init()"?
> > 
> > Which do you prefer?
> 
> The latter, assuming it doesn't make the TDX code more complex than it needs
> to be.  The fewer kvm_x86_ops hooks, the better.

Agreed.
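P.S. Just to confirm my understanding of the ordering requirements above,
hardware_enable_all() would then look roughly like below.  This is only a
sketch based on my reading of this thread, not your actual patches, so the
details (e.g., the lock used and the exact system_state check) may differ:

	static int hardware_enable_all(void)
	{
		int r;

		mutex_lock(&kvm_lock);

		if (kvm_usage_count++) {
			r = 0;
			goto out;
		}

		/*
		 * cpuhp_setup_state() registers kvm_online_cpu() /
		 * kvm_offline_cpu() and invokes the online callback on all
		 * present CPUs (it takes cpus_read_lock() internally).  It
		 * must come first: if it fails, virtualization is disabled
		 * everywhere and no syscore ops have been registered yet,
		 * so kvm_resume() can't leave virtualization enabled.
		 */
		r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
				      kvm_online_cpu, kvm_offline_cpu);
		if (r) {
			kvm_usage_count--;
			goto out;
		}

		/*
		 * Only once kvm_syscore_ops is registered is KVM guaranteed
		 * to be notified of a shutdown, so the system_state check
		 * must come after this; checking any earlier would race.
		 */
		register_syscore_ops(&kvm_syscore_ops);

		/*
		 * Bail if the system is already going down.  A double
		 * disable via kvm_offline_cpu() here is benign thanks to
		 * the per-CPU hardware_enabled flag.
		 */
		if (system_state >= SYSTEM_HALT) {
			unregister_syscore_ops(&kvm_syscore_ops);
			cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
			kvm_usage_count--;
			r = -EBUSY;
		}
	out:
		mutex_unlock(&kvm_lock);
		return r;
	}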