On Thu, Apr 18, 2024, Kai Huang wrote:
> On 18/04/2024 11:35 am, Sean Christopherson wrote:
> > Ah, yeah.  Oh, duh.  I think the reason I didn't initially suggest late_hardware_setup()
> > is that I was assuming/hoping TDX setup could be done after kvm_x86_vendor_init().
> > E.g. in vt_init() or whatever it gets called:
> >
> > 	r = kvm_x86_vendor_init(...);
> > 	if (r)
> > 		return r;
> >
> > 	if (enable_tdx) {
> > 		r = tdx_blah_blah_blah();
> > 		if (r)
> > 			goto vendor_exit;
> > 	}
>
>
> I assume the reason you introduced the late_hardware_setup() is purely
> because you want to do:
>
>   cpu_emergency_register_virt_callback(kvm_x86_ops.emergency_enable);
>
> after
>
>   kvm_ops_update()?

No, kvm_ops_update() needs to come before kvm_x86_enable_virtualization(), as the
static_call() to hardware_enable() needs to be patched in.

Oh, and my adjust patch is broken, the code to do the compat checks should NOT
be removed; it could be removed if KVM unconditionally enabled VMX during setup,
but it needs to stay in the !TDX case.

-	for_each_online_cpu(cpu) {
-		smp_call_function_single(cpu, kvm_x86_check_cpu_compat, &r, 1);
-		if (r < 0)
-			goto out_unwind_ops;
-	}

Which is another reason to defer kvm_x86_enable_virtualization(), though to be
honest not a particularly compelling reason on its own.

> Anyway, we can also do 'enable_tdx' outside of kvm_x86_vendor_init() as
> above, given it cannot be done in hardware_setup() anyway.
>
> If we do 'enable_tdx' in late_hardware_setup(), we will need a
> kvm_x86_enable_virtualization_nolock(), but that's also not a problem to me.
>
> So which way do you prefer?
>
> Btw, with kvm_x86_virtualization_enable(), it seems the compatibility check
> is lost, which I assume is OK?

Heh, and I obviously wasn't reading ahead :-)

> Btw2, currently tdx_enable() requires cpus_read_lock() must be called prior.
> If we do unconditional tdx_cpu_enable() in vt_hardware_enable(), then with
> your proposal IIUC there's no such requirement anymore, because no task will
> be scheduled to the new CPU before it reaches CPUHP_AP_ACTIVE.

Correct.

> But now calling cpus_read_lock()/unlock() around tdx_enable() also acceptable
> to me.

No, that will deadlock as cpuhp_setup_state() does cpus_read_lock().

> > > > +int kvm_enable_virtualization(void)
> > > >  {
> > > > +	int r;
> > > > +
> > > > +	r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
> > > > +			      kvm_online_cpu, kvm_offline_cpu);
> > > > +	if (r)
> > > > +		return r;
> > > > +
> > > > +	register_syscore_ops(&kvm_syscore_ops);
> > > > +
> > > > +	/*
> > > > +	 * Manually undo virtualization enabling if the system is going down.
> > > > +	 * If userspace initiated a forced reboot, e.g. reboot -f, then it's
> > > > +	 * possible for an in-flight module load to enable virtualization
> > > > +	 * after syscore_shutdown() is called, i.e. without kvm_shutdown()
> > > > +	 * being invoked.  Note, this relies on system_state being set _before_
> > > > +	 * kvm_shutdown(), e.g. to ensure either kvm_shutdown() is invoked
> > > > +	 * or this CPU observes the impending shutdown.  Which is why KVM uses
> > > > +	 * a syscore ops hook instead of registering a dedicated reboot
> > > > +	 * notifier (the latter runs before system_state is updated).
> > > > +	 */
> > > > +	if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
> > > > +	    system_state == SYSTEM_RESTART) {
> > > > +		unregister_syscore_ops(&kvm_syscore_ops);
> > > > +		cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
> > > > +		return -EBUSY;
> > > > +	}
> > > > +
> > >
> > > Aren't we also supposed to do:
> > >
> > > 	on_each_cpu(__kvm_enable_virtualization, NULL, 1);
> > >
> > > here?
> >
> > No, cpuhp_setup_state() invokes the callback, kvm_online_cpu(), on each CPU.
> > I.e. KVM has been doing things the hard way by using cpuhp_setup_state_nocalls().
> > That's part of the complexity I would like to get rid of.
>
> Ah, right :-)
>
> Btw, why couldn't we do the 'system_state' check at the very beginning of
> this function?

We could, but we'd still need to check after, and adding a small bit of extra
complexity just to try to catch a very rare situation isn't worth it.

To prevent races, system_state needs to be checked after register_syscore_ops(),
because only once kvm_syscore_ops is registered is KVM guaranteed to get notified
of a shutdown.

And because the kvm_syscore_ops hooks disable virtualization, they should be called
after cpuhp_setup_state().  That's not strictly required, as the per-CPU
hardware_enabled flag will prevent true problems if the system enters shutdown
state before KVM reaches cpuhp_setup_state().

Hmm, but the same edge case exists in the above flow.  If the system enters
shutdown _just_ after register_syscore_ops(), KVM would see that in system_state
and do cpuhp_remove_state(), i.e. invoke kvm_offline_cpu() and thus do a double
disable (which again is benign because of hardware_enabled).

Ah, but registering syscore ops before doing cpuhp_setup_state() has another race,
and one that could be fatal.  If the system does suspend+resume before the cpuhp
hooks are registered, kvm_resume() would enable virtualization.  And then if
cpuhp_setup_state() failed, virtualization would be left enabled.

So cpuhp_setup_state() *must* come before register_syscore_ops(), and
register_syscore_ops() *must* come before the system_state check.
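
FWIW, the ordering argument can be modeled in plain userspace C.  This is only a
sketch, not kernel code: struct model, every model_* helper, the cpuhp_setup_ret
knob, and the resume_in_window flag are made up for illustration and merely stand
in for cpuhp_setup_state(), register_syscore_ops(), kvm_resume(), etc.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; none of this is kernel API. */
enum model_system_state { MODEL_RUNNING, MODEL_HALT, MODEL_POWER_OFF, MODEL_RESTART };

struct model {
	enum model_system_state system_state;
	bool cpuhp_registered;   /* models cpuhp_setup_state() having succeeded */
	bool syscore_registered; /* models register_syscore_ops() */
	bool virt_enabled;       /* models "virtualization is enabled" */
	int cpuhp_setup_ret;     /* error the mock cpuhp setup should return, 0 = success */
};

/* Mock of cpuhp_setup_state(): on success the online callback enables virt. */
static int model_cpuhp_setup(struct model *m)
{
	if (m->cpuhp_setup_ret)
		return m->cpuhp_setup_ret;
	m->cpuhp_registered = true;
	m->virt_enabled = true;
	return 0;
}

/* Mock of cpuhp_remove_state(): the offline callback disables virt. */
static void model_cpuhp_remove(struct model *m)
{
	assert(m->cpuhp_registered);
	m->cpuhp_registered = false;
	m->virt_enabled = false;
}

/* Mock of the syscore resume hook: only runs if the ops are registered. */
static void model_resume(struct model *m)
{
	if (m->syscore_registered)
		m->virt_enabled = true;
}

static bool model_shutting_down(const struct model *m)
{
	return m->system_state == MODEL_HALT ||
	       m->system_state == MODEL_POWER_OFF ||
	       m->system_state == MODEL_RESTART;
}

/* Proposed order: cpuhp setup, then syscore ops, then the system_state check. */
static int model_enable_virtualization(struct model *m)
{
	int r;

	r = model_cpuhp_setup(m);
	if (r)
		return r;

	m->syscore_registered = true;

	if (model_shutting_down(m)) {
		m->syscore_registered = false;
		model_cpuhp_remove(m);
		return -EBUSY;
	}
	return 0;
}

/*
 * Broken order: syscore ops first.  A resume in the window enables virt,
 * and a failing cpuhp setup returns without anything disabling it again.
 */
static int model_enable_wrong_order(struct model *m, bool resume_in_window)
{
	int r;

	m->syscore_registered = true;
	if (resume_in_window)
		model_resume(m);

	r = model_cpuhp_setup(m);
	if (r) {
		m->syscore_registered = false;
		return r;
	}
	return 0;
}
```

With cpuhp_setup_ret set to an error, model_enable_virtualization() returns with
virt_enabled still false, whereas model_enable_wrong_order() with a resume in the
window returns the same error but leaves virt_enabled set, which is the fatal case
described above.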