On 19/04/2024 2:30 am, Sean Christopherson wrote:
On Thu, Apr 18, 2024, Kai Huang wrote:
On 18/04/2024 11:35 am, Sean Christopherson wrote:
Ah, yeah. Oh, duh. I think the reason I didn't initially suggest late_hardware_setup()
is that I was assuming/hoping TDX setup could be done after kvm_x86_vendor_exit().
E.g. in vt_init() or whatever it gets called:
r = kvm_x86_vendor_exit(...);
if (r)
return r;
if (enable_tdx) {
r = tdx_blah_blah_blah();
if (r)
goto vendor_exit;
}
I assume the reason you introduced the late_hardware_setup() is purely
because you want to do:
cpu_emergency_register_virt_callback(kvm_x86_ops.emergency_enable);
after
kvm_ops_update()?
No, kvm_ops_update() needs to come before kvm_x86_enable_virtualization(), as the
static_call() to hardware_enable() needs to be patched in.
Right. I was talking about that the reason you introduced the
late_hardware_setup() was because we need to do
kvm_x86_virtualization_enabled() and the above
cpu_emergency_register_virt_callback() after kvm_ops_update().
Oh, and my adjust patch is broken, the code to do the compat checks should NOT
be removed; it could be removed if KVM unconditionally enabled VMX during setup,
but it needs to stay in the !TDX case.
Right.
- for_each_online_cpu(cpu) {
- smp_call_function_single(cpu, kvm_x86_check_cpu_compat, &r, 1);
- if (r < 0)
- goto out_unwind_ops;
- }
Which is another reason to defer kvm_x86_enable_virtualization(), though to be
honest not a particularly compelling reason on its own.
Anyway, we can also do 'enable_tdx' outside of kvm_x86_vendor_init() as
above, given it cannot be done in hardware_setup() anyway.
If we do 'enable_tdx' in late_hardware_setup(), we will need a
kvm_x86_enable_virtualization_nolock(), but that's also not a problem to me.
So which way do you prefer?
Btw, with kvm_x86_virtualization_enable(), it seems the compatibility check
is lost, which I assume is OK?
Heh, and I obviously wasn't reading ahead :-)
Btw2, currently tdx_enable() requires cpus_read_lock() must be called prior.
If we do unconditional tdx_cpu_enable() in vt_hardware_enable(), then with
your proposal IIUC there's no such requirement anymore, because no task will
be scheduled to the new CPU before it reaches CPUHP_AP_ACTIVE.
Correct.
But now calling cpus_read_lock()/unlock() around tdx_enable() also acceptable
to me.
No, that will deadlock as cpuhp_setup_state() does cpus_read_lock().
Right, but it takes cpus_read_lock()/unlock() internally. I was talking
about:
if (enable_tdx) {
kvm_x86_virtualization_enable();
/*
* Unfortunately currently tdx_enable() internally has
* lockdep_assert_cpus_held().
*/
cpus_read_lock();
tdx_enable();
cpus_read_unlock();
}
+int kvm_enable_virtualization(void)
{
+ int r;
+
+ r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
+ kvm_online_cpu, kvm_offline_cpu);
+ if (r)
+ return r;
+
+ register_syscore_ops(&kvm_syscore_ops);
+
+ /*
+ * Manually undo virtualization enabling if the system is going down.
+ * If userspace initiated a forced reboot, e.g. reboot -f, then it's
+ * possible for an in-flight module load to enable virtualization
+ * after syscore_shutdown() is called, i.e. without kvm_shutdown()
+ * being invoked. Note, this relies on system_state being set _before_
+ * kvm_shutdown(), e.g. to ensure either kvm_shutdown() is invoked
+ * or this CPU observes the impedning shutdown. Which is why KVM uses
+ * a syscore ops hook instead of registering a dedicated reboot
+ * notifier (the latter runs before system_state is updated).
+ */
+ if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
+ system_state == SYSTEM_RESTART) {
+ unregister_syscore_ops(&kvm_syscore_ops);
+ cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
+ return -EBUSY;
+ }
+
Aren't we also supposed to do:
on_each_cpu(__kvm_enable_virtualization, NULL, 1);
here?
No, cpuhp_setup_state() invokes the callback, kvm_online_cpu(), on each CPU.
I.e. KVM has been doing things the hard way by using cpuhp_setup_state_nocalls().
That's part of the complexity I would like to get rid of.
Ah, right :-)
Btw, why couldn't we do the 'system_state' check at the very beginning of
this function?
We could, but we'd still need to check after, and adding a small bit of extra
complexity just to try to catch a very rare situation isn't worth it.
To prevent races, system_state needs to be check after register_syscore_ops(),
because only once kvm_syscore_ops is registered is KVM guaranteed to get notified
of a shutdown. >
And because the kvm_syscore_ops hooks disable virtualization, they should be called
after cpuhp_setup_state(). That's not strictly required, as the per-CPU
hardware_enabled flag will prevent true problems if the system enter shutdown
state before KVM reaches cpuhp_setup_state().
Hmm, but the same edge cases exists in the above flow. If the system enters
shutdown _just_ after register_syscore_ops(), KVM would see that in system_state
and do cpuhp_remove_state(), i.e. invoke kvm_offline_cpu() and thus do a double
disable (which again is benign because of hardware_enabled).
Ah, but registering syscore ops before doing cpuhp_setup_state() has another race,
and one that could be fatal. If the system does suspend+resume before the cpuhup
hooks are registered, kvm_resume() would enable virtualization. And then if
cpuhp_setup_state() failed, virtualization would be left enabled.
So cpuhp_setup_state() *must* come before register_syscore_ops(), and
register_syscore_ops() *must* come before the system_state check.
OK. I guess I have to double check here to completely understand the
races. :-)
So I think we have consensus to go with the approach that shows in your
second diff -- that is to always enable virtualization during module
loading for all other ARCHs other than x86, for which we only always
enables virtualization during module loading for TDX.
Then how about "do kvm_x86_virtualization_enable() within
late_hardware_setup() in kvm_x86_vendor_init()" vs "do
kvm_x86_virtualization_enable() in TDX-specific code after
kvm_x86_vendor_init()"?
Which do you prefer?