Re: [PATCH v19 023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module

Sean Christopherson <seanjc@xxxxxxxxxx> · Thu, 18 Apr 2024 07:30:46 -0700

On Thu, Apr 18, 2024, Kai Huang wrote:
> On 18/04/2024 11:35 am, Sean Christopherson wrote:
> > Ah, yeah.  Oh, duh.  I think the reason I didn't initially suggest late_hardware_setup()
> > is that I was assuming/hoping TDX setup could be done after kvm_x86_vendor_exit().
> > E.g. in vt_init() or whatever it gets called:
> > 
> > 	r = kvm_x86_vendor_exit(...);
> > 	if (r)
> > 		return r;
> > 
> > 	if (enable_tdx) {
> > 		r = tdx_blah_blah_blah();
> > 		if (r)
> > 			goto vendor_exit;
> > 	}
> 
> 
> I assume the reason you introduced the late_hardware_setup() is purely
> because you want to do:
> 
>   cpu_emergency_register_virt_callback(kvm_x86_ops.emergency_enable);
> 
> after
> 
>   kvm_ops_update()?

No, kvm_ops_update() needs to come before kvm_x86_enable_virtualization(), as the
static_call() to hardware_enable() needs to be patched in.

Oh, and my adjust patch is broken, the code to do the compat checks should NOT
be removed; it could be removed if KVM unconditionally enabled VMX during setup,
but it needs to stay in the !TDX case.

-       for_each_online_cpu(cpu) {
-               smp_call_function_single(cpu, kvm_x86_check_cpu_compat, &r, 1);
-               if (r < 0)
-                       goto out_unwind_ops;
-       }

Which is another reason to defer kvm_x86_enable_virtualization(), though to be
honest not a particularly compelling reason on its own.

> Anyway, we can also do 'enable_tdx' outside of kvm_x86_vendor_init() as
> above, given it cannot be done in hardware_setup() anyway.
> 
> If we do 'enable_tdx' in late_hardware_setup(), we will need a
> kvm_x86_enable_virtualization_nolock(), but that's also not a problem to me.
> 
> So which way do you prefer?
> 
> Btw, with kvm_x86_virtualization_enable(), it seems the compatibility check
> is lost, which I assume is OK?

Heh, and I obviously wasn't reading ahead :-)

> Btw2, currently tdx_enable() requires cpus_read_lock() must be called prior.
> If we do unconditional tdx_cpu_enable() in vt_hardware_enable(), then with
> your proposal IIUC there's no such requirement anymore, because no task will
> be scheduled to the new CPU before it reaches CPUHP_AP_ACTIVE.

Correct.

> But now calling cpus_read_lock()/unlock() around tdx_enable() also acceptable
> to me.

No, that will deadlock as cpuhp_setup_state() does cpus_read_lock().

> > > > +int kvm_enable_virtualization(void)
> > > >    {
> > > > +	int r;
> > > > +
> > > > +	r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online",
> > > > +			      kvm_online_cpu, kvm_offline_cpu);
> > > > +	if (r)
> > > > +		return r;
> > > > +
> > > > +	register_syscore_ops(&kvm_syscore_ops);
> > > > +
> > > > +	/*
> > > > +	 * Manually undo virtualization enabling if the system is going down.
> > > > +	 * If userspace initiated a forced reboot, e.g. reboot -f, then it's
> > > > +	 * possible for an in-flight module load to enable virtualization
> > > > +	 * after syscore_shutdown() is called, i.e. without kvm_shutdown()
> > > > +	 * being invoked.  Note, this relies on system_state being set _before_
> > > > +	 * kvm_shutdown(), e.g. to ensure either kvm_shutdown() is invoked
> > > > +	 * or this CPU observes the impedning shutdown.  Which is why KVM uses
> > > > +	 * a syscore ops hook instead of registering a dedicated reboot
> > > > +	 * notifier (the latter runs before system_state is updated).
> > > > +	 */
> > > > +	if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF ||
> > > > +	    system_state == SYSTEM_RESTART) {
> > > > +		unregister_syscore_ops(&kvm_syscore_ops);
> > > > +		cpuhp_remove_state(CPUHP_AP_KVM_ONLINE);
> > > > +		return -EBUSY;
> > > > +	}
> > > > +
> > > 
> > > Aren't we also supposed to do:
> > > 
> > > 	on_each_cpu(__kvm_enable_virtualization, NULL, 1);
> > > 
> > > here?
> > 
> > No, cpuhp_setup_state() invokes the callback, kvm_online_cpu(), on each CPU.
> > I.e. KVM has been doing things the hard way by using cpuhp_setup_state_nocalls().
> > That's part of the complexity I would like to get rid of.
> 
> Ah, right :-)
> 
> Btw, why couldn't we do the 'system_state' check at the very beginning of
> this function?

We could, but we'd still need to check after, and adding a small bit of extra
complexity just to try to catch a very rare situation isn't worth it.

To prevent races, system_state needs to be check after register_syscore_ops(),
because only once kvm_syscore_ops is registered is KVM guaranteed to get notified
of a shutdown.

And because the kvm_syscore_ops hooks disable virtualization, they should be called
after cpuhp_setup_state().  That's not strictly required, as the per-CPU
hardware_enabled flag will prevent true problems if the system enter shutdown
state before KVM reaches cpuhp_setup_state().

Hmm, but the same edge cases exists in the above flow.  If the system enters
shutdown _just_ after register_syscore_ops(), KVM would see that in system_state
and do cpuhp_remove_state(), i.e. invoke kvm_offline_cpu() and thus do a double
disable (which again is benign because of hardware_enabled).

Ah, but registering syscore ops before doing cpuhp_setup_state() has another race,
and one that could be fatal.  If the system does suspend+resume before the cpuhup
hooks are registered, kvm_resume() would enable virtualization.  And then if
cpuhp_setup_state() failed, virtualization would be left enabled.

So cpuhp_setup_state() *must* come before register_syscore_ops(), and
register_syscore_ops() *must* come before the system_state check.