Re: [PATCH v8 003/103] KVM: Refactor CPU compatibility check on module initialization

"Huang, Kai" <kai.huang@xxxxxxxxx> · Fri, 12 Aug 2022 11:35:29 +0000

On Thu, 2022-08-11 at 17:39 +0000, Sean Christopherson wrote:
> +Will (for arm crud)
> 
> On Thu, Aug 11, 2022, Huang, Kai wrote:
> > First of all, I think the patch title can be improved.  "refactor CPU
> > compatibility check on module initialization" isn't the purpose of this patch. 
> > It is just a bonus.  The title should reflect the main purpose (or behaviour) of
> > this patch:
> > 
> > 	KVM: Temporarily enable hardware on all cpus during module loading time
> 
> ...
> 
> > > +	/* hardware_enable_nolock() checks CPU compatibility on each CPU. */
> > > +	r = hardware_enable_all();
> > > +	if (r)
> > > +		goto out_free_2;
> > > +	/*
> > > +	 * Arch specific initialization that requires to enable virtualization
> > > +	 * feature.  e.g. TDX module initialization requires VMXON on all
> > > +	 * present CPUs.
> > > +	 */
> > > +	kvm_arch_post_hardware_enable_setup(opaque);
> > > +	/*
> > > +	 * Make hardware disabled after the KVM module initialization.  KVM
> > > +	 * enables hardware when the first KVM VM is created and disables
> > > +	 * hardware when the last KVM VM is destroyed.  When no KVM VM is
> > > +	 * running, hardware is disabled.  Keep that semantics.
> > > +	 */
> > 
> > Except the first sentence, the remaining sentences are more like changelog
> > material.  Perhaps just say something below to be more specific on the purpose:
> > 
> > 	/*
> > 	 * Disable hardware on all cpus so that out-of-tree drivers which
> > 	 * also use hardware-assisted virtualization (such as virtualbox
> > 	 * kernel module) can still be loaded when KVM is loaded.
> > 	 */
> > 
> > > +	hardware_disable_all();
> > >  
> > >  	r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
> > >  				      kvm_starting_cpu, kvm_dying_cpu);
> 
> I've been poking at the "hardware enable" code this week for other reasons, and
> have come to the conclusion that the current implementation is a mess.

Thanks for the lengthy reply :)

First of all, to clarify, I guess by "current implementation" you mean the
current upstream KVM code, but not this particular patch? :)

> 
> x86 overloads "hardware enable" to do three different things:
> 
>   1. actually enable hardware
>   2. snapshot per-CPU MSR value for user-return MSRs
>   3. handle unstable TSC _for existing VMs_ on suspend+resume and/or CPU hotplug
> 
> #2 and #3 have nothing to do with enabling hardware, kvm_arch_hardware_enable() just
> so happens to be called in a superset of what is needed for dealing with unstable TSCs,
> and AFAICT the user-return MSRs is simply a historical wart.  The user-return MSRs
> code is subtly very, very nasty, as it means that KVM snapshots MSRs from IRQ context,
> e.g. if an out-of-tree module is running VMs, the IRQ can interrupt the _guest_ and
> cause KVM to snapshot guest registers.  VMX and SVM kinda sorta guard against this
> by refusing to load if VMX/SVM are already enabled, but it's not foolproof.
> 
> Eww, and #3 is broken.  If CPU (un)hotplug collides with kvm_destroy_vm() or
> kvm_create_vm(), kvm_arch_hardware_enable() could explode due to vm_list being
> modified while it's being walked.

Agreed.

> 
> Of course, that path is broken for other reasons too, e.g. needs to prevent CPUs
> from going on/off-line when KVM is enabling hardware.
> https://lore.kernel.org/all/20220216031528.92558-7-chao.gao@xxxxxxxxx

If I read correctly, the problem described in the above link only becomes real
after CPUHP_AP_KVM_STARTING is moved from the STARTING section to the ONLINE
section, which hasn't been done in current upstream KVM.  Currently,
CPUHP_AP_KVM_STARTING is still in the STARTING section, so its callback is
guaranteed to have run before start_secondary() sets the CPU in the online CPU
mask.
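
To make the ordering argument concrete, here is a toy userspace model, not
kernel code.  Every demo_ name is made up for illustration, and the "sections"
are reduced to two steps around the point where the CPU becomes online:

```c
#include <stdbool.h>

/* Where the (hypothetical) KVM hotplug callback is registered. */
enum demo_section { DEMO_SECTION_STARTING, DEMO_SECTION_ONLINE };

struct demo_cpu {
	bool online;               /* models the CPU's bit in the online mask */
	bool kvm_cb_done;          /* models the KVM callback having run */
	bool online_before_kvm_cb; /* set if the CPU was ever online without it */
};

/* Bring one CPU up, running the KVM callback in the given section. */
void demo_bringup(struct demo_cpu *cpu, enum demo_section kvm_section)
{
	/* STARTING section: runs on the new CPU before it is marked online. */
	if (kvm_section == DEMO_SECTION_STARTING)
		cpu->kvm_cb_done = true;

	/* The CPU becomes visible to for_each_online_cpu()-style walkers. */
	cpu->online = true;
	if (!cpu->kvm_cb_done)
		cpu->online_before_kvm_cb = true;

	/* ONLINE section: runs only after the CPU is already online. */
	if (kvm_section == DEMO_SECTION_ONLINE)
		cpu->kvm_cb_done = true;
}
```

With DEMO_SECTION_STARTING the window never opens; with DEMO_SECTION_ONLINE
the CPU is briefly online without the callback having run, which is the case
the linked patch worries about.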

Btw I saw v4 of Chao's patchset was sent in February this year.  It seems that
series indeed improved the CPU compatibility check and hotplug handling.  Any
reason it wasn't merged?

> 
> arm64 is also quite evil and circumvents KVM's hardware enabling logic to some extent.
> kvm_arch_init() => init_subsystems() unconditionally enables hardware, and for pKVM
> _leaves_ hardware enabled.  And then hyp_init_cpu_pm_notifier() disables/enables
> hardware across lower power enter+exit, except if pKVM is enabled.  The icing on
> the cake is "disabling" hardware doesn't even do anything (AFAICT) if the kernel is
> running at EL2 (which I think is nVHE + not-pKVM?).
> 
> PPC apparently didn't want to be left out of the party, and despite having a nop
> for kvm_arch_hardware_disable(), it does its own "is KVM enabled" tracking (see
> kvm_hv_vm_(de)activated()).  At least PPC gets the cpus_read_(un)lock() stuff right...
> 
> MIPS doesn't appear to have any shenanigans, but kvm_vz_hardware_enable() appears
> to be a "heavy" operation, i.e. ideally not something that should be done spuriously.
> 
> s390 and PPC are the only sane architectures and don't require explicit enabling
> of virtualization.
> 
> At a glance, arm64 won't explode, but enabling hardware _twice_ during kvm_init()
> is all kinds of gross.
> 
> Another wart that we can clean up is the cpus_hardware_enabled mask.  I don't see
> any reason KVM needs to use a global mask, a per-cpu variable a la kvm_arm_hardware_enabled
> would do just fine.
> 
> OMG, and there's another bug lurking (I need to stop looking at this code).  Commit
> 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed") added an error
> path that can cause VM creation to fail _after_ it has been added to the list, but
> doesn't unwind _any_ of the stuff done by kvm_arch_post_init_vm() and beyond.
> 
> Rather than trying to rework common KVM to fit all the architectures' random needs,
> I think we should instead overhaul the entire mess.  And we should do that ASAP
> ahead of TDX, though obviously with an eye toward not sucking for TDX.
> 
> Not 100% thought out at this point, but I think we can do:
> 
>   1.  Have x86 snapshot per-CPU user-return MSRs on first use (trivial to do by adding
>       a flag to struct kvm_user_return_msrs, as user_return_msrs is already per-CPU).
> 
>   2.  Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock and
>       cpus_read_lock().

Agreed that the spinlock can and should be replaced.  There seems to be no need
for a spinlock here.

Also agreed that kvm_lock should be used.  But I am not sure whether
cpus_read_lock() is needed (i.e. whether CPU hotplug must be prevented).
Current KVM doesn't do a CPU compatibility check for a hotplugged CPU anyway,
so if a CPU is hot-removed while KVM is checking compatibility via
for_each_online_cpu(), the worst case is that we lose the compatibility check
on that CPU.

Or perhaps I am missing something?
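
For reference, the "count under a sleeping lock" part of (2) could look roughly
like the userspace sketch below.  This is only an illustration under my own
assumptions: a pthread mutex stands in for kvm_lock, all demo_ names are
hypothetical, and the hotplug lock is deliberately left out since that is the
open question.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kvm_lock and kvm_usage_count; not KVM's code. */
pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
int demo_usage_count;
bool demo_hw_enabled;

/* Pretend arch hook; 'fail' lets a caller simulate an enable error. */
int demo_hw_enable(bool fail)
{
	if (fail)
		return -1;
	demo_hw_enabled = true;
	return 0;
}

void demo_hw_disable(void)
{
	demo_hw_enabled = false;
}

/* The first user enables hardware; the count only moves on success. */
int demo_get(bool fail)
{
	int r = 0;

	pthread_mutex_lock(&demo_lock);
	if (demo_usage_count == 0)
		r = demo_hw_enable(fail);
	if (!r)
		demo_usage_count++;
	pthread_mutex_unlock(&demo_lock);
	return r;
}

/* The last user disables hardware. */
void demo_put(void)
{
	pthread_mutex_lock(&demo_lock);
	assert(demo_usage_count > 0);
	if (--demo_usage_count == 0)
		demo_hw_disable();
	pthread_mutex_unlock(&demo_lock);
}
```

Nothing in this pattern needs to run in atomic context, which is why a mutex
looks sufficient to me.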

> 
>   3.  Provide arch hooks that are invoked for "power management" operations (including
>       CPU hotplug and host reboot, hence the quotes).  Note, there's both a platform-
>       wide PM notifier and a per-CPU notifier...
> 
>   4.  Rename kvm_arch_post_init_vm() to e.g. kvm_arch_add_vm(), call it under
>       kvm_lock, and pass in kvm_usage_count.
> 
>   5a. Drop cpus_hardware_enabled and drop the common hardware enable/disable code.
> 
>  or 
> 
>   5b. Expose kvm_hardware_enable_all() and/or kvm_hardware_enable() so that archs
>       don't need to implement their own error handling and per-CPU flags.
> 
> I.e. give each architecture hooks to handle possible transition points, but otherwise
> let arch code decide when and how to do hardware enabling/disabling. 
> 
> I'm very tempted to vote for (5a); x86 is the only architecture that has an error path
> in kvm_arch_hardware_enable(), and trying to get common code to play nice with arm's
> kvm_arm_hardware_enabled logic is probably going to be weird.
> 
> E.g. if we can get the back half of kvm_create_vm() to look like the below, then arch
> code can enable hardware during kvm_arch_add_vm() if the existing count is zero
> without generic KVM needing to worry about when hardware needs to be enabled and
> disabled.
> 
> 	r = kvm_arch_init_vm(kvm, type);
> 	if (r)
> 		goto out_err_no_arch_destroy_vm;
> 
> 	r = kvm_init_mmu_notifier(kvm);
> 	if (r)
> 		goto out_err_no_mmu_notifier;
> 
> 	/*
> 	 * When the fd passed to this ioctl() is opened it pins the module,
> 	 * but try_module_get() also prevents getting a reference if the module
> 	 * is in MODULE_STATE_GOING (e.g. if someone ran "rmmod --wait").
> 	 */
> 	if (!try_module_get(kvm_chardev_ops.owner)) {
> 		r = -ENODEV;
> 		goto out_err;
> 	}
> 
> 	mutex_lock(&kvm_lock);
> 	cpus_read_lock();
> 	r = kvm_arch_add_vm(kvm, kvm_usage_count);

Holding cpus_read_lock() here means CPU hotplug cannot happen during
kvm_arch_add_vm().  This needs a justification/comment to explain why.

Also, assuming we have a justification: based on your description above, an
arch _may_ choose to enable hardware within kvm_arch_add_vm(), but it is not a
_must_.  So maybe remove cpus_read_lock() here and let kvm_arch_add_vm() decide
whether to use it?
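
Something along these lines is what I have in mind.  It is a toy userspace
model only: the hotplug lock is reduced to a reader counter, kvm_lock is not
modelled (the generic caller would still hold it around the hook), and all
demo_ names are hypothetical.

```c
#include <assert.h>

/* Toy model of the CPU hotplug read lock: a reader count, not a real lock. */
int demo_hotplug_readers;
int demo_hotplug_acquisitions;	/* how many times anyone took it */

void demo_cpus_read_lock(void)
{
	demo_hotplug_readers++;
	demo_hotplug_acquisitions++;
}

void demo_cpus_read_unlock(void)
{
	assert(demo_hotplug_readers > 0);
	demo_hotplug_readers--;
}

/* An arch that enables hardware when the first VM is added. */
int demo_arch_a_add_vm(int usage_count)
{
	if (usage_count)
		return 0;	/* hardware is already on, nothing to do */

	demo_cpus_read_lock();
	/* ... enable hardware on every online CPU here ... */
	demo_cpus_read_unlock();
	return 0;
}

/* An arch with no explicit enable step never touches the hotplug lock. */
int demo_arch_b_add_vm(int usage_count)
{
	(void)usage_count;
	return 0;
}

/* Generic side: call the arch hook, bump the count only on success. */
int demo_add_vm(int (*arch_add_vm)(int), int *usage_count)
{
	int r = arch_add_vm(*usage_count);

	if (!r)
		(*usage_count)++;
	return r;
}
```

This way only the arch that walks online CPUs pays for blocking hotplug, and
only on the 0 -> 1 transition.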

> 	if (r)
> 		goto out_final;
> 	kvm_usage_count++;
> 	list_add(&kvm->vm_list, &vm_list);
> 	cpus_read_unlock();
> 	mutex_unlock(&kvm_lock);
> 
> 	if (r)
> 		goto out_put_module;
> 
> 	preempt_notifier_inc();
> 	kvm_init_pm_notifier(kvm);
> 
> 	return kvm;
> 
> out_final:
> 	cpus_read_unlock();
> 	mutex_unlock(&kvm_lock);
> 	module_put(kvm_chardev_ops.owner);
> out_err_no_put_module:
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> 	if (kvm->mmu_notifier.ops)
> 		mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
> #endif
> out_err_no_mmu_notifier:
> 	kvm_arch_destroy_vm(kvm);