On Thu, 2022-08-11 at 17:39 +0000, Sean Christopherson wrote:
> +Will (for arm crud)
>
> On Thu, Aug 11, 2022, Huang, Kai wrote:
> > First of all, I think the patch title can be improved.  "refactor CPU
> > compatibility check on module initialization" isn't the purpose of this patch.
> > It is just a bonus.  The title should reflect the main purpose (or behaviour) of
> > this patch:
> >
> > 	KVM: Temporarily enable hardware on all cpus during module loading time
> > ...
> > > +	/* hardware_enable_nolock() checks CPU compatibility on each CPUs. */
> > > +	r = hardware_enable_all();
> > > +	if (r)
> > > +		goto out_free_2;
> > > +	/*
> > > +	 * Arch specific initialization that requires to enable virtualization
> > > +	 * feature.  e.g. TDX module initialization requires VMXON on all
> > > +	 * present CPUs.
> > > +	 */
> > > +	kvm_arch_post_hardware_enable_setup(opaque);
> > > +	/*
> > > +	 * Make hardware disabled after the KVM module initialization.  KVM
> > > +	 * enables hardware when the first KVM VM is created and disables
> > > +	 * hardware when the last KVM VM is destroyed.  When no KVM VM is
> > > +	 * running, hardware is disabled.  Keep that semantics.
> > > +	 */
> >
> > Except the first sentence, the remaining sentences are more like changelog
> > material.  Perhaps just say something below to be more specific on the purpose:
> >
> > 	/*
> > 	 * Disable hardware on all cpus so that out-of-tree drivers which
> > 	 * also use hardware-assisted virtualization (such as virtualbox
> > 	 * kernel module) can still be loaded when KVM is loaded.
> > 	 */
> >
> > > +	hardware_disable_all();
> > >
> > > 	r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
> > > 				      kvm_starting_cpu, kvm_dying_cpu);
>
> I've been poking at the "hardware enable" code this week for other reasons, and
> have come to the conclusion that the current implementation is a mess.

Thanks for the lengthy reply :)

First of all, to clarify, I guess by "current implementation" you mean the
current upstream KVM code, but not this particular patch? :)

>
> x86 overloads "hardware enable" to do three different things:
>
>   1. actually enable hardware
>   2. snapshot per-CPU MSR value for user-return MSRs
>   3. handle unstable TSC _for existing VMs_ on suspend+resume and/or CPU hotplug
>
> #2 and #3 have nothing to do with enabling hardware, kvm_arch_hardware_enable() just
> so happens to be called in a superset of what is needed for dealing with unstable TSCs,
> and AFAICT the user-return MSRs is simply a historical wart.  The user-return MSRs
> code is subtly very, very nasty, as it means that KVM snapshots MSRs from IRQ context,
> e.g. if an out-of-tree module is running VMs, the IRQ can interrupt the _guest_ and
> cause KVM to snapshot guest registers.  VMX and SVM kinda sorta guard against this
> by refusing to load if VMX/SVM are already enabled, but it's not foolproof.
>
> Eww, and #3 is broken.  If CPU (un)hotplug collides with kvm_destroy_vm() or
> kvm_create_vm(), kvm_arch_hardware_enable() could explode due to vm_list being
> modified while it's being walked.

Agreed.

>
> Of course, that path is broken for other reasons too, e.g. needs to prevent CPUs
> from going on/off-line when KVM is enabling hardware.
> https://lore.kernel.org/all/20220216031528.92558-7-chao.gao@xxxxxxxxx

If I read correctly, the problem described in the above link seems to be true
only after we move CPUHP_AP_KVM_STARTING from the STARTING section to the ONLINE
section, but this hasn't been done yet in the current upstream KVM.  Currently,
CPUHP_AP_KVM_STARTING is still in the STARTING section, so it is guaranteed to
have been executed before start_secondary sets the CPU in the online cpu mask.

Btw I saw v4 of Chao's patchset was sent in Feb this year.  It seems that series
indeed improved the CPU compatibility check and hotplug handling.  Any reason
that series wasn't merged?

>
> arm64 is also quite evil and circumvents KVM's hardware enabling logic to some extent.
> kvm_arch_init() => init_subsystems() unconditionally enables hardware, and for pKVM
> _leaves_ hardware enabled.  And then hyp_init_cpu_pm_notifier() disables/enables
> hardware across lower power enter+exit, except if pKVM is enabled.  The icing on
> the cake is "disabling" hardware doesn't even do anything (AFAICT) if the kernel is
> running at EL2 (which I think is nVHE + not-pKVM?).
>
> PPC apparently didn't want to be left out of the party, and despite having a nop
> for kvm_arch_hardware_disable(), it does its own "is KVM enabled" tracking (see
> kvm_hv_vm_(de)activated()).  At least PPC gets the cpus_read_(un)lock() stuff right...
>
> MIPS doesn't appear to have any shenanigans, but kvm_vz_hardware_enable() appears
> to be a "heavy" operation, i.e. ideally not something that should be done spuriously.
>
> s390 and PPC are the only sane architectures and don't require explicit enabling
> of virtualization.
>
> At a glance, arm64 won't explode, but enabling hardware _twice_ during kvm_init()
> is all kinds of gross.
>
> Another wart that we can clean up is the cpus_hardware_enabled mask.  I don't see
> any reason KVM needs to use a global mask, a per-cpu variable a la kvm_arm_hardware_enabled
> would do just fine.
>
> OMG, and there's another bug lurking (I need to stop looking at this code).  Commit
> 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed") added an error
> path that can cause VM creation to fail _after_ it has been added to the list, but
> doesn't unwind _any_ of the stuff done by kvm_arch_post_init_vm() and beyond.
>
> Rather than trying to rework common KVM to fit all the architectures' random needs,
> I think we should instead overhaul the entire mess.  And we should do that ASAP
> ahead of TDX, though obviously with an eye toward not sucking for TDX.
>
> Not 100% thought out at this point, but I think we can do:
>
>   1. Have x86 snapshot per-CPU user-return MSRs on first use (trivial to do by adding
>      a flag to struct kvm_user_return_msrs, as user_return_msrs is already per-CPU).
>
>   2. Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock and
>      cpus_read_lock().

Agreed that the spinlock should/can be removed/replaced.  It seems there's no
need to use a spinlock.  Also agreed that kvm_lock should be used.

But I am not sure whether cpus_read_lock() is needed (whether CPU hotplug should
be prevented).  In current KVM, we don't do a CPU compatibility check for a
hotplugged CPU anyway, so when KVM does the CPU compatibility check using
for_each_online_cpu(), if CPU hotplug (hot-removal) happens, the worst case is
that we lose the compatibility check on that CPU.

Or perhaps I am missing something?

>
>   3. Provide arch hooks that are invoked for "power management" operations (including
>      CPU hotplug and host reboot, hence the quotes).  Note, there's both a platform-
>      wide PM notifier and a per-CPU notifier...
>
>   4. Rename kvm_arch_post_init_vm() to e.g. kvm_arch_add_vm(), call it under
>      kvm_lock, and pass in kvm_usage_count.
>
>   5a. Drop cpus_hardware_enabled and drop the common hardware enable/disable code.
>
>   or
>
>   5b. Expose kvm_hardware_enable_all() and/or kvm_hardware_enable() so that archs
>       don't need to implement their own error handling and per-CPU flags.
>
> I.e. give each architecture hooks to handle possible transition points, but otherwise
> let arch code decide when and how to do hardware enabling/disabling.
>
> I'm very tempted to vote for (5a); x86 is the only architecture that has an error path
> in kvm_arch_hardware_enable(), and trying to get common code to play nice with arm's
> kvm_arm_hardware_enabled logic is probably going to be weird.
>
> E.g. if we can get the back half of kvm_create_vm() to look like the below, then arch
> code can enable hardware during kvm_arch_add_vm() if the existing count is zero
> without generic KVM needing to worry about when hardware needs to be enabled and
> disabled.
>
> 	r = kvm_arch_init_vm(kvm, type);
> 	if (r)
> 		goto out_err_no_arch_destroy_vm;
>
> 	r = kvm_init_mmu_notifier(kvm);
> 	if (r)
> 		goto out_err_no_mmu_notifier;
>
> 	/*
> 	 * When the fd passed to this ioctl() is opened it pins the module,
> 	 * but try_module_get() also prevents getting a reference if the module
> 	 * is in MODULE_STATE_GOING (e.g. if someone ran "rmmod --wait").
> 	 */
> 	if (!try_module_get(kvm_chardev_ops.owner)) {
> 		r = -ENODEV;
> 		goto out_err;
> 	}
>
> 	mutex_lock(&kvm_lock);
> 	cpus_read_lock();
> 	r = kvm_arch_add_vm(kvm, kvm_usage_count);

Holding cpus_read_lock() here implies CPU hotplug cannot happen during
kvm_arch_add_vm().  This needs a justification/comment to explain why.

Also, assuming we have a justification: based on your description above, arch
code _may_ choose to enable hardware within kvm_arch_add_vm(), but it is not a
_must_.  So maybe remove cpus_read_lock() here and let kvm_arch_add_vm() decide
whether to use it?

> 	if (r)
> 		goto out_final;
> 	kvm_usage_count++;
> 	list_add(&kvm->vm_list, &vm_list);
> 	cpus_read_unlock();
> 	mutex_unlock(&kvm_lock);
>
> 	if (r)
> 		goto out_put_module;
>
> 	preempt_notifier_inc();
> 	kvm_init_pm_notifier(kvm);
>
> 	return kvm;
>
> out_final:
> 	cpus_read_unlock();
> 	mutex_unlock(&kvm_lock);
> 	module_put(kvm_chardev_ops.owner);
> out_err_no_put_module:
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> 	if (kvm->mmu_notifier.ops)
> 		mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
> #endif
> out_err_no_mmu_notifier:
> 	kvm_arch_destroy_vm(kvm);
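
To check my understanding of the "arch hook sees the current count" idea, here
is a minimal userspace sketch in plain C with a pthread mutex.  It is not kernel
code, and every model_* name is hypothetical: model_lock stands in for kvm_lock,
model_usage_count for kvm_usage_count, and model_arch_add_vm() for the proposed
kvm_arch_add_vm().  It deliberately leaves out cpus_read_lock(), since whether
hotplug must be blocked here is exactly the open question.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these names exist in KVM. */
static pthread_mutex_t model_lock = PTHREAD_MUTEX_INITIALIZER; /* plays kvm_lock */
static int model_usage_count;   /* plays kvm_usage_count */
static bool model_hw_enabled;   /* whether "hardware" is currently enabled */
static bool model_fail_enable;  /* knob: force the enable step to fail */

/* Arch hook: enable hardware only when the first VM is being added. */
static int model_arch_add_vm(int usage_count)
{
	if (usage_count > 0)
		return 0;
	if (model_fail_enable)
		return -EBUSY;
	model_hw_enabled = true;
	return 0;
}

/* Arch hook: disable hardware once the last VM is gone. */
static void model_arch_remove_vm(int usage_count)
{
	if (usage_count == 0)
		model_hw_enabled = false;
}

/* Back half of VM creation: the count changes only if the hook succeeds. */
int model_create_vm(void)
{
	int r;

	pthread_mutex_lock(&model_lock);
	r = model_arch_add_vm(model_usage_count);
	if (!r)
		model_usage_count++;
	pthread_mutex_unlock(&model_lock);
	return r;
}

/* VM destruction: callers must balance this with a successful create. */
void model_destroy_vm(void)
{
	pthread_mutex_lock(&model_lock);
	assert(model_usage_count > 0);
	model_arch_remove_vm(--model_usage_count);
	pthread_mutex_unlock(&model_lock);
}
```

In this model the generic side never decides when hardware is enabled or
disabled.  It only serializes the hook and the count update under one lock, so a
failed enable leaves the count untouched and there is nothing to unwind.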