On Thu, Mar 02, 2023, Huang, Kai wrote:
> On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote:
> > On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote:
> > > Make setup_vmcs_config() preemption disabled so it always performs on
> > > the same local cpu.
> > >
> > > During module loading time, KVM intends to call setup_vmcs_config() to
> > > set up the global VMCS configurations on _one_ cpu in hardware_setup(),

That may have been the very original intention, but I don't think it has been
the true intention for a very long time.

> > > Change the existing setup_vmcs_config() to __setup_vmcs_config() and
> > > call the latter directly in the compatibility check code path.  Change
> > > setup_vmcs_config() to call __setup_vmcs_config() with preemption
> > > disabled so __setup_vmcs_config() is always done on the same cpu.
> >
> > Maybe you can simply disable preemption in hardware_setup() although I
> > don't have a strong preference.
> >
> > nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of
> > vmcs_conf, should it be called on the same CPU as setup_vmcs_config()?
>
> Yes I think so.  I missed this :)
>
> Not sure whether there are other similar places too even outside of
> hardware_setup().
>
> But compatibility check only checks things calculated via setup_vmcs_config()
> and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put
> hardware_setup() inside preemption disabled.

Disabling preemption across hardware_setup() isn't feasible as there are a
number of allocations that might sleep.  But disabling preemption isn't
necessary to ensure setup runs on one CPU, that only requires disabling
_migration_.
So _if_ we want to handle this in the kernel, we could simply do:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 541982de5762..9126fdf02649 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9470,7 +9470,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 	int r;
 
 	mutex_lock(&vendor_module_lock);
+	migrate_disable();
 	r = __kvm_x86_vendor_init(ops);
+	migrate_enable();
 	mutex_unlock(&vendor_module_lock);
 	return r;

But I'm not convinced we should handle this in the kernel.  Many of the
checks, especially in SVM, query boot_cpu_has(), not this_cpu_has(), i.e. to
truly perform setup on a single CPU, all of those would need to be converted
to this_cpu_has().

Some of those boot_cpu_has() calls should be changed regardless of whether or
not migration is disabled, e.g. kvm_is_svm_supported() is arguably straight
up buggy due to cpu_has_svm() checking the boot CPU (I'll fix that by adding
a patch after open coding cpu_has_svm() into kvm_is_svm_supported()[*]).

But things like kvm_timer_init() should NOT be blindly converted to
this_cpu_has(), because the teardown path needs to mirror the setup path,
e.g. if KVM ended up running on Frankenstein hardware where not all CPUs have
a constant TSC, KVM could leave a callback dangling and hose the kernel.
Obviously such hardware wouldn't correctly run VMs, but crashing the kernel
is a touch worse than KVM not working correctly.

I'm not totally against converting to this_cpu_has() for the setup, as it
would be more intuitive in a lot of ways.  But I don't think pinning the task
actually hardens KVM in a meaningful way.  If there are any divergences
between CPUs, then either KVM will notice before running VMs, e.g. the VMCS
sanity checks, or KVM will never notice, e.g. the myriad runtime paths that
check boot_cpu_has() (or variants thereof) without sanity checking across
CPUs.

And if userspace _really_ wants to have guarantees about how setup is
performed, e.g.
for repeatable, deterministic behavior, then userspace should force loading
of KVM to be done on CPU0.

So my vote is to leave things as-is (modulo the cpu_has_svm() mess).  But
maybe add documentation to explain the caveats about loading KVM, and how
userspace can mitigate those caveats?

[*] https://lore.kernel.org/all/20221201232655.290720-14-seanjc@xxxxxxxxxx
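As a concrete illustration of the userspace mitigation mentioned above, one
way (a sketch, assuming util-linux taskset is available and the Intel module
name) is to pin the module load, and thus hardware_setup(), to CPU0:

```shell
# Pin the KVM module load (and thus hardware_setup()) to CPU0.
# Requires root; module name varies (kvm_intel vs kvm_amd):
#
#   taskset -c 0 modprobe kvm_intel
#
# The pinning itself can be demonstrated with a harmless command: the
# child's allowed-CPU mask is restricted to CPU0 before exec.
taskset -c 0 grep Cpus_allowed_list /proc/self/status
```

On a typical Linux box this prints `Cpus_allowed_list: 0`, i.e. anything
launched this way, modprobe included, can only ever run on CPU0.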