On Wed, 2023-03-08 at 13:03 -0800, Sean Christopherson wrote:
> On Wed, Mar 08, 2023, Huang, Kai wrote:
> > On Tue, 2023-03-07 at 09:17 -0800, Sean Christopherson wrote:
> > > On Thu, Mar 02, 2023, Huang, Kai wrote:
> > > > On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote:
> > > > > On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote:
> > > > > > Make setup_vmcs_config() preemption disabled so it always performs on
> > > > > > the same local cpu.
> > > > > >
> > > > > > During module loading time, KVM intends to call setup_vmcs_config() to
> > > > > > set up the global VMCS configurations on _one_ cpu in hardware_setup(),
> > >
> > > That may have been the very original intention, but I don't think it has been the
> > > true intention for a very long time.
> >
> > Wondering what's the current intention?
>
> I don't think there's a deliberate "intention" beyond "does it work?".  Like many
> of the historical bits of KVM x86, I think this is a case of the original authors
> _wanting_ to provide certain behavior, but not actually ensuring that behavior in
> code.

Yep.

> > > > > > Change the existing setup_vmcs_config() to __setup_vmcs_config() and
> > > > > > call the latter directly in the compatibility check code path.  Change
> > > > > > setup_vmcs_config() to call __setup_vmcs_config() with preemption
> > > > > > disabled so __setup_vmcs_config() is always done on the same cpu.
> > > > >
> > > > > Maybe you can simply disable preemption in hardware_setup() although I
> > > > > don't have a strong preference.
> > > > >
> > > > > nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of
> > > > > vmcs_conf, should it be called on the same CPU as setup_vmcs_config()?
> > > >
> > > > Yes I think so.  I missed this :)
> > > >
> > > > Not sure whether there are other similar places too even outside of
> > > > hardware_setup().
> > > >
> > > > But compatibility check only checks things calculated via setup_vmcs_config()
> > > > and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put
> > > > hardware_setup() inside preemption disabled.
> > >
> > > Disabling preemption across hardware_setup() isn't feasible as there are a number
> > > of allocations that might sleep.  But disabling preemption isn't necessary to
> > > ensure setup runs on one CPU, that only requires disabling _migration_.  So _if_
> > > we want to handle this in the kernel, we could simply do:
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 541982de5762..9126fdf02649 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -9470,7 +9470,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
> > >          int r;
> > >
> > >          mutex_lock(&vendor_module_lock);
> > > +        migrate_disable();
> > >          r = __kvm_x86_vendor_init(ops);
> > > +        migrate_enable();
> > >          mutex_unlock(&vendor_module_lock);
> > >
> > >          return r;
> > >
> > > But I'm not convinced we should handle this in the kernel.  Many of the checks,
> > > especially in SVM, query boot_cpu_has(), not this_cpu_has(), i.e. to truly perform
> > > setup on a single CPU, all of those would need to be converted to this_cpu_has().
> > >
> > > Some of those boot_cpu_has() calls should be changed regardless of whether or not
> > > migration is disabled, e.g. kvm_is_svm_supported() is arguably straight up buggy
> > > due to cpu_has_svm() checking the boot CPU (I'll fix that by adding a patch after
> > > open coding cpu_has_svm() into kvm_is_svm_supported()[*]).
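
(Purely as an illustrative aside -- not your actual fix -- a minimal sketch of what
a current-CPU check could look like, assuming the caller has a stable CPU, e.g.
runs in IRQ context or with preemption/migration disabled; the function name is
made up and the real check has more to it:)

  /*
   * Illustrative sketch only: query SVM support on the *current* CPU
   * instead of the boot CPU.  this_cpu_has() is only meaningful if the
   * calling task cannot migrate while the check runs.
   */
  static bool example_is_svm_supported(void)
  {
          u64 vm_cr;

          if (!this_cpu_has(X86_FEATURE_SVM))
                  return false;

          /* SVM can be disabled by the BIOS via MSR_VM_CR. */
          rdmsrl(MSR_VM_CR, vm_cr);
          if (vm_cr & (1ULL << SVM_VM_CR_SVM_DISABLE))
                  return false;

          return true;
  }
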
> > > But things like kvm_timer_init() should NOT be blindly converted to this_cpu_has(),
> > > because the teardown path needs to mirror the setup path, e.g. if KVM ended up
> > > running on frankenstein hardware where not all CPUs have a constant TSC, KVM could
> > > leave a callback dangling and hose the kernel.  Obviously such hardware wouldn't
> > > correctly run VMs, but crashing the kernel is a touch worse than KVM not working
> > > correctly.
> > >
> > > I'm not totally against converting to this_cpu_has() for the setup, as it would be
> > > more intuitive in a lot of ways.  But, I don't think pinning the task actually
> > > hardens KVM in a meaningful way.  If there are any divergences between CPUs, then
> > > either KVM will notice before running VMs, e.g. the VMCS sanity checks, or KVM will
> > > never notice, e.g. the myriad runtime paths that check boot_cpu_has() (or variants
> > > thereof) without sanity checking across CPUs.  And if userspace _really_ wants to
> > > have guarantees about how setup is performed, e.g. for repeatable, deterministic
> > > behavior, then userspace should force loading of KVM to be done on CPU0.
> >
> > My intention is never for userspace, but simply/purely from the compatibility
> > check's point of view (see below).  Also, I don't think userspace wants to
> > guarantee anything -- it just wants to load the KVM module.
>
> That very much depends on the use case.  For personal usage of KVM, it's extremely
> unlikely that userspace is doing anything remotely sophisticated.  But for a more
> "formal" deployment, userspace absolutely has its hands all over the system, e.g.
> scheduling VMs across systems, monitoring the health of the system, etc.  Whether
> or not userspaces actually do tightly control loading KVM is another matter...

Agreed.

> > It's even arguable that it may be an acceptable behaviour to fail to run any
> > VM even if loading the module was successful.
> >
> > > So my vote is to leave things as-is (modulo the cpu_has_svm() mess).  But maybe add
> > > documentation to explain the caveats about loading KVM, and how userspace can
> > > mitigate those caveats?
> >
> > I made this patch because I have some other patches to move VMXON support out of
> > KVM in order to support TDX, but so far those patches are not included in that
> > series (and I'd like to leave it out if we really don't need it).
>
> Me too. :-)
>
> > In the patch to move VMXON out of KVM, I changed to use a per-cpu variable to
> > cache the MSR_IA32_VMX_BASIC value and set up the VMXON region when a CPU is
> > becoming online.  And setup_vmcs_config() is changed to use __this_cpu_read() to
> > read the per-cpu MSR value instead of reading from hardware.  Obviously w/o
> > preempt_disable() or similar, __this_cpu_read() can report a kernel bug:
> >
> >   printk(KERN_ERR "BUG: using %s%s() in preemptible [%08x] code: %s/%d\n",
> >          what1, what2, preempt_count() - 1, current->comm, current->pid);
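
(For reference, a rough, purely illustrative sketch of that approach -- the names
and structure are made up and this is not the actual VMXON-rework patches; it
assumes the usual percpu/MSR kernel headers:)

  /* Per-cpu cache of MSR_IA32_VMX_BASIC, filled when a CPU comes online. */
  static DEFINE_PER_CPU(u64, vmx_basic_cache);

  /* CPU hotplug "online" callback, runs on the CPU that is coming up. */
  static int example_vmx_online_cpu(unsigned int cpu)
  {
          u64 basic;

          rdmsrl(MSR_IA32_VMX_BASIC, basic);
          per_cpu(vmx_basic_cache, cpu) = basic;
          /* ... also set up the VMXON region for this CPU ... */
          return 0;
  }

  static void example_setup_vmcs_config(void)
  {
          u64 basic;

          /*
           * __this_cpu_read() requires a stable CPU; without
           * preempt_disable() (or similar), CONFIG_DEBUG_PREEMPT reports
           * the "BUG: using __this_cpu_read() in preemptible" splat above.
           */
          preempt_disable();
          basic = __this_cpu_read(vmx_basic_cache);
          preempt_enable();

          /* ... derive vmcs_conf from @basic and the other VMX MSRs ... */
  }
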
> > That being said, I am fine to keep the existing code, even w/o documenting.  We can
> > discuss more how to handle this when we really want to move VMXON out of KVM (i.e.
> > supporting TDX IO?).
> >
> > Or we can just fix the compatibility check part?  For instance, move
> > setup_vmcs_config() and nested_vmx_setup_ctls_msrs() together in
> > hardware_setup() and call preempt_disable() around them?
>
> Eh, the compatibility checks we really care about run in IRQ context, i.e. they're
> guaranteed to have a stable CPU.  Splitting the _setup_ for the compatibility
> checks across multiple CPUs isn't a problem because KVM will still get the right
> "answer", i.e. any divergence will be detected (barring _very_ flaky hardware that
> might get false negatives anyways).
>
> Don't get me wrong, I agree it's ugly, but I don't want to go halfway.  I either
> want to guard the whole thing, or nothing, and I can't convince myself that
> guarding everything is worthwhile since userspace can (and IMO should) do a better
> job.

Agreed.  Let's just leave the current code as-is.

Thanks for your time!