On Wed, 2023-03-08 at 13:03 -0800, Sean Christopherson wrote:
> On Wed, Mar 08, 2023, Huang, Kai wrote:
> > On Tue, 2023-03-07 at 09:17 -0800, Sean Christopherson wrote:
> > > On Thu, Mar 02, 2023, Huang, Kai wrote:
> > > > On Thu, 2023-03-02 at 13:36 +0800, Gao, Chao wrote:
> > > > > On Wed, Mar 01, 2023 at 11:54:38PM +1300, Kai Huang wrote:
> > > > > > Make setup_vmcs_config() preemption disabled so it always performs on
> > > > > > the same local cpu.
> > > > > >
> > > > > > During module loading time, KVM intends to call setup_vmcs_config() to
> > > > > > set up the global VMCS configurations on _one_ cpu in hardware_setup(),
> > >
> > > That may have been the very original intention, but I don't think it has been the
> > > true intention for a very long time.
> >
> > Wondering what's the current intention?
>
> I don't think there's a deliberate "intention" beyond "does it work?".  Like many
> of the historical bits of KVM x86, I think this is a case of the original authors
> _wanting_ to provide certain behavior, but not actually ensuring that behavior in
> code.

Yep.

> > > > > > Change the existing setup_vmcs_config() to __setup_vmcs_config() and
> > > > > > call the latter directly in the compatibility check code path.  Change
> > > > > > setup_vmcs_config() to call __setup_vmcs_config() with preemption
> > > > > > disabled so __setup_vmcs_config() is always done on the same cpu.
> > > > >
> > > > > Maybe you can simply disable preemption in hardware_setup() although I
> > > > > don't have a strong preference.
> > > > >
> > > > > nested_vmx_setup_ctls_msrs() also reads some MSRs and sets up part of
> > > > > vmcs_conf, should it be called on the same CPU as setup_vmcs_config()?
> > > >
> > > > Yes I think so.  I missed this :)
> > > >
> > > > Not sure whether there are other similar places too even outside of
> > > > hardware_setup().
> > > >
> > > > But compatibility check only checks things calculated via setup_vmcs_config()
> > > > and nested_vmx_setup_ctls_msrs(), so I think it's fair to only put
> > > > hardware_setup() inside preemption disabled.
> > >
> > > Disabling preemption across hardware_setup() isn't feasible as there are a number
> > > of allocations that might sleep.  But disabling preemption isn't necessary to
> > > ensure setup runs on one CPU, that only requires disabling _migration_.  So _if_
> > > we want to handle this in the kernel, we could simply do:
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 541982de5762..9126fdf02649 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -9470,7 +9470,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
> > >          int r;
> > >
> > >          mutex_lock(&vendor_module_lock);
> > > +        migrate_disable();
> > >          r = __kvm_x86_vendor_init(ops);
> > > +        migrate_enable();
> > >          mutex_unlock(&vendor_module_lock);
> > >
> > >          return r;
> > >
> > > But I'm not convinced we should handle this in the kernel.  Many of the checks,
> > > especially in SVM, query boot_cpu_has(), not this_cpu_has(), i.e. to truly perform
> > > setup on a single CPU, all of those would need to be converted to this_cpu_has().
> > >
> > > Some of those boot_cpu_has() calls should be changed regardless of whether or not
> > > migration is disabled, e.g. kvm_is_svm_supported() is arguably straight up buggy
> > > due to cpu_has_svm() checking the boot CPU (I'll fix that by adding a patch after
> > > open coding cpu_has_svm() into kvm_is_svm_supported()[*]).
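
(Purely as an illustrative aside -- not your actual fix -- a minimal sketch of what
a current-CPU check could look like, assuming the caller has a stable CPU, e.g.
runs in IRQ context or with preemption/migration disabled; the function name is
made up and the real check has more to it:)

  /*
   * Illustrative sketch only: query SVM support on the *current* CPU
   * instead of the boot CPU.  this_cpu_has() is only meaningful if the
   * calling task cannot migrate while the check runs.
   */
  static bool example_is_svm_supported(void)
  {
          u64 vm_cr;

          if (!this_cpu_has(X86_FEATURE_SVM))
                  return false;

          /* SVM can be disabled by the BIOS via MSR_VM_CR. */
          rdmsrl(MSR_VM_CR, vm_cr);
          if (vm_cr & (1ULL << SVM_VM_CR_SVM_DISABLE))
                  return false;

          return true;
  }
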
> > > But things like kvm_timer_init() should NOT be blindly converted to this_cpu_has(),
> > > because the teardown path needs to mirror the setup path, e.g. if KVM ended up
> > > running on frankenstein hardware where not all CPUs have a constant TSC, KVM could
> > > leave a callback dangling and hose the kernel.  Obviously such hardware wouldn't
> > > correctly run VMs, but crashing the kernel is a touch worse than KVM not working
> > > correctly.
> > >
> > > I'm not totally against converting to this_cpu_has() for the setup, as it would be
> > > more intuitive in a lot of ways.  But, I don't think pinning the task actually
> > > hardens KVM in a meaningful way.  If there are any divergences between CPUs, then
> > > either KVM will notice before running VMs, e.g. the VMCS sanity checks, or KVM will
> > > never notice, e.g. the myriad runtime paths that check boot_cpu_has() (or variants
> > > thereof) without sanity checking across CPUs.  And if userspace _really_ wants to
> > > have guarantees about how setup is performed, e.g. for repeatable, deterministic
> > > behavior, then userspace should force loading of KVM to be done on CPU0.
> >
> > My intention is never for userspace, but simply/purely from the compatibility
> > check's point of view (see below).  Also, I don't think userspace wants to
> > guarantee anything -- it just wants to load the KVM module.
>
> That very much depends on the use case.  For personal usage of KVM, it's extremely
> unlikely that userspace is doing anything remotely sophisticated.  But for a more
> "formal" deployment, userspace absolutely has its hands all over the system, e.g.
> scheduling VMs across systems, monitoring the health of the system, etc.  Whether
> or not userspaces actually do tightly control loading KVM is another matter...

Agreed.

> > It's even arguable that it may be an acceptable behaviour to fail to run any
> > VM even if loading the module was successful.
> >
> > > So my vote is to leave things as-is (modulo the cpu_has_svm() mess).  But maybe add
> > > documentation to explain the caveats about loading KVM, and how userspace can
> > > mitigate those caveats?
> >
> > I made this patch because I have some other patches to move VMXON support out of
> > KVM in order to support TDX, but so far those patches are not included in that
> > series (and I'd like to leave it out if we really don't need it).
>
> Me too. :-)
>
> > In the patch to move VMXON out of KVM, I changed to use a per-cpu variable to
> > cache the MSR_IA32_VMX_BASIC value and set up the VMXON region when a CPU is
> > becoming online.  And setup_vmcs_config() is changed to use __this_cpu_read() to
> > read the per-cpu MSR value instead of reading from hardware.  Obviously w/o
> > preempt_disable() or similar, __this_cpu_read() can report a kernel bug:
> >
> >   printk(KERN_ERR "BUG: using %s%s() in preemptible [%08x] code: %s/%d\n",
> >          what1, what2, preempt_count() - 1, current->comm, current->pid);
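
(For reference, a rough, purely illustrative sketch of that approach -- the names
and structure are made up and this is not the actual VMXON-rework patches; it
assumes the usual percpu/MSR kernel headers:)

  /* Per-cpu cache of MSR_IA32_VMX_BASIC, filled when a CPU comes online. */
  static DEFINE_PER_CPU(u64, vmx_basic_cache);

  /* CPU hotplug "online" callback, runs on the CPU that is coming up. */
  static int example_vmx_online_cpu(unsigned int cpu)
  {
          u64 basic;

          rdmsrl(MSR_IA32_VMX_BASIC, basic);
          per_cpu(vmx_basic_cache, cpu) = basic;
          /* ... also set up the VMXON region for this CPU ... */
          return 0;
  }

  static void example_setup_vmcs_config(void)
  {
          u64 basic;

          /*
           * __this_cpu_read() requires a stable CPU; without
           * preempt_disable() (or similar), CONFIG_DEBUG_PREEMPT reports
           * the "BUG: using __this_cpu_read() in preemptible" splat above.
           */
          preempt_disable();
          basic = __this_cpu_read(vmx_basic_cache);
          preempt_enable();

          /* ... derive vmcs_conf from @basic and the other VMX MSRs ... */
  }
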
> > That being said, I am fine to keep the existing code, even w/o documenting.  We can
> > discuss more how to handle this when we really want to move VMXON out of KVM (i.e.
> > supporting TDX IO?).
> >
> > Or we can just fix the compatibility check part?  For instance, move
> > setup_vmcs_config() and nested_vmx_setup_ctls_msrs() together in
> > hardware_setup() and call preempt_disable() around them?
>
> Eh, the compatibility checks we really care about run in IRQ context, i.e. they're
> guaranteed to have a stable CPU.  Splitting the _setup_ for the compatibility
> checks across multiple CPUs isn't a problem because KVM will still get the right
> "answer", i.e. any divergence will be detected (barring _very_ flaky hardware that
> might get false negatives anyways).
>
> Don't get me wrong, I agree it's ugly, but I don't want to go halfway.  I either
> want to guard the whole thing, or nothing, and I can't convince myself that
> guarding everything is worthwhile since userspace can (and IMO should) do a better
> job.

Agreed.  Let's just leave the current code as-is.

Thanks for your time!