On Thu, Nov 03, 2022 at 04:53:11PM +0000, Sean Christopherson wrote: > > > > > > Ugh, that path got overlooked when we yanked out KVM's manipulaton of VMX MSRs > > > in response to guest CPUID changes. I wonder if we can get away with changing > > > KVM's behavior to only ensure a feature isn't exposed to L2 when it's not exposed > > > to L1. > > > > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > > > index 6b4266e949a3..cfc35d559d91 100644 > > > --- a/arch/x86/kvm/vmx/vmx.c > > > +++ b/arch/x86/kvm/vmx/vmx.c > > > @@ -4523,8 +4523,8 @@ vmx_adjust_secondary_exec_control(struct vcpu_vmx *vmx, u32 *exec_control, > > > * Update the nested MSR settings so that a nested VMM can/can't set > > > * controls for features that are/aren't exposed to the guest. > > > */ > > > - if (nested) { > > > - if (enabled) > > > + if (nested && !enabled) > > > + if (exiting) > > > vmx->nested.msrs.secondary_ctls_high |= control; > > > else > > > vmx->nested.msrs.secondary_ctls_high &= ~control; > > > > > > > Indeed, this change can make sure a feature won't be exposed to L2, if L1 > > does not have it. > > No, that's not the goal of the change. KVM already hides features in the VMX MSRs > if the base feature is not supported in L1 according to guest CPUID. The problem > is that, currently, KVM also _forces_ features to be enabled in the VMX MSRs when > the base feature IS supported in L1 (CPUID). > > Ideally, KVM should NEVER manipulate VMX MSRs in response to guest CPUID changes. > That's what I was referring to earlier by commits: > > 8805875aa473 ("Revert "KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled"") > 9389d5774aca ("Revert "KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL"") > > E.g. if userspace sets VMX MSRs and then sets guest CPUID, KVM will override the > nVMX CPU model defined by userspace. The scenario where userspace hides a "base" > feature but exposes the feature in the VMX MSRs is nonsensical, which is why I > think KVM can likely get away with force-disabling features. > > The reverse is completely legitimate though: hiding a feature in VMX MSRs even if > the base feature is supported for L1, i.e. disallowing L1 from enabling the feature > in L2, is something that real VMMs will actually do, e.g. if the user doesn't trust > that KVM correctly handles all aspects of nested virtualization for the feature. Thanks Sean. Let me try to rephrase my understandings of your statement( and pls feel free to correct me): 1> For now, what vmx_adjust_secondary_exec_control() does, is to enable/ disable a feature in VMX MSR(and nVMX MSR) based on cpuid changes. 2> What makes sense is, if a feature is a. disabled by guest CPUID, it shall not be exposed in guest VMX MSR; b. enabled by guest CPUID, it could be either exposed or hidden in guest VMX MSR. 3> So your previous change is to guarantee 2.a, and userspace VMM can choose to follow follow either choices in 2.b(depending on whether it believes this feature is correctly supported by KVM in nested). Is above understanding correct? But what if userspace VMM sets guest CPUID first, disabling a feature, and then sets the guest VMX MSR bit with this feature enabled? Does KVM need to check guest CPUID again, in vmx_restore_control_msr()? I do not think above scenario is what QEMU does - QEMU checks guest CPUID first with kvm_arch_get_supported_cpuid() before trying to set guest VMX MSR. But I am not sure if this is mandatory step for all userspace VMM. > > In other words, the behavior you're observing, where vmx->nested.msrs.secondary_ctls_high > is changed by vmx_adjust_secondary_exec_control(), is a completely separate bug > than the one below. > > > > > > > Assuming KVM actually supports user wait/pause in L2, this is an orthogonal bug > > > to the CPUID-based manipulation above. KVM simply neglects to advertise to userspace > > > that ENABLE_USR_WAIT_PAUSE is supported for nested virtualization. > > > > > > If KVM doesn't correctly support virtualizing user wait/pause for L2, then the > > > correct location to fix this is in vmx_secondary_exec_control(). > > > > > > > Sorry, why vmx_secondary_exec_control()? > > You missed the qualifier: > > If KVM doesn't correctly support virtualizing user wait/pause for L2 > > If KVM should NOT be exposing ENABLE_USR_WAIT_PAUSE to the L1 VMM, then NOT > propagating the feature to msrs->secondary_ctls_low is correct. And if that's > the case, then vmx_secondary_exec_control() needs to be modified so that it does > NOT set ENABLE_USR_WAIT_PAUSE in vmx->nested.msrs.secondary_ctls_high. > > > Could we just change nested_vmx_setup_ctls_msrs() like below: > > If KVM correctly virtualizes the feature in a nested scenario, yes. I haven't > looked into ENABLE_USR_WAIT_PAUSE enough to know whether or not KVM gets the > nested virtualization pieces correct, hence the above qualifier. > Got it. I'll check that. Thanks! B.R. Yu