Re: [PATCH v5 05/15] KVM: nVMX: Let userspace set nVMX MSR to any _host_ supported value

Yu Zhang <yu.c.zhang@xxxxxxxxxxxxxxx> · Wed, 2 Nov 2022 16:54:14 +0800

On Tue, Nov 01, 2022 at 05:58:21PM +0000, Sean Christopherson wrote:
> On Tue, Nov 01, 2022, Yu Zhang wrote:
> > On Mon, Oct 31, 2022 at 05:11:10PM +0000, Sean Christopherson wrote:
> > > vmcs_config.nested never becomes out-of-date, it's read-only after __init (not
> > > currently marked as such, that will be remedied soon).
> > > 
> > > The auditing performed by KVM is purely to guard against userspace enabling
> > > features that KVM doesn't support.  KVM is not responsible for ensuring that the
> > > vCPU's CPUID model match the VMX MSR model.
> > 
> > Do you mean the VMX MSR model shall not be changed after the cpuid updates?
> 
> No, I mean that the virtual CPU model (CPUID + VMX MSRs) that is presented to the
> guest is the responsibility of host userspace.  KVM only cares about not enabling
> bits/features that KVM doesn't supported.
> 

Oh, I see. We need to guarantee the userspace VMM can not successfully
set a feature in vmx msr, if KVM does not support it.

> > And for VMX MSR model, do you mean the vmx->nested.msrs or the ones in 
> > vmcs_config->nested? 
> 
> vmx->nested.msrs.  vmcs_config->nested is effectively the VMX equivalent of
> KVM_GET_SUPPORTED_CPUID.
> 
> > What I observed is that vmx->nested.msrs.secondary_ctls_high will be changed
> > in vmx_adjust_secondary_exec_control(), which can be triggered after cpuid is
> > set. 
> 
> Ugh, that path got overlooked when we yanked out KVM's manipulaton of VMX MSRs
> in response to guest CPUID changes.  I wonder if we can get away with changing
> KVM's behavior to only ensure a feature isn't exposed to L2 when it's not exposed
> to L1.
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 6b4266e949a3..cfc35d559d91 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4523,8 +4523,8 @@ vmx_adjust_secondary_exec_control(struct vcpu_vmx *vmx, u32 *exec_control,
>          * Update the nested MSR settings so that a nested VMM can/can't set
>          * controls for features that are/aren't exposed to the guest.
>          */
> -       if (nested) {
> -               if (enabled)
> +       if (nested && !enabled)
> +               if (exiting)
>                         vmx->nested.msrs.secondary_ctls_high |= control;
>                 else
>                         vmx->nested.msrs.secondary_ctls_high &= ~control;
> 

Indeed, this change can make sure a feature won't be exposed to L2, if L1
does not have it. But for the feature bits that L1 has, yet cleared from
the vmcs_conf->nested.msrs.secondary_ctls_high in nested_vmx_setup_ctls_msrs(),
there's no chance for userspace vmm to reset it again.

Well, I am not suggesting to give userspace vmm such permission(which I believe
is incorrect). And IIUC, vmcs_conf->nested.msrs.secondary_ctls_high shall serve
as a template to initialize vmx->nested.msrs.secondary_ctls_high. So maybe we
shall not mask off some features in nested_vmx_setup_ctls_msrs() at the beginning.
 
> > Since KVM's config(vmcs_config->nested.secondary_ctls_high) is done during init
> > by nested_vmx_setup_ctls_msrs(), which only kept a subset of the flags from the
> > vmcs_confg->cpu_based_2nd_exec_ctrl, the vmx_restore_control_msr() could fail
> > later, when userspace VMM tries to enable a feature(the only one I witnessed is
> > SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE) by setting MSR_IA32_VMX_PROCBASED_CTLS2.
> > Because the vmx->nested.msrs.secondary_ctls_high is updated by cpuid, but this
> > bit is not taken from vmcs_conf->cpu_based_2nd_exec_ctrl by nested_vmx_setup_ctls_msrs()
> > for vmcs_config->nested.secondary_ctls_high.
> > 
> > The failure can be fixed, simply by adding SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE in
> > nested_vmx_setup_ctls_msrs(), e.g.
> 
> Assuming KVM actually supports user wait/pause in L2, this is an orthogonal bug
> to the CPUID-based manipulation above.  KVM simply neglects to advertise to userspace
> that ENABLE_USR_WAIT_PAUSE is supported for nested virtualization.
> 
> If KVM doesn't correctly support virtualizing user wait/pause for L2, then the
> correct location to fix this is in vmx_secondary_exec_control().
> 

Sorry, why vmx_secondary_exec_control()? Could we just change
nested_vmx_setup_ctls_msrs() like below:

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 0c62352dda6a..fa255391718c 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -6791,13 +6791,7 @@ void nested_vmx_setup_ctls_msrs(struct vmcs_config *vmcs_conf, u32 ept_caps)
        msrs->procbased_ctls_low &=
                ~(CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING);

-       /*
-        * secondary cpu-based controls.  Do not include those that
-        * depend on CPUID bits, they are added later by
-        * vmx_vcpu_after_set_cpuid.
-        */
        msrs->secondary_ctls_low = 0;
-
        msrs->secondary_ctls_high = vmcs_conf->cpu_based_2nd_exec_ctrl;
        msrs->secondary_ctls_high &=
                SECONDARY_EXEC_DESC |
@@ -6810,7 +6804,8 @@ void nested_vmx_setup_ctls_msrs(struct vmcs_config *vmcs_conf, u32 ept_caps)
                SECONDARY_EXEC_ENABLE_INVPCID |
                SECONDARY_EXEC_RDSEED_EXITING |
                SECONDARY_EXEC_XSAVES |
-               SECONDARY_EXEC_TSC_SCALING;
+               SECONDARY_EXEC_TSC_SCALING |
+               SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;

        /*
         * We can emulate "VMCS shadowing," even if the hardware

Note: I did not use "if (cpu_has_vmx_waitpkg())" here, it looks like to
take off one's pants to fart(no offense, just a Chinese old saying meaning
unnecessary acts.:)).

> > > > Another question is about the setting of secondary_ctls_high in
> > > > nested_vmx_setup_ctls_msrs().  I saw there's a comment saying:
> > > > 	"Do not include those that depend on CPUID bits, they are
> > > > 	added later by vmx_vcpu_after_set_cpuid.".
> > > 
> > > That's a stale comment, see the very next commit, 8805875aa473 ("Revert "KVM: nVMX:
> > > Do not expose MPX VMX controls when guest MPX disabled""), as well as the slightly
> > > later commit 9389d5774aca ("Revert "KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL
> > > VM-{Entry,Exit} control"").
> > > 
> > 
> > So the comment can be and shall be removed, right? 
> 
> Yep.
> 
> > > > 	if (cpu_has_vmx_vmfunc()) {
> > > > 		msrs->secondary_ctls_high |=
> > > > 			SECONDARY_EXEC_ENABLE_VMFUNC;
> > > 
> > > This one is still required.  KVM never enables VMFUNC for itself, i.e. it won't
> > > be set in KVM's VMCS configuration.
> > > 
> > 
> > My understanding is that, for VMFUNC, altough KVM does not support it,
> > SECONDARY_EXEC_ENABLE_VMFUNC is still set in the secondary proc-based
> > vm execution ctrol. KVM just uses different handlers for VMFUNC exits
> > from L1(to inject the #UD) and L2(to emulate the eptp switching). So
> > it doesn't matter if we do not clear this bit for vmcs_config->nested.
> > procbased_ctls_high. 
> 
> Ah, you're right, I didn't realize KVM enables VMFUNC in L1.  Enabling VMFUNC for
> L1 is silly though, it's trivial to clear the feature in vmx_secondary_exec_control().
> 
> That said, enabling VMFUNC in vmcs01 is an orthogonal topic, and it _is_ indeed
> easier to keep the feature in the reference config.  Now that the nested config
> is derived from the non-nested config, nested_vmx_setup_ctls_msrs() can do:

Agreed. BTW, do you know why KVM took pains to do so? I mean, emulation for
L2's vmfunc does not rely on the existance of vmfunc, right? So, for L2,
we can just set vmcs02's SECONDARY_EXEC_ENABLE_VMFUNC based on vmcs12. And
for L1, we can just disable it by clearing it in vmx_secondary_exec_control(),
and remove the #UD injection logic from KVM?

B.R.
Yu