On Wed, Jun 22, 2022 at 04:35:27PM +0200, Vitaly Kuznetsov wrote: > Anirudh Rayabharam <anrayabh@xxxxxxxxxxxxxxxxxxx> writes: > > > On Wed, Jun 22, 2022 at 10:00:29AM +0200, Vitaly Kuznetsov wrote: > >> Sean Christopherson <seanjc@xxxxxxxxxx> writes: > >> > >> > On Tue, Jun 14, 2022, Anirudh Rayabharam wrote: > >> >> On Mon, Jun 13, 2022 at 04:57:49PM +0000, Sean Christopherson wrote: > >> > >> ... > >> > >> >> > > >> >> > Any reason not to use the already sanitized vmcs_config? I can't think of any > >> >> > reason why the nested path should blindly use the raw MSR values from hardware. > >> >> > >> >> vmcs_config has the sanitized exec controls. But how do we construct MSR > >> >> values using them? > >> > > >> > I was thinking we could use the sanitized controls for the allowed-1 bits, and then > >> > take the required-1 bits from the CPU. And then if we wanted to avoid the redundant > >> > RDMSRs in a follow-up patch we could add required-1 fields to vmcs_config. > >> > > >> > Hastily constructed and compile-tested only, proceed with caution :-) > >> > >> Independently from "[PATCH 00/11] KVM: VMX: Support TscScaling and > >> EnclsExitingBitmap whith eVMCS" which is supposed to fix the particular > >> TSC scaling issue, I like the idea to make nested_vmx_setup_ctls_msrs() > >> use both allowed-1 and required-1 bits from vmcs_config. I'll pick up > >> the suggested patch and try to construct something for required-1 bits. > > > > I tried this patch today but it causes some regression which causes > > /dev/kvm to be unavailable in L1. I didn't get a chance to look into it > > closely but I am guessing it has something to do with the fact that > > vmcs_config reflects the config that L0 chose to use rather than what is > > available to use. So constructing allowed-1 MSR bits based on what bits > > are set in exec controls maybe isn't correct. > > I've tried to pick it up but it's actually much harder than I think. The > patch has some minor issues ('&vmcs_config.nested' needs to be switched > to '&vmcs_conf->nested' in nested_vmx_setup_ctls_msrs()), but the main > problem is that the set of controls nested_vmx_setup_ctls_msrs() needs > is NOT a subset of vmcs_config (setup_vmcs_config()). I was able to > identify at least: > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 5e14e4c40007..8076352174ad 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -2483,8 +2483,14 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf, > CPU_BASED_INVLPG_EXITING | > CPU_BASED_RDPMC_EXITING; > > - opt = CPU_BASED_TPR_SHADOW | > + opt = CPU_BASED_INTR_WINDOW_EXITING | > + CPU_BASED_NMI_WINDOW_EXITING | > + CPU_BASED_TPR_SHADOW | > + CPU_BASED_USE_IO_BITMAPS | > CPU_BASED_USE_MSR_BITMAPS | > + CPU_BASED_MONITOR_TRAP_FLAG | > + CPU_BASED_RDTSC_EXITING | > + CPU_BASED_PAUSE_EXITING | > CPU_BASED_ACTIVATE_SECONDARY_CONTROLS | > CPU_BASED_ACTIVATE_TERTIARY_CONTROLS; > if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PROCBASED_CTLS, > @@ -2582,6 +2588,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf, > #endif > opt = VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | > VM_EXIT_LOAD_IA32_PAT | > + VM_EXIT_SAVE_IA32_PAT | > VM_EXIT_LOAD_IA32_EFER | > VM_EXIT_CLEAR_BNDCFGS | > VM_EXIT_PT_CONCEAL_PIP | > @@ -2604,7 +2611,11 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf, > _pin_based_exec_control &= ~PIN_BASED_POSTED_INTR; > > min = VM_ENTRY_LOAD_DEBUG_CONTROLS; > - opt = VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | > + opt = > +#ifdef CONFIG_X86_64 > + VM_ENTRY_IA32E_MODE | > +#endif > + VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | > VM_ENTRY_LOAD_IA32_PAT | > VM_ENTRY_LOAD_IA32_EFER | > VM_ENTRY_LOAD_BNDCFGS | > > but it is 1) not sufficient because some controls are smartly filtered > out just because we don't want them for L1 -- and this doesn't mean that > L2 doesn't need them and 2) because if we add some 'opt' controls to > setup_vmcs_config() we need to filter them out somewhere else. > > I'm starting to think we may just want to store raw VMX MSR values in > vmcs_config first, then sanitize them (eVMCS, vmx preemtoion timer bug, > perf_ctrl bug,..) and then do the adjust_vmx_controls() magic. > > I'm not giving up yet but don't expect something small and backportable > to stable :-) How about we do something simple like the patch below to start with? This will easily apply to stable and we can continue improving upon it with follow up patches on mainline. diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index f5cb18e00e78..f88d748c7cc6 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -6564,6 +6564,10 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) msrs->pinbased_ctls_high); msrs->pinbased_ctls_low |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR; +#if IS_ENABLED(CONFIG_HYPERV) + if (static_branch_unlikely(&enable_evmcs)) + msrs->pinbased_ctls_high &= ~EVMCS1_UNSUPPORTED_PINCTRL; +#endif msrs->pinbased_ctls_high &= PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING | @@ -6580,6 +6584,10 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) msrs->exit_ctls_low = VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; +#if IS_ENABLED(CONFIG_HYPERV) + if (static_branch_unlikely(&enable_evmcs)) + msrs->exit_ctls_high &= ~EVMCS1_UNSUPPORTED_VMEXIT_CTRL; +#endif msrs->exit_ctls_high &= #ifdef CONFIG_X86_64 VM_EXIT_HOST_ADDR_SPACE_SIZE | @@ -6600,6 +6608,10 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) msrs->entry_ctls_high); msrs->entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; +#if IS_ENABLED(CONFIG_HYPERV) + if (static_branch_unlikely(&enable_evmcs)) + msrs->entry_ctls_high &= ~EVMCS1_UNSUPPORTED_VMENTRY_CTRL; +#endif msrs->entry_ctls_high &= #ifdef CONFIG_X86_64 VM_ENTRY_IA32E_MODE | @@ -6657,6 +6669,10 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) msrs->secondary_ctls_high); msrs->secondary_ctls_low = 0; +#if IS_ENABLED(CONFIG_HYPERV) + if (static_branch_unlikely(&enable_evmcs)) + msrs->secondary_ctls_high &= ~EVMCS1_UNSUPPORTED_2NDEXEC; +#endif msrs->secondary_ctls_high &= SECONDARY_EXEC_DESC | SECONDARY_EXEC_ENABLE_RDTSCP |