On Thu, Jan 04, 2024, Jim Mattson wrote: > On Thu, Jan 4, 2024 at 7:08 AM Chao Gao <chao.gao@xxxxxxxxx> wrote: > > > > On Wed, Jan 03, 2024 at 07:40:02PM -0800, Jim Mattson wrote: > > >On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@xxxxxxxxx> wrote: > > >> > > >> On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote: > > >> >On Tue, Jan 02, 2024, Jim Mattson wrote: > > >> >> This is all just so broken and wrong. The only guest.MAXPHYADDR that > > >> >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to > > >> >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept > > >> >> every #PF, > > >> > > >> in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only > > >> 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with > > >> RSVD bits 51-48 set leads to EPT violation. > > > > > >At the completion of the page table walk, if there is a permission > > >fault, the data address should not be accessed, so there should not be > > >an EPT violation. Remember Meltdown? > > > > You are right. I missed this case. KVM needs to intercept #PF to set RSVD bit > > in PFEC. > > I have no problem with a user deliberately choosing an unsupported > configuration, but I do have a problem with KVM_GET_SUPPORTED_CPUID > returning an unsupported configuration. +1 Advertising guest.MAXPHYADDR < host.MAXPHYADDR in KVM_GET_SUPPORTED_CPUID simply isn't viable when TDP is enabled. I suppose KVM do so when allow_smaller_maxphyaddr is enabled, but that's just asking for confusion, e.g. if userspace reflects the CPUID back into the guest, it could unknowingly create a VM that depends on allow_smaller_maxphyaddr. I think the least awful option is to have the kernel expose whether or not the CPU support 5-level EPT to userspace. That doesn't even require new uAPI per se, just a new flag in /proc/cpuinfo. It'll be a bit gross for userspace to parse, but it's not the end of the world. Alternatively, KVM could add a capability to enumerate the max *addressable* GPA, but userspace would still need to manually take action when KVM can't address all of memory, i.e. a capability would be less ugly, but wouldn't meaningfully change userspace's responsibilities. I.e. diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h index c6a7eed03914..266daf5b5b84 100644 --- a/arch/x86/include/asm/vmxfeatures.h +++ b/arch/x86/include/asm/vmxfeatures.h @@ -25,6 +25,7 @@ #define VMX_FEATURE_EPT_EXECUTE_ONLY ( 0*32+ 17) /* "ept_x_only" EPT entries can be execute only */ #define VMX_FEATURE_EPT_AD ( 0*32+ 18) /* EPT Accessed/Dirty bits */ #define VMX_FEATURE_EPT_1GB ( 0*32+ 19) /* 1GB EPT pages */ +#define VMX_FEATURE_EPT_5LEVEL ( 0*32+ 20) /* 5-level EPT paging */ /* Aggregated APIC features 24-27 */ #define VMX_FEATURE_FLEXPRIORITY ( 0*32+ 24) /* TPR shadow + virt APIC */ diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c index 03851240c3e3..1640ae76548f 100644 --- a/arch/x86/kernel/cpu/feat_ctl.c +++ b/arch/x86/kernel/cpu/feat_ctl.c @@ -72,6 +72,8 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c) c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_AD); if (ept & VMX_EPT_1GB_PAGE_BIT) c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_1GB); + if (ept & VMX_EPT_PAGE_WALK_5_BIT) + c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_5LEVEL); /* Synthetic APIC features that are aggregates of multiple features. */ if ((c->vmx_capability[PRIMARY_CTLS] & VMX_F(VIRTUAL_TPR)) &&