On Tue, Dec 19, 2023 at 7:26 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > On Tue, Dec 19, 2023, Chao Gao wrote: > > On Mon, Dec 18, 2023 at 07:40:11PM -0800, Jim Mattson wrote: > > >Honestly, I think KVM should just disable EPT if the EPT tables can't > > >support the CPU's physical address width. > > > > Yes, it is an option. > > But I prefer to allow admin to override this (i.e., admin still can enable EPT > > via module parameter) because those issues are not new and disabling EPT > > doesn't prevent QEMU from launching guests w/ smaller MAXPHYADDR. > > > > >> Here nothing visible to selftests or QEMU indicates that guest.MAXPHYADDR = 52 > > >> is invalid/incorrect. how can we say selftests are at fault and we should fix > > >> them? > > > > > >In this case, the CPU is at fault, and you should complain to the CPU vendor. > > > > Yeah, I agree with you and will check with related team inside Intel. > > I agree that the CPU is being weird, but this is technically an architecturally > legal configuration, and KVM has largely committed to supporting weird setups. > At some point we have to draw a line when things get too ridiculous, but I don't > think this particular oddity crosses into absurd territory. > > > My point was just this isn't a selftest issue because not all information is > > disclosed to the tests. > > Ah, right, EPT capabilities are in MSRs that userspace can't read. > > > And I am afraid KVM as L1 VMM may run into this situation, i.e., only 4-level > > EPT is supported but MAXPHYADDR is 52. So, KVM needs a fix anyway. > > Yes, but forcing emulation for a funky setup is not a good fix. KVM can simply > constrain the advertised MAXPHYADDR, no? This is essentially the same as allow_smaller_maxphyaddr, and has the same issues: 1) MOV-to-CR3 must be intercepted in PAE mode, to check the bits above guest.MAXPHYADDR in the PDPTRs. 2) #PF must be intercepted whenever PTEs are 64 bits, to do a software page walk of the guest's x86 page tables and check the bits above guest.MAXPHYADDR for a terminal page fault that might result in a different error code than the one produced by hardware. 3) EPT violations may require instruction emulation to synthesize an appropriate #PF, and this requires the KVM instruction emulator to be expanded to understand *all* memory-accessing instructions, not just the limited set it understands now. I just don't see how this is even remotely practical. > --- > arch/x86/kvm/cpuid.c | 17 +++++++++++++---- > arch/x86/kvm/mmu.h | 2 ++ > arch/x86/kvm/mmu/mmu.c | 5 +++++ > 3 files changed, 20 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c > index 294e5bd5f8a0..5c346e1a10bd 100644 > --- a/arch/x86/kvm/cpuid.c > +++ b/arch/x86/kvm/cpuid.c > @@ -1233,12 +1233,21 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) > * > * If TDP is enabled but an explicit guest MAXPHYADDR is not > * provided, use the raw bare metal MAXPHYADDR as reductions to > - * the HPAs do not affect GPAs. > + * the HPAs do not affect GPAs. Finally, if TDP is enabled and > + * doesn't support 5-level paging, cap guest MAXPHYADDR at 48 > + * bits as KVM can't install SPTEs for larger GPAs. > */ > - if (!tdp_enabled) > + if (!tdp_enabled) { > g_phys_as = boot_cpu_data.x86_phys_bits; > - else if (!g_phys_as) > - g_phys_as = phys_as; > + } else { > + u8 max_tdp_level = kvm_mmu_get_max_tdp_level(); > + > + if (!g_phys_as) > + g_phys_as = phys_as; > + > + if (max_tdp_level < 5) > + g_phys_as = min(g_phys_as, 48); > + } > > entry->eax = g_phys_as | (virt_as << 8); > entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8)); > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 60f21bb4c27b..b410a227c601 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void) > return boot_cpu_data.x86_phys_bits; > } > > +u8 kvm_mmu_get_max_tdp_level(void); > + > void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask); > void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask); > void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only); > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 3c844e428684..b2845f5520b3 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -5267,6 +5267,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu) > return max_tdp_level; > } > > +u8 kvm_mmu_get_max_tdp_level(void) > +{ > + return tdp_root_level ? tdp_root_level : max_tdp_level; > +} > + > static union kvm_mmu_page_role > kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu, > union kvm_cpu_role cpu_role) > > base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6 > --