Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported

Jim Mattson <jmattson@xxxxxxxxxx> · Wed, 20 Dec 2023 05:39:50 -0800

On Tue, Dec 19, 2023 at 7:26 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Tue, Dec 19, 2023, Chao Gao wrote:
> > On Mon, Dec 18, 2023 at 07:40:11PM -0800, Jim Mattson wrote:
> > >Honestly, I think KVM should just disable EPT if the EPT tables can't
> > >support the CPU's physical address width.
> >
> > Yes, it is an option.
> > But I prefer to allow admin to override this (i.e., admin still can enable EPT
> > via module parameter) because those issues are not new and disabling EPT
> > doesn't prevent QEMU from launching guests w/ smaller MAXPHYADDR.
> >
> > >> Here nothing visible to selftests or QEMU indicates that guest.MAXPHYADDR = 52
> > >> is invalid/incorrect. how can we say selftests are at fault and we should fix
> > >> them?
> > >
> > >In this case, the CPU is at fault, and you should complain to the CPU vendor.
> >
> > Yeah, I agree with you and will check with related team inside Intel.
>
> I agree that the CPU is being weird, but this is technically an architecturally
> legal configuration, and KVM has largely committed to supporting weird setups.
> At some point we have to draw a line when things get too ridiculous, but I don't
> think this particular oddity crosses into absurd territory.
>
> > My point was just this isn't a selftest issue because not all information is
> > disclosed to the tests.
>
> Ah, right, EPT capabilities are in MSRs that userspace can't read.
>
> > And I am afraid KVM as L1 VMM may run into this situation, i.e., only 4-level
> > EPT is supported but MAXPHYADDR is 52. So, KVM needs a fix anyway.
>
> Yes, but forcing emulation for a funky setup is not a good fix.  KVM can simply
> constrain the advertised MAXPHYADDR, no?

This is essentially the same as allow_smaller_maxphyaddr, and has the
same issues:
1) MOV-to-CR3 must be intercepted in PAE mode, to check the bits above
guest.MAXPHYADDR in the PDPTRs.
2) #PF must be intercepted whenever PTEs are 64 bits, to do a software
page walk of the guest's x86 page tables and check the bits above
guest.MAXPHYADDR for a terminal page fault that might result in a
different error code than the one produced by hardware.
3) EPT violations may require instruction emulation to synthesize an
appropriate #PF, and this requires the KVM instruction emulator to be
expanded to understand *all* memory-accessing instructions, not just
the limited set it understands now.

I just don't see how this is even remotely practical.

> ---
>  arch/x86/kvm/cpuid.c   | 17 +++++++++++++----
>  arch/x86/kvm/mmu.h     |  2 ++
>  arch/x86/kvm/mmu/mmu.c |  5 +++++
>  3 files changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 294e5bd5f8a0..5c346e1a10bd 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1233,12 +1233,21 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>                  *
>                  * If TDP is enabled but an explicit guest MAXPHYADDR is not
>                  * provided, use the raw bare metal MAXPHYADDR as reductions to
> -                * the HPAs do not affect GPAs.
> +                * the HPAs do not affect GPAs.  Finally, if TDP is enabled and
> +                * doesn't support 5-level paging, cap guest MAXPHYADDR at 48
> +                * bits as KVM can't install SPTEs for larger GPAs.
>                  */
> -               if (!tdp_enabled)
> +               if (!tdp_enabled) {
>                         g_phys_as = boot_cpu_data.x86_phys_bits;
> -               else if (!g_phys_as)
> -                       g_phys_as = phys_as;
> +               } else {
> +                       u8 max_tdp_level = kvm_mmu_get_max_tdp_level();
> +
> +                       if (!g_phys_as)
> +                               g_phys_as = phys_as;
> +
> +                       if (max_tdp_level < 5)
> +                               g_phys_as = min(g_phys_as, 48);
> +               }
>
>                 entry->eax = g_phys_as | (virt_as << 8);
>                 entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 60f21bb4c27b..b410a227c601 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
>         return boot_cpu_data.x86_phys_bits;
>  }
>
> +u8 kvm_mmu_get_max_tdp_level(void);
> +
>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
>  void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
>  void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c844e428684..b2845f5520b3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5267,6 +5267,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
>         return max_tdp_level;
>  }
>
> +u8 kvm_mmu_get_max_tdp_level(void)
> +{
> +       return tdp_root_level ? tdp_root_level : max_tdp_level;
> +}
> +
>  static union kvm_mmu_page_role
>  kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
>                                 union kvm_cpu_role cpu_role)
>
> base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
> --