Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported

Sean Christopherson <seanjc@xxxxxxxxxx> · Wed, 20 Dec 2023 08:28:06 -0800

As evidenced by my initial response, the shortlog is a bit misleading.  In non-KVM
code it would be perfectly ok to say "limit", but in KVM's split world where
*userspace* is mostly responsible for the guest configuration, "limit guest ..."
is often going to be read as "limit the capabilities of the guest".

This is also a good example of why shortlogs and changelogs should NOT be play-by-play
descriptions of the code change.  The literal code change applies a limit to
guest physical bits, but that doesn't capture the net effect of applying said limit.

Something like

  KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported

better captures that the patch affects what KVM advertises to userspace.  Yeah,
it's technically inaccurate to say "52-bit", but I think 52-bit MAXPHYADDR is
ubiquitous enough that it's worth being technically wrong in order to clearly
communicate the net effect.  Alternatively, it could be something like:

  KVM: x86: Don't advertise incompatible MAXPHYADDR if 5-level TDP is unsupported

On Mon, Dec 18, 2023, Tao Su wrote:
> When host doesn't support 5-level EPT, bits 51:48 of the guest physical
> address must all be zero, otherwise an EPT violation always occurs and
> current handler can't resolve this if the gpa is in RAM region. Hence,
> instruction will keep being executed repeatedly, which causes infinite
> EPT violation.
> 
> Six KVM selftests are timeout due to this issue:
>     kvm:access_tracking_perf_test
>     kvm:demand_paging_test
>     kvm:dirty_log_test
>     kvm:dirty_log_perf_test
>     kvm:kvm_page_table_test
>     kvm:memslot_modification_stress_test
> 
> The above selftests add a RAM region close to max_gfn, if host has 52
> physical bits but doesn't support 5-level EPT, these will trigger infinite
> EPT violation when access the RAM region.
> 
> Since current Intel CPUID doesn't report max guest physical bits like AMD,
> introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
> enabled and report the max guest physical bits which is smaller than host.
> 
> When guest physical bits is smaller than host, some GPA are illegal from
> guest's perspective, but are still legal from hardware's perspective,
> which should be trapped to inject #PF. Current KVM already has a parameter
> allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
> host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
> can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
> is enabled and guest accesses an illegal address from guest's perspective,
> KVM will utilize EPT violation and emulate the instruction to inject #PF
> and determine #PF error code.

There is far too much unnecessary cruft in this changelog.  The entire last
paragraph is extraneous information.  Talking in detail about KVM's (flawed)
emulation of smaller guest.MAXPHYADDR isn't at all relevant as to whether or not
it's correct for KVM to advertise an impossible configuration.

And this is exactly why I dislike the "explain the symptom, then the solution"
style for KVM.  The symptoms described above don't actually explain why *KVM* is
at fault.

  KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported

  Cap the effective guest.MAXPHYADDR that KVM advertises to userspace at
  48 bits if 5-level TDP isn't supported, as bits 51:49 are consumed by the
  CPU during a page table walk if and only if 5-level TDP is enabled.  I.e.
  it's impossible for KVM to actually map GPAs with bits 51:49 set, which
  results in vCPUs getting stuck on endless EPT violations.

  From Intel's SDM:

    4-level EPT accesses at most 4 EPT paging-structure entries (an EPT page-
    walk length of 4) to translate a guest-physical address and uses only
    bits 47:0 of each guest-physical address. In contrast, 5-level EPT may
    access up to 5 EPT paging-structure entries (an EPT page-walk length of 5)
    and uses guest-physical address bits 56:0. [Physical addresses and
    guest-physical addresses are architecturally limited to 52 bits (e.g.,
    by paging), so in practice bits 56:52 are zero.]

  While it's decidedly odd for a CPU to support a 52-bit MAXPHYADDR but
  not 5-level EPT, the combination is architecturally legal and such CPUs
  do exist (and can easily be "created" with nested virtualization).

  Note, because EPT capabilities are reported via MSRs, it's impossible
  for userspace to avoid the funky setup, i.e. advertising a sane MAXPHYADDR
  is 100% KVM's responsibility.

> Reported-by: Yi Lai <yi1.lai@xxxxxxxxx>
> Signed-off-by: Tao Su <tao1.su@xxxxxxxxxxxxxxx>
> Tested-by: Yi Lai <yi1.lai@xxxxxxxxx>
> Tested-by: Xudong Hao <xudong.hao@xxxxxxxxx>
> ---
>  arch/x86/kvm/cpuid.c   | 5 +++--
>  arch/x86/kvm/mmu.h     | 1 +
>  arch/x86/kvm/mmu/mmu.c | 7 +++++++
>  3 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index dda6fc4cfae8..91933ca739ad 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1212,12 +1212,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  		 *
>  		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
>  		 * provided, use the raw bare metal MAXPHYADDR as reductions to
> -		 * the HPAs do not affect GPAs.
> +		 * the HPAs do not affect GPAs, but ensure guest MAXPHYADDR
> +		 * doesn't exceed the bits that TDP can translate.
>  		 */
>  		if (!tdp_enabled)
>  			g_phys_as = boot_cpu_data.x86_phys_bits;
>  		else if (!g_phys_as)
> -			g_phys_as = phys_as;
> +			g_phys_as = min(phys_as, kvm_mmu_tdp_maxphyaddr());

I think KVM should be paranoid and cap the advertised MAXPHYADDR even if the CPU
advertises guest.MAXPHYADDR.  Advertising a bad guest.MAXPHYADDR is arguably a
blatant CPU bug, but being paranoid is cheap in this case.

>  		entry->eax = g_phys_as | (virt_as << 8);
>  		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index bb8c86eefac0..1c7d649fcf6b 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -115,6 +115,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  				u64 fault_address, char *insn, int insn_len);
>  void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
>  					struct kvm_mmu *mmu);
> +unsigned int kvm_mmu_tdp_maxphyaddr(void);
>  
>  int kvm_mmu_load(struct kvm_vcpu *vcpu);
>  void kvm_mmu_unload(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c57e181bba21..72634d6b61b2 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
>  	reset_guest_paging_metadata(vcpu, mmu);
>  }
>  
> +/* guest-physical-address bits limited by TDP */
> +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> +{
> +	return max_tdp_level == 5 ? 57 : 48;

Using "57" is kinda sorta wrong, e.g. the SDM says:

  Bits 56:52 of each guest-physical address are necessarily zero because
  guest-physical addresses are architecturally limited to 52 bits.

Rather than split hairs over something that doesn't matter, I think it makes sense
for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
is still accurate when tdp_root_level is non-zero).

That also avoids confusion with kvm_mmu_max_gfn(), which deliberately uses the
*host* MAXPHYADDR.

Making max_tdp_level visible isn't ideal, but tdp_enabled is already visible and
used by the CPUID code, so it's not the end of the world.

> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_tdp_maxphyaddr);

This shouldn't be exported, I don't see any reason for vendor code to need access
to this helper.  It's essentially a moot point though if we avoid the helper in
the first place.

All in all, this?

---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/cpuid.c            | 16 ++++++++++++----
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7bc1daf68741..29b575b86912 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1932,6 +1932,7 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
 			     bool mask);
 
 extern bool tdp_enabled;
+extern int max_tdp_level;
 
 u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 294e5bd5f8a0..637a1f388a51 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1233,12 +1233,20 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		 *
 		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
 		 * provided, use the raw bare metal MAXPHYADDR as reductions to
-		 * the HPAs do not affect GPAs.
+		 * the HPAs do not affect GPAs.  Finally, if TDP is enabled and
+		 * doesn't support 5-level TDP, cap guest MAXPHYADDR at 48 bits
+		 * as bits 51:49 are used by the CPU if and only if 5-level TDP
+		 * is enabled, i.e. KVM can't map such GPAs with 4-level TDP.
 		 */
-		if (!tdp_enabled)
+		if (!tdp_enabled) {
 			g_phys_as = boot_cpu_data.x86_phys_bits;
-		else if (!g_phys_as)
-			g_phys_as = phys_as;
+		} else {
+			if (!g_phys_as)
+				g_phys_as = phys_as;
+
+			if (max_tdp_level < 5)
+				g_phys_as = min(g_phys_as, 48);
+		}
 
 		entry->eax = g_phys_as | (virt_as << 8);
 		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c844e428684..5036c7eb7dac 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -114,7 +114,7 @@ module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
 
 static int max_huge_page_level __read_mostly;
 static int tdp_root_level __read_mostly;
-static int max_tdp_level __read_mostly;
+int max_tdp_level __read_mostly;
 
 #define PTE_PREFETCH_NUM		8
 

base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
--