Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported

Tao Su <tao1.su@xxxxxxxxxxxxxxx> · Thu, 21 Dec 2023 15:45:15 +0800

On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> As evidenced by my initial response, the shortlog is a bit misleading.  In non-KVM
> code it would be perfectly ok to say "limit", but in KVM's split world where
> *userspace* is mostly responsible for the guest configuration, "limit guest ..."
> is often going to be read as "limit the capabilities of the guest".
> 
> This is also a good example of why shortlogs and changelogs should NOT be play-by-play
> descriptions of the code change.  The literal code change applies a limit to
> guest physical bits, but that doesn't capture the net effect of applying said limit.
> 
> Something like
> 
>   KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported
> 
> better captures that the patch affects what KVM advertises to userspace.  Yeah,
> it's technically inaccurate to say "52-bit", but I think 52-bit MAXPHYADDR is
> ubiquitous enough that it's worth being technically wrong in order to clearly
> communicate the net effect.  Alternatively, it could be something like:
> 
>   KVM: x86: Don't advertise incompatible MAXPHYADDR if 5-level TDP is unsupported

Yeah, these two shortlogs are more accurate and intuitive.

> 
> On Mon, Dec 18, 2023, Tao Su wrote:
> > When host doesn't support 5-level EPT, bits 51:48 of the guest physical
> > address must all be zero, otherwise an EPT violation always occurs and
> > current handler can't resolve this if the gpa is in RAM region. Hence,
> > instruction will keep being executed repeatedly, which causes infinite
> > EPT violation.
> > 
> > Six KVM selftests are timeout due to this issue:
> >     kvm:access_tracking_perf_test
> >     kvm:demand_paging_test
> >     kvm:dirty_log_test
> >     kvm:dirty_log_perf_test
> >     kvm:kvm_page_table_test
> >     kvm:memslot_modification_stress_test
> > 
> > The above selftests add a RAM region close to max_gfn, if host has 52
> > physical bits but doesn't support 5-level EPT, these will trigger infinite
> > EPT violation when access the RAM region.
> > 
> > Since current Intel CPUID doesn't report max guest physical bits like AMD,
> > introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
> > enabled and report the max guest physical bits which is smaller than host.
> > 
> > When guest physical bits is smaller than host, some GPA are illegal from
> > guest's perspective, but are still legal from hardware's perspective,
> > which should be trapped to inject #PF. Current KVM already has a parameter
> > allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
> > host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
> > can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
> > is enabled and guest accesses an illegal address from guest's perspective,
> > KVM will utilize EPT violation and emulate the instruction to inject #PF
> > and determine #PF error code.
> 
> There is far too much unnecessary cruft in this changelog.  The entire last
> paragraph is extraneous information.  Talking in detail about KVM's (flawed)
> emulation of smaller guest.MAXPHYADDR isn't at all relevant as to whether or not
> it's correct for KVM to advertise an impossible configuration.
> 
> And this is exactly why I dislike the "explain the symptom, then the solution"
> style for KVM.  The symptoms described above don't actually explain why *KVM* is
> at fault.
> 
>   KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported
> 
>   Cap the effective guest.MAXPHYADDR that KVM advertises to userspace at
>   48 bits if 5-level TDP isn't supported, as bits 51:49 are consumed by the
>   CPU during a page table walk if and only if 5-level TDP is enabled.  I.e.
>   it's impossible for KVM to actually map GPAs with bits 51:49 set, which
>   results in vCPUs getting stuck on endless EPT violations.
> 
>   From Intel's SDM:
> 
>     4-level EPT accesses at most 4 EPT paging-structure entries (an EPT page-
>     walk length of 4) to translate a guest-physical address and uses only
>     bits 47:0 of each guest-physical address. In contrast, 5-level EPT may
>     access up to 5 EPT paging-structure entries (an EPT page-walk length of 5)
>     and uses guest-physical address bits 56:0. [Physical addresses and
>     guest-physical addresses are architecturally limited to 52 bits (e.g.,
>     by paging), so in practice bits 56:52 are zero.]
> 
>   While it's decidedly odd for a CPU to support a 52-bit MAXPHYADDR but
>   not 5-level EPT, the combination is architecturally legal and such CPUs
>   do exist (and can easily be "created" with nested virtualization).
> 
>   Note, because EPT capabilities are reported via MSRs, it's impossible
>   for userspace to avoid the funky setup, i.e. advertising a sane MAXPHYADDR
>   is 100% KVM's responsibility.

Yes, I learn a lot, I will use the above shortlogs and changelogs as my
next version.

> 
> > Reported-by: Yi Lai <yi1.lai@xxxxxxxxx>
> > Signed-off-by: Tao Su <tao1.su@xxxxxxxxxxxxxxx>
> > Tested-by: Yi Lai <yi1.lai@xxxxxxxxx>
> > Tested-by: Xudong Hao <xudong.hao@xxxxxxxxx>
> > ---
> >  arch/x86/kvm/cpuid.c   | 5 +++--
> >  arch/x86/kvm/mmu.h     | 1 +
> >  arch/x86/kvm/mmu/mmu.c | 7 +++++++
> >  3 files changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index dda6fc4cfae8..91933ca739ad 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -1212,12 +1212,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> >  		 *
> >  		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
> >  		 * provided, use the raw bare metal MAXPHYADDR as reductions to
> > -		 * the HPAs do not affect GPAs.
> > +		 * the HPAs do not affect GPAs, but ensure guest MAXPHYADDR
> > +		 * doesn't exceed the bits that TDP can translate.
> >  		 */
> >  		if (!tdp_enabled)
> >  			g_phys_as = boot_cpu_data.x86_phys_bits;
> >  		else if (!g_phys_as)
> > -			g_phys_as = phys_as;
> > +			g_phys_as = min(phys_as, kvm_mmu_tdp_maxphyaddr());
> 
> I think KVM should be paranoid and cap the advertised MAXPHYADDR even if the CPU
> advertises guest.MAXPHYADDR.  Advertising a bad guest.MAXPHYADDR is arguably a
> blatant CPU bug, but being paranoid is cheap in this case.

Yes, agree.

> 
> >  		entry->eax = g_phys_as | (virt_as << 8);
> >  		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index bb8c86eefac0..1c7d649fcf6b 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -115,6 +115,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> >  				u64 fault_address, char *insn, int insn_len);
> >  void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> >  					struct kvm_mmu *mmu);
> > +unsigned int kvm_mmu_tdp_maxphyaddr(void);
> >  
> >  int kvm_mmu_load(struct kvm_vcpu *vcpu);
> >  void kvm_mmu_unload(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index c57e181bba21..72634d6b61b2 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> >  	reset_guest_paging_metadata(vcpu, mmu);
> >  }
> >  
> > +/* guest-physical-address bits limited by TDP */
> > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > +{
> > +	return max_tdp_level == 5 ? 57 : 48;
> 
> Using "57" is kinda sorta wrong, e.g. the SDM says:
> 
>   Bits 56:52 of each guest-physical address are necessarily zero because
>   guest-physical addresses are architecturally limited to 52 bits.
> 
> Rather than split hairs over something that doesn't matter, I think it makes sense
> for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> is still accurate when tdp_root_level is non-zero).
> 
> That also avoids confusion with kvm_mmu_max_gfn(), which deliberately uses the
> *host* MAXPHYADDR.
> 
> Making max_tdp_level visible isn't ideal, but tdp_enabled is already visible and
> used by the CPUID code, so it's not the end of the world.

Yes, agree.

> 
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_mmu_tdp_maxphyaddr);
> 
> This shouldn't be exported, I don't see any reason for vendor code to need access
> to this helper.  It's essentially a moot point though if we avoid the helper in
> the first place.

Yes, previously I want to invoke this helper in the patch2, now it is no
longer needed.

> 
> All in all, this?

The code is perfect, I will add it to my next version, thanks!

> 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/cpuid.c            | 16 ++++++++++++----
>  arch/x86/kvm/mmu/mmu.c          |  2 +-
>  3 files changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7bc1daf68741..29b575b86912 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1932,6 +1932,7 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
>  			     bool mask);
>  
>  extern bool tdp_enabled;
> +extern int max_tdp_level;
>  
>  u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
>  
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 294e5bd5f8a0..637a1f388a51 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1233,12 +1233,20 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  		 *
>  		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
>  		 * provided, use the raw bare metal MAXPHYADDR as reductions to
> -		 * the HPAs do not affect GPAs.
> +		 * the HPAs do not affect GPAs.  Finally, if TDP is enabled and
> +		 * doesn't support 5-level TDP, cap guest MAXPHYADDR at 48 bits
> +		 * as bits 51:49 are used by the CPU if and only if 5-level TDP

Nit, it should be bits 51:48?

Thanks,
Tao

> +		 * is enabled, i.e. KVM can't map such GPAs with 4-level TDP.
>  		 */
> -		if (!tdp_enabled)
> +		if (!tdp_enabled) {
>  			g_phys_as = boot_cpu_data.x86_phys_bits;
> -		else if (!g_phys_as)
> -			g_phys_as = phys_as;
> +		} else {
> +			if (!g_phys_as)
> +				g_phys_as = phys_as;
> +
> +			if (max_tdp_level < 5)
> +				g_phys_as = min(g_phys_as, 48);
> +		}
>  
>  		entry->eax = g_phys_as | (virt_as << 8);
>  		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c844e428684..5036c7eb7dac 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -114,7 +114,7 @@ module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
>  
>  static int max_huge_page_level __read_mostly;
>  static int tdp_root_level __read_mostly;
> -static int max_tdp_level __read_mostly;
> +int max_tdp_level __read_mostly;
>  
>  #define PTE_PREFETCH_NUM		8
>  
> 
> base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
> -- 
>