On Tue, 2019-04-23 at 08:08 -0700, Sean Christopherson wrote:
> On Mon, Apr 22, 2019 at 06:57:01PM -0700, Huang, Kai wrote:
> > On Mon, 2019-04-22 at 09:39 -0700, Sean Christopherson wrote:
> > > On Tue, Apr 16, 2019 at 09:19:48PM +1200, Kai Huang wrote:
> > > > With both Intel MKTME and AMD SME/SEV, physical address bits are reduced
> > > > due to several high bits of physical address are repurposed for memory
> > > > encryption. To honor such behavior those repurposed bits are reduced from
> > > > cpuinfo_x86->x86_phys_bits for both Intel MKTME and AMD SME/SEV, thus
> > > > boot_cpu_data.x86_phys_bits doesn't hold physical address bits reported
> > > > by CPUID anymore.
> > > 
> > > This neglects to mention the most relevant tidbit of information in terms
> > > of justification for this patch: the number of bits stolen for MKTME is
> > > programmed by BIOS, i.e. bits may be repurposed for MKTME regardless of
> > > kernel support.
> > 
> > I can add BIOS part. But the key issue is kernel adjusts
> > boot_cpu_data.x86_phys_bits, isn't it?
> > 
> > If kernel doesn't adjust boot_cpu_data.x86_phys_bits then this patch
> > theoretically is not needed?
> 
> True, but the context matters, e.g. readers might wonder why this code
> doesn't simply check a feature flag to see if MKTME is enabled.  Knowing
> that PA bits can be repurposed regardless of (full) kernel support is just
> as important as knowing that the kernel adjusts boot_cpu_data.x86_phys_bits.

Fine.

[...]

> > > > +
> > > > 
> > > >  static void mmu_spte_set(u64 *sptep, u64 spte);
> > > > 
> > > >  static union kvm_mmu_page_role
> > > > @@ -303,6 +319,34 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
> > > > 
> > > > +void kvm_set_mmio_spte_mask(void)
> > > 
> > > Moving this to mmu.c makes sense, but calling it from kvm_arch_init() is
> > > silly.  The current call site is immediately after kvm_mmu_module_init(),
> > > I don't see any reason not to just move the call there.
> > 
> > I don't know whether calling kvm_set_mmio_spte_mask() from kvm_arch_init() is
> > silly but maybe there's history -- KVM calls kvm_mmu_set_mask_ptes() from
> > kvm_arch_init() too, which may also be silly according to your judgement. I
> > have no problem calling kvm_set_mmio_spte_mask() from kvm_mmu_module_init(),
> > but IMHO this logic is irrelevant to this patch, and it's better to have a
> > separate patch for this purpose if necessary?
> 
> A separate patch would be fine, but I would do it as a prereq, i.e. move
> the function first and modify it second.  That would help review the
> functional changes.

I think I am still a little bit confused. Assuming you want to call
kvm_set_mmio_spte_mask() from kvm_mmu_module_init(), right after
kvm_mmu_reset_all_pte_masks() is called, does this change actually fix
anything related to this issue? Or what exactly are you suggesting, if I
am misunderstanding?

Btw, I think calling kvm_set_mmio_spte_mask() from kvm_arch_init() has
benefits: we could call it after kvm_x86_ops has been set up, and then do
kvm_x86_ops->xxx() inside it if needed. And we also have
kvm_mmu_set_mask_ptes(), which does a similar thing to
kvm_set_mmio_spte_mask(). I don't see why we should call
kvm_set_mmio_spte_mask() in kvm_mmu_module_init() but leave
kvm_mmu_set_mask_ptes() outside.
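That said, just to confirm I understand the prereq you are suggesting, is
it simply the call-site move below? Untested sketch, with everything
except the two calls trimmed, and kvm_set_mmio_spte_mask() itself kept
unchanged:

int kvm_mmu_module_init(void)
{
	kvm_mmu_reset_all_pte_masks();

	/* Moved here from kvm_arch_init() as a separate prereq patch. */
	kvm_set_mmio_spte_mask();

	/* ... rest of the existing init (caches, shrinker, etc.) unchanged ... */
	return 0;
}

If that is all you mean then I have no problem doing it as a prereq.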
[...]

> > > {MK}TME provides a CPUID feature bit and MSRs to query support, and this
> > > code only runs at module init so the overhead of a RDMSR is negligible.
> > 
> > Yes can check against CPUID feature bit but not sure why it is better
> > than CPU vendor.
> 
> It helps readers understand the code flow, i.e. it's a mental cue that the
> bit stealing only applies to MKTME.  Checking for "Intel" could easily be
> misinterpreted as "x86_phys_bits isn't accurate for Intel CPUs."

OK. Fine with me.

> > > > +		shadow_first_rsvd_bit = boot_cpu_data.x86_phys_bits;
> > > > +	else
> > > > +		shadow_first_rsvd_bit = kvm_get_cpuid_phys_bits();
> > > 
> > > If you rename the helper to be less specific and actually check for MKTME
> > > support then the MKTME comment doesn't need to be as verbose, e.g.:
> > > 
> > > static u8 kvm_get_shadow_phys_bits(void)
> > > {
> > > 	if (!<has MKTME> ||
> > > 	    WARN_ON_ONCE(boot_cpu_data.extended_cpuid_level < 0x80000008))
> > > 		return boot_cpu_data.x86_phys_bits;
> > 
> > Why do we need WARN_ON_ONCE here?
> 
> Because KVM would essentially be consuming known bad data since we've
> already established that 'x86_phys_bits' is wrong when MKTME is enabled.
> I.e. we should never encounter a platform with MKTME but not CPUID leaf
> 0x80000008.

Can we assume the compiler won't evaluate the WARN_ON_ONCE() part before
checking !<has MKTME>?

> > I don't have problem using as you suggested, but I don't get why checking
> > against CPU vendor is last resort?
> 
> A few reasons off the top of my head:
> 
>  - CPU vendor is less precise, e.g. by checking for MKTME support KVM can
>    WARN if it's consuming known bad data.
> 
>  - Checking for features makes the code self-documenting to some extent.
> 
>  - Multiple vendors may support a feature, now or in the future.  E.g. the
>    Hygon case is a great example.

OK.

> > > 
> > > 	/*
> > > 	 * MKTME steals physical address bits for key IDs, but the key ID bits
> > > 	 * are not treated as reserved.  x86_phys_bits is adjusted to account
> > > 	 * for the stolen bits, use CPUID.MAX_PA_WIDTH directly which reports
> > > 	 * the number of software-available bits irrespective of MKTME.
> > > 	 */
> > > 	return cpuid_eax(0x80000008) & 0xff;
> > > }
> > > 
> > > > +
> > > > +	/*
> > > > +	 * Only Intel is impacted by L1TF, therefore for AMD and other x86
> > > > +	 * vendors L1TF mitigation is not needed.
> > > > +	 *
> > > > +	 * For Intel CPU, if it has 46 or less physical address bits, then set
> > > > +	 * an appropriate mask to guard against L1TF attacks. Otherwise, it is
> > > >  	 * assumed that the CPU is not vulnerable to L1TF.
> > > >  	 */
> > > > +	if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
> > > 
> > > Again, checking vendors is bad.  Not employing a mitigation technique due
> > > to an assumption about what processors are affected by a vulnerability is
> > > even worse.
> > 
> > Why, isn't it a fact that L1TF only impacts Intel CPUs? What's the benefit of
> > employing a mitigation to CPUs that don't have L1TF issue? Such mitigation
> > only makes performance worse, even not noticeable. For example, why not
> > employing such mitigation regardless to physical address bits at all?
> 
> This particular mitigation has minimal effect on performance, e.g. adds
> a few SHL/SHR/AND/OR operations, and that minimal overhead is incurred
> regardless of whether or not reserved bits are set.  I.e. KVM is paying
> the penalty no matter what, so being extra paranoid is "free".

> > > The existing code makes the assumption regarding processors
> > > with >46 bits of address space because a) no such processor existed before
> > > the discovery of L1TF, and it's reasonable to assume hardware vendors
> > > won't ship future processors with such an obvious vulnerability, and b)
> > > hardcoding the number of reserved bits to set simplifies the code.
> > 
> > Yes, but we cannot simply use 'shadow_phys_bits' to check against 46 anymore,
> > right?
> > 
> > For example, if AMD has 52 phys bits, but it reduces 5 or more bits, then
> > current KVM code would employ l1tf mitigation, but actually it really
> > shouldn't?
> 
> What do you mean by "shouldn't"?  Sure, it's not absolutely necessary, but
> again setting the bits is free since adjusting the GPA is hardcoded into
> mark_mmio_spte() and get_mmio_spte_gfn().

OK. I am not sure whether the CPU behaves any differently for x << 5 vs.
x << 0, but this is really trivial, so I will do what you suggested.

Also, I found there's a (software) CPU bug flag, X86_BUG_L1TF. Do you think
we should check boot_cpu_has(X86_BUG_L1TF), and not set up
shadow_nonpresent_or_rsvd_mask if it isn't set?
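I.e. something like below (untested; reusing the need_l1tf and
shadow_first_rsvd_bit names from the current patch, and using
boot_cpu_has_bug(), which appears to be the usual accessor for the bug
bits):

	if (boot_cpu_has_bug(X86_BUG_L1TF) &&
	    (shadow_first_rsvd_bit <
	     52 - shadow_nonpresent_or_rsvd_mask_len))
		need_l1tf = true;
	else
		need_l1tf = false;

That way the mask setup is skipped entirely on parts that are not
affected, while staying unconditional on anything that is.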
> > > > +	    (shadow_first_rsvd_bit <
> > > > +	     52 - shadow_nonpresent_or_rsvd_mask_len))
> > > > +		need_l1tf = true;
> > > > +	else
> > > > +		need_l1tf = false;
> > > > +
> > > >  	low_phys_bits = boot_cpu_data.x86_phys_bits;
> > > > -	if (boot_cpu_data.x86_phys_bits <
> > > > -	    52 - shadow_nonpresent_or_rsvd_mask_len) {
> > > > +	shadow_nonpresent_or_rsvd_mask = 0;
> > > > +	if (need_l1tf) {
> > > >  		shadow_nonpresent_or_rsvd_mask =
> > > >  			rsvd_bits(boot_cpu_data.x86_phys_bits -
> > > >  				  shadow_nonpresent_or_rsvd_mask_len,
> > > 
> > > This is broken, the reserved bits mask is being calculated with the wrong
> > > number of physical bits.  I think fixing this would eliminate the need for
> > > the high_gpa_offset shenanigans.
> > 
> > You are right. Should use 'shadow_phys_bits' instead. Thanks. Let me think
> > whether high_gpa_offset can be avoided.
> > 
> > > > @@ -4326,7 +4422,7 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu,
> > > >  static void
> > > >  __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> > > >  			struct rsvd_bits_validate *rsvd_check,
> > > > -			int maxphyaddr, int level, bool nx, bool gbpages,
> > > > +			int first_rsvd_bit, int level, bool nx, bool gbpages,
> > > >  			bool pse, bool amd)
> > > 
> > > Similar to the earlier comment regarding 'first', it's probably less
> > > confusing overall to just leave this as 'maxphyaddr'.
> > 
> > Both work for me. But maybe 'non_rsvd_maxphyaddr' is better?
> 
> Maybe?  My personal preference would be to stay with maxphyaddr.

Fine, I'll stay with maxphyaddr.

Btw, which one do you think is better: shadow_phys_bits, or, more
explicitly, shadow_non_rsvd_phys_bits?

Thanks,

-Kai
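P.S. For the "<has MKTME>" check in kvm_get_shadow_phys_bits() I am
thinking of simply keying off the existing TME CPUID feature flag, roughly
like below (untested; assuming X86_FEATURE_TME is a good enough proxy for
"PA bits may have been repurposed for key IDs"):

static u8 kvm_get_shadow_phys_bits(void)
{
	if (!boot_cpu_has(X86_FEATURE_TME) ||
	    WARN_ON_ONCE(boot_cpu_data.extended_cpuid_level < 0x80000008))
		return boot_cpu_data.x86_phys_bits;

	/*
	 * MKTME steals physical address bits for key IDs, but the key ID bits
	 * are not treated as reserved.  x86_phys_bits is adjusted to account
	 * for the stolen bits, so use CPUID.MAX_PA_WIDTH directly, which
	 * reports the number of software-available bits irrespective of MKTME.
	 */
	return cpuid_eax(0x80000008) & 0xff;
}

Please let me know if you would rather see a stricter check, e.g. actually
querying the MKTME activation MSR.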