Re: [PATCH] kvm: x86: Fix several SPTE mask calculation errors caused by MKTME

Sean Christopherson <sean.j.christopherson@xxxxxxxxx> · Tue, 23 Apr 2019 08:08:08 -0700

On Mon, Apr 22, 2019 at 06:57:01PM -0700, Huang, Kai wrote:
> On Mon, 2019-04-22 at 09:39 -0700, Sean Christopherson wrote:
> > On Tue, Apr 16, 2019 at 09:19:48PM +1200, Kai Huang wrote:
> > > With both Intel MKTME and AMD SME/SEV, physical address bits are reduced
> > > because several high bits of the physical address are repurposed for memory
> > > encryption. To honor this, the repurposed bits are subtracted from
> > > cpuinfo_x86->x86_phys_bits for both Intel MKTME and AMD SME/SEV, so
> > > boot_cpu_data.x86_phys_bits no longer holds the physical address bits
> > > reported by CPUID.
> > 
> > This neglects to mention the most relevant tidbit of information in terms
> > of justification for this patch: the number of bits stolen for MKTME is
> > programmed by BIOS, i.e. bits may be repurposed for MKTME regardless of
> > kernel support.
> 
> I can add the BIOS part. But the key issue is that the kernel adjusts
> boot_cpu_data.x86_phys_bits, isn't it?
> 
> If the kernel didn't adjust boot_cpu_data.x86_phys_bits, then this patch
> theoretically would not be needed?

True, but the context matters, e.g. readers might wonder why this code
doesn't simply check a feature flag to see if MKTME is enabled.  Knowing
that PA bits can be repurposed regardless of (full) kernel support is just
as important as knowing that the kernel adjusts boot_cpu_data.x86_phys_bits.

...

> > >  arch/x86/kvm/mmu.c | 152 +++++++++++++++++++++++++++++++++++++++++++----------
> > >  arch/x86/kvm/mmu.h |   1 +
> > >  arch/x86/kvm/x86.c |  29 ----------
> > >  3 files changed, 125 insertions(+), 57 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > index e10962dfc203..3add7283980e 100644
> > > --- a/arch/x86/kvm/mmu.c
> > > +++ b/arch/x86/kvm/mmu.c
> > > @@ -261,6 +261,22 @@ static const u64 shadow_nonpresent_or_rsvd_mask_len = 5;
> > >   */
> > >  static u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
> > >  
> > > +/*
> > > + * The position of first reserved bit. If it equals to 52, then CPU doesn't
> > > + * have any reserved bits, otherwise bits [shadow_first_rsvd_bit, 51] are
> > > + * reserved bits.
> > > + *
> > > + * boot_cpu_data.x86_phys_bits cannot be used to determine the position of
> > > + * reserved bits anymore, since both Intel MKTME and AMD SME/SEV reduce
> > > + * physical address bits, and the reduced bits are taken away from
> > > + * boot_cpu_data.x86_phys_bits to reflect such fact. But Intel MKTME and AMD
> > > + * SME/SEV treat those reduced bits differently -- Intel MKTME treats them
> > > + * as 'keyID' thus not reserved bits, but AMD SME/SEV treats them as reserved
> > > + * bits, thus physical address bits reported by CPUID cannot be used to
> > > + * determine reserved bits position either.
> > > + */
> > 
> > No need to rehash the justification for the variable, it's sufficient to
> > state *what* the variable tracks.  And the whole first paragraph can be
> > dropped if the variable is renamed.
> > 
> > > +static u64 __read_mostly shadow_first_rsvd_bit;
> > 
> > Hmm, 'first' is technically incorrect since EPT has reserved attribute bits
> > on non-leaf entries.  And 'first' is vague, e.g. it could be interpreted as
> > the MSB or LSB.
> > 
> > In the end, we're still tracking the number of physical address bits, so
> > maybe something like 'shadow_phys_bits'?
> > 
> > E.g. putting it together:
> > 
> > /*
> >  * The number of non-reserved physical address bits irrespective of features
> >  * that repurpose software-accessible bits, e.g. MKTME.
> >  */
> > static u64 __read_mostly shadow_phys_bits;
> 
> Fine to me. Will do what you suggested.
> 
> > 
> > > +
> > >  
> > >  static void mmu_spte_set(u64 *sptep, u64 spte);
> > >  static union kvm_mmu_page_role
> > > @@ -303,6 +319,34 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value)
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
> > >  
> > > +void kvm_set_mmio_spte_mask(void)
> > 
> > Moving this to mmu.c makes sense, but calling it from kvm_arch_init() is
> > silly.  The current call site is immediately after kvm_mmu_module_init(),
> > I don't see any reason not to just move the call there.
> 
> I don't know whether calling kvm_set_mmio_spte_mask() from kvm_arch_init() is
> silly, but maybe there's history -- KVM calls kvm_mmu_set_mask_ptes() from
> kvm_arch_init() too, which may also be silly according to your judgement. I
> have no problem calling kvm_set_mmio_spte_mask() from kvm_mmu_module_init(),
> but IMHO this logic is irrelevant to this patch, and it's better to have a
> separate patch for this purpose if necessary?

A separate patch would be fine, but I would do it as a prereq, i.e. move
the function first and modify it second.  That would help review the
functional changes.

> > > +{
> > > +	u64 mask;
> > > +
> > > +	/*
> > > +	 * Set the reserved bits and the present bit of an paging-structure
> > > +	 * entry to generate page fault with PFER.RSV = 1.
> > > +	 */
> > > +
> > > +	/*
> > > +	 * Mask the uppermost physical address bit, which would be reserved as
> > > +	 * long as the supported physical address width is less than 52.
> > > +	 */
> > > +	mask = 1ull << 51;
> > > +
> > > +	/* Set the present bit. */
> > > +	mask |= 1ull;
> > > +
> > > +	/*
> > > +	 * If reserved bit is not supported, clear the present bit to disable
> > > +	 * mmio page fault.
> > > +	 */
> > > +	if (IS_ENABLED(CONFIG_X86_64) && shadow_first_rsvd_bit == 52)
> > > +		mask &= ~1ull;
> > > +
> > > +	kvm_mmu_set_mmio_spte_mask(mask, mask);
> > > +}
> > > +
> > >  static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
> > >  {
> > >  	return sp->role.ad_disabled;
> > > @@ -384,12 +428,21 @@ static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
> > >  	u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK;
> > >  	u64 mask = generation_mmio_spte_mask(gen);
> > >  	u64 gpa = gfn << PAGE_SHIFT;
> > > +	u64 high_gpa_offset;
> > >  
> > >  	access &= ACC_WRITE_MASK | ACC_USER_MASK;
> > >  	mask |= shadow_mmio_value | access;
> > >  	mask |= gpa | shadow_nonpresent_or_rsvd_mask;
> > > +	/*
> > > +	 * With Intel MKTME, the bits from boot_cpu_data.x86_phys_bits to
> > > +	 * shadow_first_rsvd_bit - 1 are actually keyID bits but not
> > > +	 * reserved bits. We need to put high GPA bits to actual reserved
> > > +	 * bits to mitigate L1TF attack.
> > > +	 */
> > > +	high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len +
> > > +		shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits;
> > >  	mask |= (gpa & shadow_nonpresent_or_rsvd_mask)
> > > -		<< shadow_nonpresent_or_rsvd_mask_len;
> > > +		<< high_gpa_offset;
> > >  
> > >  	page_header(__pa(sptep))->mmio_cached = true;
> > >  
> > > @@ -405,8 +458,11 @@ static bool is_mmio_spte(u64 spte)
> > >  static gfn_t get_mmio_spte_gfn(u64 spte)
> > >  {
> > >  	u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask;
> > > +	/* See comments in mark_mmio_spte */
> > > +	u64 high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len +
> > > +		shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits;
> > 
> > The exact shift needed is constant after init, there should be no need to
> > dynamically calculate it for every MMIO SPTE.
> 
> I can add one more static variable such as 'shadow_high_gfn_offset' and calculate it in
> kvm_mmu_reset_all_pte_masks (if necessary, as you mentioned below)? Does that work for you?

I'm pretty sure we can use shadow_phys_bits directly once the bug in
kvm_mmu_reset_all_pte_masks() is fixed.
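
To make the arithmetic concrete, here is a minimal user-space sketch of how
the L1TF mask position depends on which bit count is fed in.  This is not the
kernel code: rsvd_bits() is only modeled on the KVM helper of the same name,
nonpresent_or_rsvd_mask() is a hypothetical wrapper, and the numbers in the
note below (CPUID reporting 46 bits, firmware stealing 6 keyID bits) are
assumptions picked purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define RSVD_MASK_LEN 5	/* shadow_nonpresent_or_rsvd_mask_len in mmu.c */

/* Modeled on KVM's rsvd_bits(): mask with bits s..e (inclusive) set. */
static uint64_t rsvd_bits(int s, int e)
{
	assert(s >= 0 && s <= e && e - s + 1 < 64);
	return ((1ULL << (e - s + 1)) - 1) << s;
}

/*
 * Hypothetical wrapper: the mask covering the top RSVD_MASK_LEN bits below
 * 'phys_bits', or 0 when fewer than RSVD_MASK_LEN bits are left below bit 52.
 */
static uint64_t nonpresent_or_rsvd_mask(int phys_bits)
{
	if (phys_bits >= 52 - RSVD_MASK_LEN)
		return 0;

	return rsvd_bits(phys_bits - RSVD_MASK_LEN, phys_bits - 1);
}
```

With those assumed numbers, nonpresent_or_rsvd_mask(40) covers bits 39:35
while nonpresent_or_rsvd_mask(46) covers bits 45:41, i.e. the two candidate
inputs select different bits.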

> > 
> > > -	gpa |= (spte >> shadow_nonpresent_or_rsvd_mask_len)
> > > +	gpa |= (spte >> high_gpa_offset)
> > >  	       & shadow_nonpresent_or_rsvd_mask;
> > >  
> > >  	return gpa >> PAGE_SHIFT;
> > > @@ -470,9 +526,22 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
> > >  
> > > +static u8 kvm_get_cpuid_phys_bits(void)
> > > +{
> > > +	u32 eax, ebx, ecx, edx;
> > > +
> > > +	if (boot_cpu_data.extended_cpuid_level < 0x80000008)
> > > +		return boot_cpu_data.x86_phys_bits;
> > > +
> > > +	cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
> > > +
> > > +	return eax & 0xff;
> > 
> > cpuid_eax() will do most of the work for you.
> 
> Sure, will do. Thanks.
> 
> > 
> > > +}
> > > +
> > >  static void kvm_mmu_reset_all_pte_masks(void)
> > >  {
> > >  	u8 low_phys_bits;
> > > +	bool need_l1tf;
> > >  
> > >  	shadow_user_mask = 0;
> > >  	shadow_accessed_mask = 0;
> > > @@ -484,13 +553,40 @@ static void kvm_mmu_reset_all_pte_masks(void)
> > >  	shadow_acc_track_mask = 0;
> > >  
> > >  	/*
> > > -	 * If the CPU has 46 or less physical address bits, then set an
> > > -	 * appropriate mask to guard against L1TF attacks. Otherwise, it is
> > > +	 * Calcualte the first reserved bit position. Although both Intel
> > > +	 * MKTME and AMD SME/SEV reduce physical address bits for memory
> > > +	 * encryption (and boot_cpu_data.x86_phys_bits is reduced to reflect
> > > +	 * such fact), they treat those reduced bits differently -- Intel
> > > +	 * MKTME treats those as 'keyID' thus not reserved bits, but AMD
> > > +	 * SME/SEV still treats those bits as reserved bits, so for AMD
> > > +	 * shadow_first_rsvd_bit is boot_cpu_data.x86_phys_bits, but for
> > > +	 * Intel (and other x86 vendors that don't support memory encryption
> > > +	 * at all), shadow_first_rsvd_bit is physical address bits reported
> > > +	 * by CPUID.
> > > +	 */
> > > +	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
> > 
> > Checking for a specific vendor should only be done as a last resort, e.g.
> > I believe this will break Hygon CPUs w/ SME/SEV.  
> 
> This can be easily adjusted in the future. I can add Hygon check to this patch.
> 
> > > E.g. the existing
> > > "is AMD" check in mmu.c looks at shadow_x_mask, not a CPUID vendor.
> 
> shadow_x_mask hasn't been calculated here. It is calculated later in kvm_mmu_set_mask_ptes(), which
> is called from kvm_arch_init() after kvm_x86_ops is set up.
> 
> And does this justify "using CPU vendor check should be as last resort"? There is other kernel code
> that checks CPU vendors too.
> 
> 
> > {MK}TME provides a CPUID feature bit and MSRs to query support, and this
> > code only runs at module init so the overhead of a RDMSR is negligible. 
> 
> Yes, I can check against the CPUID feature bit, but I'm not sure why it is better than the CPU vendor.

It helps readers understand the code flow, i.e. it's a mental cue that the
bit stealing only applies to MKTME.  Checking for "Intel" could easily be
misinterpreted as "x86_phys_bits isn't accurate for Intel CPUs."

> > > +		shadow_first_rsvd_bit = boot_cpu_data.x86_phys_bits;
> > > +	else
> > > +		shadow_first_rsvd_bit = kvm_get_cpuid_phys_bits();
> > 
> > If you rename the helper to be less specific and actually check for MKTME
> > support then the MKTME comment doesn't need to be as verbose, e.g.: 
> > 
> > static u8 kvm_get_shadow_phys_bits(void)
> > {
> > 	if (!<has MKTME> ||
> > 	    WARN_ON_ONCE(boot_cpu_data.extended_cpuid_level < 0x80000008))
> > 		return boot_cpu_data.x86_phys_bits;
> 
> Why do we need WARN_ON_ONCE here?

Because KVM would essentially be consuming known bad data since we've
already established that 'x86_phys_bits' is wrong when MKTME is enabled.
I.e. we should never encounter a platform with MKTME but not CPUID leaf
0x80000008.
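
For illustration, the decision flow can be modeled in user space as below.
This is only a sketch: struct cpu_model, its fields and
model_shadow_phys_bits() are hypothetical stand-ins for boot_cpu_data, the
as-yet-unspecified MKTME check and the raw CPUID read, and the 'warned' flag
stands in for WARN_ON_ONCE() firing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the CPU state the helper would consult. */
struct cpu_model {
	bool has_mktme;			/* assumption: result of some MKTME check */
	uint32_t extended_cpuid_level;	/* highest extended CPUID leaf */
	uint8_t x86_phys_bits;		/* already reduced by stolen bits */
	uint32_t cpuid_80000008_eax;	/* raw EAX of CPUID leaf 0x80000008 */
};

static uint8_t model_shadow_phys_bits(const struct cpu_model *c, bool *warned)
{
	assert(c != NULL && warned != NULL);
	*warned = false;

	/* No MKTME: x86_phys_bits is used as-is. */
	if (!c->has_mktme)
		return c->x86_phys_bits;

	/*
	 * MKTME without leaf 0x80000008 should never happen; flag it, since
	 * the fallback value is known to exclude the keyID bits.
	 */
	if (c->extended_cpuid_level < 0x80000008u) {
		*warned = true;
		return c->x86_phys_bits;
	}

	/* EAX[7:0] of leaf 0x80000008 is the physical address width. */
	return c->cpuid_80000008_eax & 0xff;
}
```

The "warned" path is the only one where a value known to be wrong gets
returned, which is why it deserves a WARN rather than a silent fallback.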

> I don't have a problem doing it as you suggested, but I don't get why checking
> against the CPU vendor is a last resort?

A few reasons off the top of my head:

  - CPU vendor is less precise, e.g. by checking for MKTME support KVM can
    WARN if it's consuming known bad data. 

  - Checking for features makes the code self-documenting to some extent.

  - Multiple vendors may support a feature, now or in the future.  E.g. the
    Hygon case is a great example.

> > 
> > 	/*
> > 	 * MKTME steals physical address bits for key IDs, but the key ID bits
> > 	 * are not treated as reserved.  x86_phys_bits is adjusted to account
> > 	 * for the stolen bits, use CPUID.MAX_PA_WIDTH directly which reports
> > 	 * the number of software-available bits irrespective of MKTME.
> > 	 */
> > 	return cpuid_eax(0x80000008) & 0xff;
> > }
> > 
> > > +
> > > +	/*
> > > +	 * Only Intel is impacted by L1TF, therefore for AMD and other x86
> > > +	 * vendors L1TF mitigation is not needed.
> > > +	 *
> > > +	 * For Intel CPU, if it has 46 or less physical address bits, then set
> > > +	 * an appropriate mask to guard against L1TF attacks. Otherwise, it is
> > >  	 * assumed that the CPU is not vulnerable to L1TF.
> > >  	 */
> > > +	if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
> > 
> > Again, checking vendors is bad.  Not employing a mitigation technique due
> > to an assumption about what processors are affected by a vulnerability is
> > even worse.  
> 
> Why, isn't it a fact that L1TF only impacts Intel CPUs? What's the benefit of
> employing a mitigation on CPUs that don't have the L1TF issue? Such mitigation
> only makes performance worse, even if not noticeably. For example, why not
> employ such mitigation regardless of physical address bits at all?

This particular mitigation has minimal effect on performance, e.g. adds
a few SHL/SHR/AND/OR operations, and that minimal overhead is incurred
regardless of whether or not reserved bits are set.  I.e. KVM is paying
the penalty no matter what, so being extra paranoid is "free".
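
As a rough user-space model of that overhead (not the kernel code), the
sketch below mirrors the GPA handling in mark_mmio_spte() and
get_mmio_spte_gfn() from the existing code, with the generation and access
bits left out.  struct mmio_masks, bit_range() and the mmio_* function names
are hypothetical, and the model assumes the GPA fits below phys_bits.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define RSVD_MASK_LEN	5	/* shadow_nonpresent_or_rsvd_mask_len */

/* Hypothetical container for the two masks computed once at init. */
struct mmio_masks {
	uint64_t rsvd_mask;	 /* top RSVD_MASK_LEN bits below phys_bits */
	uint64_t lower_gfn_mask; /* GPA bits from PAGE_SHIFT up to rsvd_mask */
};

/* Mask with bits lo..hi (inclusive) set. */
static uint64_t bit_range(int lo, int hi)
{
	assert(lo >= 0 && lo <= hi && hi < 63);
	return ((1ULL << (hi - lo + 1)) - 1) << lo;
}

static struct mmio_masks mmio_masks_init(int phys_bits)
{
	struct mmio_masks m;
	int low_phys_bits = phys_bits - RSVD_MASK_LEN;

	m.rsvd_mask = bit_range(low_phys_bits, phys_bits - 1);
	m.lower_gfn_mask = bit_range(PAGE_SHIFT, low_phys_bits - 1);
	return m;
}

/* Set every bit in rsvd_mask and park the displaced GPA bits above it. */
static uint64_t mmio_spte_encode(const struct mmio_masks *m, uint64_t gfn)
{
	uint64_t gpa = gfn << PAGE_SHIFT;
	uint64_t spte = gpa | m->rsvd_mask;

	spte |= (gpa & m->rsvd_mask) << RSVD_MASK_LEN;
	return spte;
}

/* Take the low GPA bits as-is and shift the parked bits back down. */
static uint64_t mmio_spte_decode(const struct mmio_masks *m, uint64_t spte)
{
	uint64_t gpa = spte & m->lower_gfn_mask;

	gpa |= (spte >> RSVD_MASK_LEN) & m->rsvd_mask;
	return gpa >> PAGE_SHIFT;
}
```

The encode and decode paths execute the same handful of shifts and logical
ops whether or not rsvd_mask is zero, which is the sense in which the
mitigation costs nothing extra.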

> > The existing code makes the assumption regarding processors
> > with >46 bits of address space because a) no such processor existed before
> > the discovery of L1TF, and it's reasonable to assume hardware vendors
> > won't ship future processors with such an obvious vulnerability, and b)
> > hardcoding the number of reserved bits to set simplifies the code.
> 
> Yes, but we cannot simply use 'shadow_phys_bits' to check against 46 anymore,
> right?
> 
> For example, if AMD has 52 phys bits, but it reduces 5 or more bits, then
> current KVM code would employ l1tf mitigation, but actually it really
> shouldn't?

What do you mean by "shouldn't"?  Sure, it's not absolutely necessary, but
again setting the bits is free since adjusting the GPA is hardcoded into
mark_mmio_spte() and get_mmio_spte_gfn().

> > > +			(shadow_first_rsvd_bit <
> > > +				52 - shadow_nonpresent_or_rsvd_mask_len))
> > > +		need_l1tf = true;
> > > +	else
> > > +		need_l1tf = false;
> > > +
> > >  	low_phys_bits = boot_cpu_data.x86_phys_bits;
> > > -	if (boot_cpu_data.x86_phys_bits <
> > > -	    52 - shadow_nonpresent_or_rsvd_mask_len) {
> > > +	shadow_nonpresent_or_rsvd_mask = 0;
> > > +	if (need_l1tf) {
> > >  		shadow_nonpresent_or_rsvd_mask =
> > >  			rsvd_bits(boot_cpu_data.x86_phys_bits -
> > >  				  shadow_nonpresent_or_rsvd_mask_len,
> > 
> > This is broken, the reserved bits mask is being calculated with the wrong
> > number of physical bits.  I think fixing this would eliminate the need for
> > the high_gpa_offset shenanigans.
> 
> You are right, it should use 'shadow_phys_bits' instead. Thanks. Let me think about whether
> high_gpa_offset can be avoided.
> 
> > 
> > > @@ -4326,7 +4422,7 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu,
> > >  static void
> > >  __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> > >  			struct rsvd_bits_validate *rsvd_check,
> > > -			int maxphyaddr, int level, bool nx, bool gbpages,
> > > +			int first_rsvd_bit, int level, bool nx, bool gbpages,
> > >  			bool pse, bool amd)
> > 
> > Similar to the earlier comment regarding 'first', it's probably less
> > confusing overall to just leave this as 'maxphyaddr'.
> 
> Both work for me. But maybe 'non_rsvd_maxphyaddr' is better? 

Maybe?  My personal preference would be to stay with maxphyaddr.