Re: [PATCH] kvm: x86: Fix several SPTE mask calculation errors caused by MKTME

On Mon, 2019-04-22 at 09:39 -0700, Sean Christopherson wrote:
> On Tue, Apr 16, 2019 at 09:19:48PM +1200, Kai Huang wrote:
> > With both Intel MKTME and AMD SME/SEV, physical address bits are reduced
> > because several high bits of the physical address are repurposed for
> > memory encryption. To honor this, the repurposed bits are subtracted from
> > cpuinfo_x86->x86_phys_bits for both Intel MKTME and AMD SME/SEV, thus
> > boot_cpu_data.x86_phys_bits no longer holds the physical address bits
> > reported by CPUID.
> 
> This neglects to mention the most relevant tidbit of information in terms
> of justification for this patch: the number of bits stolen for MKTME is
> programmed by BIOS, i.e. bits may be repurposed for MKTME regardless of
> kernel support.

I can add the BIOS part. But the key issue is that the kernel adjusts boot_cpu_data.x86_phys_bits, isn't it?

If the kernel didn't adjust boot_cpu_data.x86_phys_bits, then this patch theoretically would not be needed?

> 
> > KVM uses boot_cpu_data.x86_phys_bits to calculate several SPTE masks
> > based on assumption that: 1) boot_cpu_data.x86_phys_bits equals to
> > physical address bits reported by CPUID -- this is used to check CPU has
> > reserved bits when KVM calculates shadow_mmio_{value|mask}; and whether
> > shadow_nonpresent_or_rsvd_mask should be setup (KVM assumes L1TF is not
> > present if CPU has 52 physical address bits); 2) if it is smaller than
> > 52, bits [x86_phys_bits, 51] are reserved bits.
> > 
> > With Intel MKTME or AMD SME/SEV the above assumption is no longer valid,
> > especially when calculating reserved bits with Intel MKTME: MKTME treats
> > the reduced bits as 'keyID' bits, so they are not reserved bits, and
> > boot_cpu_data.x86_phys_bits cannot be used to calculate reserved bits
> > anymore. It can still be used for AMD SME/SEV, since SME/SEV treats the
> > reduced bits differently -- they are treated as reserved bits, the same
> > as other reserved bits in a page table entry [1].
> > 
> > Fix by introducing a new 'shadow_first_rsvd_bit' variable in kvm x86 MMU
> > code to store the actual position of reserved bits -- for Intel MKTME,
> > it equals to physical address reported by CPUID, and for AMD SME/SEV, it
> > is boot_cpu_data.x86_phys_bits. And in reserved bits related calculation
> > it is used instead of boot_cpu_data.x86_phys_bits. Some other code
> > changes too to make code more reasonable, ie, kvm_set_mmio_spte_mask is
> > moved to x86/kvm/mmu.c from x86/kvm/x86.c to use shadow_first_rsvd_bit;
> > shadow_nonpresent_or_rsvd_mask calculation logic is slightly changed,
> > based on the fact that only Intel CPU is impacted by L1TF, so that KVM
> > can use shadow_first_rsvd_bit to check whether KVM should set
> > shadow_nonpresent_or_rsvd_mask or not.
> > 
> > Note that for the physical address bits reported to guest should remain
> > unchanged -- KVM should report physical address reported by CPUID to
> > guest, but not boot_cpu_data.x86_phys_bits. Because for Intel MKTME,
> > there's no harm if guest sets up 'keyID' bits in guest page table (since
> > MKTME only works at physical address level), and KVM doesn't even expose
> > MKTME to guest. Arguably, for AMD SME/SEV, guest is aware of SEV thus it
> > should adjust boot_cpu_data.x86_phys_bits when it detects SEV, therefore
> > KVM should still report physical address reported by CPUID to guest.
> > 
> > [1] Section 7.10.1 Determining Support for Secure Memory Encryption,
> >     AMD Architecture Programmer's Manual Volume 2: System Programming.
> > Signed-off-by: Kai Huang <kai.huang@xxxxxxxxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu.c | 152 +++++++++++++++++++++++++++++++++++++++++++----------
> >  arch/x86/kvm/mmu.h |   1 +
> >  arch/x86/kvm/x86.c |  29 ----------
> >  3 files changed, 125 insertions(+), 57 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index e10962dfc203..3add7283980e 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -261,6 +261,22 @@ static const u64 shadow_nonpresent_or_rsvd_mask_len = 5;
> >   */
> >  static u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
> >  
> > +/*
> > + * The position of first reserved bit. If it equals to 52, then CPU doesn't
> > + * have any reserved bits, otherwise bits [shadow_first_rsvd_bit, 51] are
> > + * reserved bits.
> > + *
> > + * boot_cpu_data.x86_phys_bits cannot be used to determine the position of
> > + * reserved bits anymore, since both Intel MKTME and AMD SME/SEV reduce
> > + * physical address bits, and the reduced bits are taken away from
> > + * boot_cpu_data.x86_phys_bits to reflect such fact. But Intel MKTME and AMD
> > + * SME/SEV treat those reduced bits differently -- Intel MKTME treats them
> > + * as 'keyID' thus not reserved bits, but AMD SME/SEV treats them as reserved
> > + * bits, thus physical address bits reported by CPUID cannot be used to
> > + * determine reserved bits position either.
> > + */
> 
> No need to rehash the justification for the variable, it's sufficient to
> state *what* the variable tracks.  And the whole first paragraph can be
> dropped if the variable is renamed.
> 
> > +static u64 __read_mostly shadow_first_rsvd_bit;
> 
> Hmm, 'first' is technically incorrect since EPT has reserved attribute bits
> on non-leaf entries.  And 'first' is vague, e.g. it could be interpreted as
> the MSB or LSB.
> 
> In the end, we're still tracking the number of physical address bits, so
> maybe something like 'shadow_phys_bits'?
> 
> E.g. putting it together:
> 
> /*
>  * The number of non-reserved physical address bits irrespective of features
>  * that repurpose software-accessible bits, e.g. MKTME.
>  */
> static u64 __read_mostly shadow_phys_bits;

Fine with me. Will do as you suggested.

> 
> > +
> >  
> >  static void mmu_spte_set(u64 *sptep, u64 spte);
> >  static union kvm_mmu_page_role
> > @@ -303,6 +319,34 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
> >  
> > +void kvm_set_mmio_spte_mask(void)
> 
> Moving this to mmu.c makes sense, but calling it from kvm_arch_init() is
> silly.  The current call site is immediately after kvm_mmu_module_init(),
> I don't see any reason not to just move the call there.

I don't know whether calling kvm_set_mmio_spte_mask() from kvm_arch_init() is silly, but maybe
there's history -- KVM calls kvm_mmu_set_mask_ptes() from kvm_arch_init() too, which may also be
silly by your judgement. I have no problem calling kvm_set_mmio_spte_mask() from
kvm_mmu_module_init(), but IMHO that change is unrelated to this patch, and it's better to have a
separate patch for this purpose if necessary?


> > +{
> > +	u64 mask;
> > +
> > +	/*
> > +	 * Set the reserved bits and the present bit of an paging-structure
> > +	 * entry to generate page fault with PFER.RSV = 1.
> > +	 */
> > +
> > +	/*
> > +	 * Mask the uppermost physical address bit, which would be reserved as
> > +	 * long as the supported physical address width is less than 52.
> > +	 */
> > +	mask = 1ull << 51;
> > +
> > +	/* Set the present bit. */
> > +	mask |= 1ull;
> > +
> > +	/*
> > +	 * If reserved bit is not supported, clear the present bit to disable
> > +	 * mmio page fault.
> > +	 */
> > +	if (IS_ENABLED(CONFIG_X86_64) && shadow_first_rsvd_bit == 52)
> > +		mask &= ~1ull;
> > +
> > +	kvm_mmu_set_mmio_spte_mask(mask, mask);
> > +}
> > +
> >  static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
> >  {
> >  	return sp->role.ad_disabled;
> > @@ -384,12 +428,21 @@ static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
> >  	u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK;
> >  	u64 mask = generation_mmio_spte_mask(gen);
> >  	u64 gpa = gfn << PAGE_SHIFT;
> > +	u64 high_gpa_offset;
> >  
> >  	access &= ACC_WRITE_MASK | ACC_USER_MASK;
> >  	mask |= shadow_mmio_value | access;
> >  	mask |= gpa | shadow_nonpresent_or_rsvd_mask;
> > +	/*
> > +	 * With Intel MKTME, the bits from boot_cpu_data.x86_phys_bits to
> > +	 * shadow_first_rsvd_bit - 1 are actually keyID bits but not
> > +	 * reserved bits. We need to put high GPA bits to actual reserved
> > +	 * bits to mitigate L1TF attack.
> > +	 */
> > +	high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len +
> > +		shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits;
> >  	mask |= (gpa & shadow_nonpresent_or_rsvd_mask)
> > -		<< shadow_nonpresent_or_rsvd_mask_len;
> > +		<< high_gpa_offset;
> >  
> >  	page_header(__pa(sptep))->mmio_cached = true;
> >  
> > @@ -405,8 +458,11 @@ static bool is_mmio_spte(u64 spte)
> >  static gfn_t get_mmio_spte_gfn(u64 spte)
> >  {
> >  	u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask;
> > +	/* See comments in mark_mmio_spte */
> > +	u64 high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len +
> > +		shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits;
> 
> The exact shift needed is constant after init, there should be no need to
> dynamically calculate it for every MMIO SPTE.

I can add one more static variable such as 'shadow_high_gfn_offset' and calculate it in
kvm_mmu_reset_all_pte_masks() (if necessary, as you mentioned below)? Does that work for you?
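
To illustrate the idea of computing the shift once at init, here is a minimal
userspace sketch. It is a simplified model, not the actual KVM code: the struct
and function names (mmio_layout, mmio_layout_init, encode_gpa, decode_gpa,
bit_range) are hypothetical, and the generation, MMIO value and access bits of a
real MMIO SPTE are left out.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT    12
#define RSVD_MASK_LEN 5   /* mirrors shadow_nonpresent_or_rsvd_mask_len */

/* Hypothetical container for values that are constant after init. */
struct mmio_layout {
	uint64_t rsvd_mask;        /* bits [phys_bits - 5, phys_bits - 1] */
	uint64_t lower_gfn_mask;   /* bits [PAGE_SHIFT, phys_bits - 6] */
	unsigned high_gfn_offset;  /* shift for the high GPA bits */
};

/* Mask with bits [lo, hi] set, both ends inclusive. */
static uint64_t bit_range(unsigned lo, unsigned hi)
{
	assert(lo <= hi && hi < 64);
	return (~0ULL >> (63 - hi)) & (~0ULL << lo);
}

/* Compute the masks and the shift once, as the patch's formula does per call. */
static struct mmio_layout mmio_layout_init(unsigned x86_phys_bits,
					   unsigned first_rsvd_bit)
{
	struct mmio_layout l;

	assert(x86_phys_bits > PAGE_SHIFT + RSVD_MASK_LEN);
	assert(first_rsvd_bit >= x86_phys_bits);
	assert(first_rsvd_bit + RSVD_MASK_LEN <= 52);

	l.rsvd_mask = bit_range(x86_phys_bits - RSVD_MASK_LEN, x86_phys_bits - 1);
	l.lower_gfn_mask = bit_range(PAGE_SHIFT,
				     x86_phys_bits - RSVD_MASK_LEN - 1);
	l.high_gfn_offset = RSVD_MASK_LEN + first_rsvd_bit - x86_phys_bits;
	return l;
}

/* Set the mask bits and stash the displaced GPA bits above first_rsvd_bit. */
static uint64_t encode_gpa(const struct mmio_layout *l, uint64_t gpa)
{
	uint64_t spte = gpa | l->rsvd_mask;

	spte |= (gpa & l->rsvd_mask) << l->high_gfn_offset;
	return spte;
}

/* Recover the GPA from the low bits plus the stashed high bits. */
static uint64_t decode_gpa(const struct mmio_layout *l, uint64_t spte)
{
	uint64_t gpa = spte & l->lower_gfn_mask;

	gpa |= (spte >> l->high_gfn_offset) & l->rsvd_mask;
	return gpa;
}
```

With made-up numbers, say CPUID reports 46 bits and x86_phys_bits is 40 because
6 bits went to keyIDs, the offset comes out as 5 + 46 - 40 = 11, and a
page-aligned GPA survives the encode/decode round trip.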

> 
> > -	gpa |= (spte >> shadow_nonpresent_or_rsvd_mask_len)
> > +	gpa |= (spte >> high_gpa_offset)
> >  	       & shadow_nonpresent_or_rsvd_mask;
> >  
> >  	return gpa >> PAGE_SHIFT;
> > @@ -470,9 +526,22 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
> >  
> > +static u8 kvm_get_cpuid_phys_bits(void)
> > +{
> > +	u32 eax, ebx, ecx, edx;
> > +
> > +	if (boot_cpu_data.extended_cpuid_level < 0x80000008)
> > +		return boot_cpu_data.x86_phys_bits;
> > +
> > +	cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
> > +
> > +	return eax & 0xff;
> 
> cpuid_eax() will do most of the work for you.

Sure, will do. Thanks.

> 
> > +}
> > +
> >  static void kvm_mmu_reset_all_pte_masks(void)
> >  {
> >  	u8 low_phys_bits;
> > +	bool need_l1tf;
> >  
> >  	shadow_user_mask = 0;
> >  	shadow_accessed_mask = 0;
> > @@ -484,13 +553,40 @@ static void kvm_mmu_reset_all_pte_masks(void)
> >  	shadow_acc_track_mask = 0;
> >  
> >  	/*
> > -	 * If the CPU has 46 or less physical address bits, then set an
> > -	 * appropriate mask to guard against L1TF attacks. Otherwise, it is
> > +	 * Calculate the first reserved bit position. Although both Intel
> > +	 * MKTME and AMD SME/SEV reduce physical address bits for memory
> > +	 * encryption (and boot_cpu_data.x86_phys_bits is reduced to reflect
> > +	 * such fact), they treat those reduced bits differently -- Intel
> > +	 * MKTME treats those as 'keyID' thus not reserved bits, but AMD
> > +	 * SME/SEV still treats those bits as reserved bits, so for AMD
> > +	 * shadow_first_rsvd_bit is boot_cpu_data.x86_phys_bits, but for
> > +	 * Intel (and other x86 vendors that don't support memory encryption
> > +	 * at all), shadow_first_rsvd_bit is physical address bits reported
> > +	 * by CPUID.
> > +	 */
> > +	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
> 
> Checking for a specific vendor should only be done as a last resort, e.g.
> I believe this will break Hygon CPUs w/ SME/SEV.  

This can be easily adjusted in the future. I can add a Hygon check to this patch.

> E.g. the existing
> "is AMD" check in mmu.c looks at shadow_x_mask, not a CPUID vendor.

shadow_x_mask hasn't been calculated here. It is calculated later in kvm_mmu_set_mask_ptes(), which
is called from kvm_arch_init() after kvm_x86_ops is set up.

And does this justify "checking the CPU vendor should be a last resort"? Other kernel code checks
CPU vendors too.


> {MK}TME provides a CPUID feature bit and MSRs to query support, and this
> code only runs at module init so the overhead of a RDMSR is negligible. 

Yes, I can check the CPUID feature bit, but I'm not sure why that is better than checking the CPU vendor.

> 
> > +		shadow_first_rsvd_bit = boot_cpu_data.x86_phys_bits;
> > +	else
> > +		shadow_first_rsvd_bit = kvm_get_cpuid_phys_bits();
> 
> If you rename the helper to be less specific and actually check for MKTME
> support then the MKTME comment doesn't need to be as verbose, e.g.: 
> 
> static u8 kvm_get_shadow_phys_bits(void)
> {
> 	if (!<has MKTME> ||
> 	    WARN_ON_ONCE(boot_cpu_data.extended_cpuid_level < 0x80000008))
> 		return boot_cpu_data.x86_phys_bits;

Why do we need WARN_ON_ONCE here?

I don't have a problem doing it as you suggested, but I don't get why checking the CPU vendor is a
last resort?

> 
> 	/*
> 	 * MKTME steals physical address bits for key IDs, but the key ID bits
> 	 * are not treated as reserved.  x86_phys_bits is adjusted to account
> 	 * for the stolen bits, use CPUID.MAX_PA_WIDTH directly which reports
> 	 * the number of software-available bits irrespective of MKTME.
> 	 */
> 	return cpuid_eax(0x80000008) & 0xff;
> }
> 
> > +
> > +	/*
> > +	 * Only Intel is impacted by L1TF, therefore for AMD and other x86
> > +	 * vendors L1TF mitigation is not needed.
> > +	 *
> > +	 * For Intel CPU, if it has 46 or less physical address bits, then set
> > +	 * an appropriate mask to guard against L1TF attacks. Otherwise, it is
> >  	 * assumed that the CPU is not vulnerable to L1TF.
> >  	 */
> > +	if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
> 
> Again, checking vendors is bad.  Not employing a mitigation technique due
> to an assumption about what processors are affected by a vulnerability is
> even worse.  

Why? Isn't it a fact that L1TF only impacts Intel CPUs? What's the benefit of employing a mitigation
on CPUs that don't have the L1TF issue? Such a mitigation only makes performance worse, even if not
noticeably. For example, why not employ the mitigation regardless of the number of physical address
bits at all?

> The existing code makes the assumption regarding processors
> with >46 bits of address space because a) no such processor existed before
> the discovery of L1TF, and it's reasonable to assume hardware vendors
> won't ship future processors with such an obvious vulnerability, and b)
> hardcoding the number of reserved bits to set simplifies the code.

Yes, but we cannot simply check 'shadow_phys_bits' against 46 anymore, right?

For example, if an AMD CPU has 52 phys bits but reduces 6 or more of them, then the current KVM code
would employ the L1TF mitigation, when it really shouldn't?
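
To make the threshold concrete: the existing check is a strict comparison
against 52 - shadow_nonpresent_or_rsvd_mask_len = 47, so it fires for a width of
46 or below. A minimal userspace sketch of just that comparison follows; the
helper name l1tf_mask_needed is hypothetical and the bit counts are made-up
examples.

```c
#include <assert.h>
#include <stdbool.h>

#define RSVD_MASK_LEN 5   /* mirrors shadow_nonpresent_or_rsvd_mask_len */

/*
 * Same comparison as the existing check in kvm_mmu_reset_all_pte_masks(),
 * applied to whatever physical address width is passed in.
 */
static bool l1tf_mask_needed(unsigned phys_bits)
{
	assert(phys_bits <= 52);   /* x86 physical addresses top out at 52 bits */
	return phys_bits < 52 - RSVD_MASK_LEN;
}
```

So a CPU that reports 52 bits in CPUID but whose x86_phys_bits has dropped to 46
would get the mask purely because of the encryption bits, which is the case I am
worried about.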

> 
> > +			(shadow_first_rsvd_bit <
> > +				52 - shadow_nonpresent_or_rsvd_mask_len))
> > +		need_l1tf = true;
> > +	else
> > +		need_l1tf = false;
> > +
> >  	low_phys_bits = boot_cpu_data.x86_phys_bits;
> > -	if (boot_cpu_data.x86_phys_bits <
> > -	    52 - shadow_nonpresent_or_rsvd_mask_len) {
> > +	shadow_nonpresent_or_rsvd_mask = 0;
> > +	if (need_l1tf) {
> >  		shadow_nonpresent_or_rsvd_mask =
> >  			rsvd_bits(boot_cpu_data.x86_phys_bits -
> >  				  shadow_nonpresent_or_rsvd_mask_len,
> 
> This is broken, the reserved bits mask is being calculated with the wrong
> number of physical bits.  I think fixing this would eliminate the need for
> the high_gpa_offset shenanigans.

You are right, it should use 'shadow_phys_bits' instead. Thanks. Let me think about whether
high_gpa_offset can be avoided.
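
As a rough check of that, here is a small userspace sketch comparing the mask
computed from the two widths. The rsvd_bits() below is only modeled on the KVM
helper of the same name, nonpresent_or_rsvd_mask() is a hypothetical wrapper,
and the widths (46 from CPUID, 40 in x86_phys_bits) are made-up examples.

```c
#include <assert.h>
#include <stdint.h>

#define RSVD_MASK_LEN 5   /* mirrors shadow_nonpresent_or_rsvd_mask_len */

/* Mask with bits [s, e] set, both ends inclusive. */
static uint64_t rsvd_bits(unsigned s, unsigned e)
{
	assert(s <= e && e < 63);
	return ((1ULL << (e - s + 1)) - 1) << s;
}

/* The top RSVD_MASK_LEN bits below the given physical address width. */
static uint64_t nonpresent_or_rsvd_mask(unsigned phys_bits)
{
	assert(phys_bits > RSVD_MASK_LEN);
	return rsvd_bits(phys_bits - RSVD_MASK_LEN, phys_bits - 1);
}
```

With these numbers, the mask built from 46 covers bits [41, 45], and shifting it
by the fixed 5 lands in bits [46, 50], which are reserved. The mask built from
40 shifted by 5 lands in bits [40, 44], which would be keyID bits, hence the
extra offset in this version of the patch.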

> 
> > @@ -4326,7 +4422,7 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu,
> >  static void
> >  __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> >  			struct rsvd_bits_validate *rsvd_check,
> > -			int maxphyaddr, int level, bool nx, bool gbpages,
> > +			int first_rsvd_bit, int level, bool nx, bool gbpages,
> >  			bool pse, bool amd)
> 
> Similar to the earlier comment regarding 'first', it's probably less
> confusing overall to just leave this as 'maxphyaddr'.

Both work for me. But maybe 'non_rsvd_maxphyaddr' is better? 

Thanks,
-Kai



