On Tue, Apr 16, 2019 at 09:19:48PM +1200, Kai Huang wrote:
> With both Intel MKTME and AMD SME/SEV, physical address bits are reduced
> due to several high bits of physical address are repurposed for memory
> encryption. To honor such behavior those repurposed bits are reduced from
> cpuinfo_x86->x86_phys_bits for both Intel MKTME and AMD SME/SEV, thus
> boot_cpu_data.x86_phys_bits doesn't hold physical address bits reported
> by CPUID anymore.

This neglects to mention the most relevant tidbit of information in terms
of justification for this patch: the number of bits stolen for MKTME is
programmed by BIOS, i.e. bits may be repurposed for MKTME regardless of
kernel support.

> KVM uses boot_cpu_data.x86_phys_bits to calculate several SPTE masks
> based on assumption that: 1) boot_cpu_data.x86_phys_bits equals to
> physical address bits reported by CPUID -- this is used to check CPU has
> reserved bits when KVM calculates shadow_mmio_{value|mask}; and whether
> shadow_nonpresent_or_rsvd_mask should be setup (KVM assumes L1TF is not
> present if CPU has 52 physical address bits); 2) if it is smaller than
> 52, bits [x86_phys_bits, 51] are reserved bits.
>
> With Intel MKTME or AMD SME/SEV above assumption is not valid any more,
> especially when calculating reserved bits with Intel MKTME, since Intel
> MKTME treats the reduced bits as 'keyID', thus they are not reduced
> bits, therefore boot_cpu_data.x86_phys_bits cannot be used to calcualte
> reserved bits anymore, although we can still use it for AMD SME/SEV
> since SME/SEV treats the reduced bits differently -- they are treated as
> reserved bits, the same as other reserved bits in page table entity [1].
>
> Fix by introducing a new 'shadow_first_rsvd_bit' variable in kvm x86 MMU
> code to store the actual position of reserved bits -- for Intel MKTME,
> it equals to physical address reported by CPUID, and for AMD SME/SEV, it
> is boot_cpu_data.x86_phys_bits. And in reserved bits related calculation
> it is used instead of boot_cpu_data.x86_phys_bits. Some other code
> changes too to make code more reasonable, ie, kvm_set_mmio_spte_mask is
> moved to x86/kvm/mmu.c from x86/kvm/x86.c to use shadow_first_rsvd_bit;
> shadow_nonpresent_or_rsvd_mask calculation logic is slightly changed,
> based on the fact that only Intel CPU is impacted by L1TF, so that KVM
> can use shadow_first_rsvd_bit to check whether KVM should set
> shadow_nonpresent_or_rsvd_mask or not.
>
> Note that for the physical address bits reported to guest should remain
> unchanged -- KVM should report physical address reported by CPUID to
> guest, but not boot_cpu_data.x86_phys_bits. Because for Intel MKTME,
> there's no harm if guest sets up 'keyID' bits in guest page table (since
> MKTME only works at physical address level), and KVM doesn't even expose
> MKTME to guest. Arguably, for AMD SME/SEV, guest is aware of SEV thus it
> should adjust boot_cpu_data.x86_phys_bits when it detects SEV, therefore
> KVM should still reports physcial address reported by CPUID to guest.
>
> [1] Section 7.10.1 Determining Support for Secure Memory Encryption,
> AMD Architecture Programmer's Manual Volume 2: System Programming).
> Signed-off-by: Kai Huang <kai.huang@xxxxxxxxxxxxxxx>
> ---
>  arch/x86/kvm/mmu.c | 152 +++++++++++++++++++++++++++++++++++++++++++----------
>  arch/x86/kvm/mmu.h |   1 +
>  arch/x86/kvm/x86.c |  29 ----------
>  3 files changed, 125 insertions(+), 57 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index e10962dfc203..3add7283980e 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -261,6 +261,22 @@ static const u64 shadow_nonpresent_or_rsvd_mask_len = 5;
>   */
>  static u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
>
> +/*
> + * The position of first reserved bit. If it equals to 52, then CPU doesn't
> + * have any reserved bits, otherwise bits [shadow_first_rsvd_bit, 51] are
> + * reserved bits.
> + *
> + * boot_cpu_data.x86_phys_bits cannot be used to determine the position of
> + * reserved bits anymore, since both Intel MKTME and AMD SME/SEV reduce
> + * physical address bits, and the reduced bits are taken away from
> + * boot_cpu_data.x86_phys_bits to reflect such fact. But Intel MKTME and AMD
> + * SME/SEV treat those reduced bits differently -- Intel MKTME treats them
> + * as 'keyID' thus not reserved bits, but AMD SME/SEV treats them as reserved
> + * bits, thus physical address bits reported by CPUID cannot be used to
> + * determine reserved bits position either.
> + */

No need to rehash the justification for the variable, it's sufficient to
state *what* the variable tracks. And the whole first paragraph can be
dropped if the variable is renamed.

> +static u64 __read_mostly shadow_first_rsvd_bit;

Hmm, 'first' is technically incorrect since EPT has reserved attribute
bits on non-leaf entries. And 'first' is vague, e.g. it could be
interpreted as the MSB or LSB. In the end, we're still tracking the
number of physical address bits, so maybe something like
'shadow_phys_bits'?

E.g. putting it together:

/*
 * The number of non-reserved physical address bits irrespective of features
 * that repurpose software-accessible bits, e.g. MKTME.
 */
static u64 __read_mostly shadow_phys_bits;

> +
>
>  static void mmu_spte_set(u64 *sptep, u64 spte);
>  static union kvm_mmu_page_role
> @@ -303,6 +319,34 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value)
>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
>
> +void kvm_set_mmio_spte_mask(void)

Moving this to mmu.c makes sense, but calling it from kvm_arch_init() is
silly. The current call site is immediately after kvm_mmu_module_init(),
I don't see any reason not to just move the call there.

> +{
> +        u64 mask;
> +
> +        /*
> +         * Set the reserved bits and the present bit of an paging-structure
> +         * entry to generate page fault with PFER.RSV = 1.
> +         */
> +
> +        /*
> +         * Mask the uppermost physical address bit, which would be reserved as
> +         * long as the supported physical address width is less than 52.
> +         */
> +        mask = 1ull << 51;
> +
> +        /* Set the present bit. */
> +        mask |= 1ull;
> +
> +        /*
> +         * If reserved bit is not supported, clear the present bit to disable
> +         * mmio page fault.
> +         */
> +        if (IS_ENABLED(CONFIG_X86_64) && shadow_first_rsvd_bit == 52)
> +                mask &= ~1ull;
> +
> +        kvm_mmu_set_mmio_spte_mask(mask, mask);
> +}
> +
>  static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
>  {
>          return sp->role.ad_disabled;
> @@ -384,12 +428,21 @@ static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
>          u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK;
>          u64 mask = generation_mmio_spte_mask(gen);
>          u64 gpa = gfn << PAGE_SHIFT;
> +        u64 high_gpa_offset;
>
>          access &= ACC_WRITE_MASK | ACC_USER_MASK;
>          mask |= shadow_mmio_value | access;
>          mask |= gpa | shadow_nonpresent_or_rsvd_mask;
> +        /*
> +         * With Intel MKTME, the bits from boot_cpu_data.x86_phys_bits to
> +         * shadow_first_rsvd_bit - 1 are actually keyID bits but not
> +         * reserved bits. We need to put high GPA bits to actual reserved
> +         * bits to mitigate L1TF attack.
> +         */
> +        high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len +
> +                shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits;
>          mask |= (gpa & shadow_nonpresent_or_rsvd_mask)
> -                << shadow_nonpresent_or_rsvd_mask_len;
> +                << high_gpa_offset;
>
>          page_header(__pa(sptep))->mmio_cached = true;
>
> @@ -405,8 +458,11 @@ static bool is_mmio_spte(u64 spte)
>  static gfn_t get_mmio_spte_gfn(u64 spte)
>  {
>          u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask;
> +        /* See comments in mark_mmio_spte */
> +        u64 high_gpa_offset = shadow_nonpresent_or_rsvd_mask_len +
> +                shadow_first_rsvd_bit - boot_cpu_data.x86_phys_bits;

The exact shift needed is constant after init, there should be no need to
dynamically calculate it for every MMIO SPTE.

> -        gpa |= (spte >> shadow_nonpresent_or_rsvd_mask_len)
> +        gpa |= (spte >> high_gpa_offset)
>                 & shadow_nonpresent_or_rsvd_mask;
>
>          return gpa >> PAGE_SHIFT;
> @@ -470,9 +526,22 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
>
> +static u8 kvm_get_cpuid_phys_bits(void)
> +{
> +        u32 eax, ebx, ecx, edx;
> +
> +        if (boot_cpu_data.extended_cpuid_level < 0x80000008)
> +                return boot_cpu_data.x86_phys_bits;
> +
> +        cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
> +
> +        return eax & 0xff;

cpuid_eax() will do most of the work for you.

> +}
> +
>  static void kvm_mmu_reset_all_pte_masks(void)
>  {
>          u8 low_phys_bits;
> +        bool need_l1tf;
>
>          shadow_user_mask = 0;
>          shadow_accessed_mask = 0;
> @@ -484,13 +553,40 @@ static void kvm_mmu_reset_all_pte_masks(void)
>          shadow_acc_track_mask = 0;
>
>          /*
> -         * If the CPU has 46 or less physical address bits, then set an
> -         * appropriate mask to guard against L1TF attacks. Otherwise, it is
> +         * Calcualte the first reserved bit position. Although both Intel
> +         * MKTME and AMD SME/SEV reduce physical address bits for memory
> +         * encryption (and boot_cpu_data.x86_phys_bits is reduced to reflect
> +         * such fact), they treat those reduced bits differently -- Intel
> +         * MKTME treats those as 'keyID' thus not reserved bits, but AMD
> +         * SME/SEV still treats those bits as reserved bits, so for AMD
> +         * shadow_first_rsvd_bit is boot_cpu_data.x86_phys_bits, but for
> +         * Intel (and other x86 vendors that don't support memory encryption
> +         * at all), shadow_first_rsvd_bit is physical address bits reported
> +         * by CPUID.
> +         */
> +        if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)

Checking for a specific vendor should only be done as a last resort, e.g.
I believe this will break Hygon CPUs w/ SME/SEV. E.g. the existing
"is AMD" check in mmu.c looks at shadow_x_mask, not a CPUID vendor.
{MK}TME provides a CPUID feature bit and MSRs to query support, and this
code only runs at module init so the overhead of a RDMSR is negligible.

> +                shadow_first_rsvd_bit = boot_cpu_data.x86_phys_bits;
> +        else
> +                shadow_first_rsvd_bit = kvm_get_cpuid_phys_bits();

If you rename the helper to be less specific and actually check for MKTME
support then the MKTME comment doesn't need to be as verbose, e.g.:

static u8 kvm_get_shadow_phys_bits(void)
{
        if (!<has MKTME> ||
            WARN_ON_ONCE(boot_cpu_data.extended_cpuid_level < 0x80000008))
                return boot_cpu_data.x86_phys_bits;

        /*
         * MKTME steals physical address bits for key IDs, but the key ID bits
         * are not treated as reserved. x86_phys_bits is adjusted to account
         * for the stolen bits, use CPUID.MAX_PA_WIDTH directly which reports
         * the number of software-available bits irrespective of MKTME.
         */
        return cpuid_eax(0x80000008) & 0xff;
}

> +
> +        /*
> +         * Only Intel is impacted by L1TF, therefore for AMD and other x86
> +         * vendors L1TF mitigation is not needed.
> +         *
> +         * For Intel CPU, if it has 46 or less physical address bits, then set
> +         * an appropriate mask to guard against L1TF attacks. Otherwise, it is
>           * assumed that the CPU is not vulnerable to L1TF.
>           */
> +        if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&

Again, checking vendors is bad. Not employing a mitigation technique due
to an assumption about what processors are affected by a vulnerability is
even worse. The existing code makes the assumption regarding processors
with >46 bits of address space because a) no such processor existed before
the discovery of L1TF, and it's reasonable to assume hardware vendors
won't ship future processors with such an obvious vulnerability, and
b) hardcoding the number of reserved bits to set simplifies the code.

> +            (shadow_first_rsvd_bit <
> +             52 - shadow_nonpresent_or_rsvd_mask_len))
> +                need_l1tf = true;
> +        else
> +                need_l1tf = false;
> +
>          low_phys_bits = boot_cpu_data.x86_phys_bits;
> -        if (boot_cpu_data.x86_phys_bits <
> -            52 - shadow_nonpresent_or_rsvd_mask_len) {
> +        shadow_nonpresent_or_rsvd_mask = 0;
> +        if (need_l1tf) {
>                  shadow_nonpresent_or_rsvd_mask =
>                          rsvd_bits(boot_cpu_data.x86_phys_bits -
>                                    shadow_nonpresent_or_rsvd_mask_len,

This is broken, the reserved bits mask is being calculated with the wrong
number of physical bits. I think fixing this would eliminate the need for
the high_gpa_offset shenanigans.

> @@ -4326,7 +4422,7 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu,
>  static void
>  __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
>                          struct rsvd_bits_validate *rsvd_check,
> -                        int maxphyaddr, int level, bool nx, bool gbpages,
> +                        int first_rsvd_bit, int level, bool nx, bool gbpages,
>                          bool pse, bool amd)

Similar to the earlier comment regarding 'first', it's probably less
confusing overall to just leave this as 'maxphyaddr'.
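Regarding the "this is broken" comment above, the below is roughly what
I'm thinking for the mask calculation, i.e. derive everything from
shadow_phys_bits so the GPA gets stashed in true reserved bits.
Completely untested, and it assumes the shadow_phys_bits rename sticks:

        /*
         * If the CPU has 46 or less physical address bits, then set an
         * appropriate mask to guard against L1TF attacks. Otherwise, it is
         * assumed that the CPU is not vulnerable to L1TF. Note, the GPA is
         * stashed above the usable physical bits, i.e. in true reserved
         * bits, so MKTME's key ID bits don't require an extra shift.
         */
        low_phys_bits = shadow_phys_bits;
        shadow_nonpresent_or_rsvd_mask = 0;
        if (shadow_phys_bits < 52 - shadow_nonpresent_or_rsvd_mask_len) {
                low_phys_bits -= shadow_nonpresent_or_rsvd_mask_len;
                shadow_nonpresent_or_rsvd_mask =
                        rsvd_bits(low_phys_bits, shadow_phys_bits - 1);
        }

        shadow_nonpresent_or_rsvd_lower_gfn_mask =
                GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);

With that, mark_mmio_spte() and get_mmio_spte_gfn() can keep using the
fixed shadow_nonpresent_or_rsvd_mask_len shift.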
> {
>          u64 exb_bit_rsvd = 0;
> @@ -4369,14 +4465,14 @@ __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
>                  break;
>          case PT32E_ROOT_LEVEL:
>                  rsvd_check->rsvd_bits_mask[0][2] =
> -                        rsvd_bits(maxphyaddr, 63) |
> +                        rsvd_bits(first_rsvd_bit, 63) |
>                          rsvd_bits(5, 8) | rsvd_bits(1, 2);      /* PDPTE */
>                  rsvd_check->rsvd_bits_mask[0][1] = exb_bit_rsvd |
> -                        rsvd_bits(maxphyaddr, 62);      /* PDE */
> +                        rsvd_bits(first_rsvd_bit, 62);  /* PDE */
>                  rsvd_check->rsvd_bits_mask[0][0] = exb_bit_rsvd |
> -                        rsvd_bits(maxphyaddr, 62);      /* PTE */
> +                        rsvd_bits(first_rsvd_bit, 62);  /* PTE */
>                  rsvd_check->rsvd_bits_mask[1][1] = exb_bit_rsvd |
> -                        rsvd_bits(maxphyaddr, 62) |
> +                        rsvd_bits(first_rsvd_bit, 62) |
>                          rsvd_bits(13, 20);              /* large page */
>                  rsvd_check->rsvd_bits_mask[1][0] =
>                          rsvd_check->rsvd_bits_mask[0][0];
> @@ -4384,28 +4480,28 @@ __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
>          case PT64_ROOT_5LEVEL:
>                  rsvd_check->rsvd_bits_mask[0][4] = exb_bit_rsvd |
>                          nonleaf_bit8_rsvd | rsvd_bits(7, 7) |
> -                        rsvd_bits(maxphyaddr, 51);
> +                        rsvd_bits(first_rsvd_bit, 51);
>                  rsvd_check->rsvd_bits_mask[1][4] =
>                          rsvd_check->rsvd_bits_mask[0][4];
>                  /* fall through */
>          case PT64_ROOT_4LEVEL:
>                  rsvd_check->rsvd_bits_mask[0][3] = exb_bit_rsvd |
>                          nonleaf_bit8_rsvd | rsvd_bits(7, 7) |
> -                        rsvd_bits(maxphyaddr, 51);
> +                        rsvd_bits(first_rsvd_bit, 51);
>                  rsvd_check->rsvd_bits_mask[0][2] = exb_bit_rsvd |
>                          nonleaf_bit8_rsvd | gbpages_bit_rsvd |
> -                        rsvd_bits(maxphyaddr, 51);
> +                        rsvd_bits(first_rsvd_bit, 51);
>                  rsvd_check->rsvd_bits_mask[0][1] = exb_bit_rsvd |
> -                        rsvd_bits(maxphyaddr, 51);
> +                        rsvd_bits(first_rsvd_bit, 51);
>                  rsvd_check->rsvd_bits_mask[0][0] = exb_bit_rsvd |
> -                        rsvd_bits(maxphyaddr, 51);
> +                        rsvd_bits(first_rsvd_bit, 51);
>                  rsvd_check->rsvd_bits_mask[1][3] =
>                          rsvd_check->rsvd_bits_mask[0][3];
>                  rsvd_check->rsvd_bits_mask[1][2] = exb_bit_rsvd |
> -                        gbpages_bit_rsvd | rsvd_bits(maxphyaddr, 51) |
> +                        gbpages_bit_rsvd | rsvd_bits(first_rsvd_bit, 51) |
>                          rsvd_bits(13, 29);
>                  rsvd_check->rsvd_bits_mask[1][1] = exb_bit_rsvd |
> -                        rsvd_bits(maxphyaddr, 51) |
> +                        rsvd_bits(first_rsvd_bit, 51) |
>                          rsvd_bits(13, 20);              /* large page */
>                  rsvd_check->rsvd_bits_mask[1][0] =
>                          rsvd_check->rsvd_bits_mask[0][0];
> @@ -4425,27 +4521,27 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
>
>  static void
>  __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check,
> -                            int maxphyaddr, bool execonly)
> +                            int first_rsvd_bit, bool execonly)
>  {
>          u64 bad_mt_xwr;
>
>          rsvd_check->rsvd_bits_mask[0][4] =
> -                rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
> +                rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 7);
>          rsvd_check->rsvd_bits_mask[0][3] =
> -                rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
> +                rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 7);
>          rsvd_check->rsvd_bits_mask[0][2] =
> -                rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> +                rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 6);
>          rsvd_check->rsvd_bits_mask[0][1] =
> -                rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> -        rsvd_check->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
> +                rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(3, 6);
> +        rsvd_check->rsvd_bits_mask[0][0] = rsvd_bits(first_rsvd_bit, 51);
>
>          /* large page */
>          rsvd_check->rsvd_bits_mask[1][4] = rsvd_check->rsvd_bits_mask[0][4];
>          rsvd_check->rsvd_bits_mask[1][3] = rsvd_check->rsvd_bits_mask[0][3];
>          rsvd_check->rsvd_bits_mask[1][2] =
> -                rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
> +                rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(12, 29);
>          rsvd_check->rsvd_bits_mask[1][1] =
> -                rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
> +                rsvd_bits(first_rsvd_bit, 51) | rsvd_bits(12, 20);
>          rsvd_check->rsvd_bits_mask[1][0] = rsvd_check->rsvd_bits_mask[0][0];
>
>          bad_mt_xwr = 0xFFull << (2 * 8);        /* bits 3..5 must not be 2 */
> @@ -4523,13 +4619,13 @@ reset_tdp_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
>
>          if (boot_cpu_is_amd())
>                  __reset_rsvds_bits_mask(vcpu, shadow_zero_check,
> -                                        boot_cpu_data.x86_phys_bits,
> +                                        shadow_first_rsvd_bit,
>                                          context->shadow_root_level, false,
>                                          boot_cpu_has(X86_FEATURE_GBPAGES),
>                                          true, true);
>          else
>                  __reset_rsvds_bits_mask_ept(shadow_zero_check,
> -                                            boot_cpu_data.x86_phys_bits,
> +                                            shadow_first_rsvd_bit,
>                                              false);
>
>          if (!shadow_me_mask)
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 54c2a377795b..45e579e112ea 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -51,6 +51,7 @@ static inline u64 rsvd_bits(int s, int e)
>          return ((1ULL << (e - s + 1)) - 1) << s;
>  }
>
> +void kvm_set_mmio_spte_mask(void);
>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value);
>
>  void
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 38440316a806..3a71aab495b8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6848,35 +6848,6 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
>          .get_guest_ip           = kvm_get_guest_ip,
>  };
>
> -static void kvm_set_mmio_spte_mask(void)
> -{
> -        u64 mask;
> -        int maxphyaddr = boot_cpu_data.x86_phys_bits;
> -
> -        /*
> -         * Set the reserved bits and the present bit of an paging-structure
> -         * entry to generate page fault with PFER.RSV = 1.
> -         */
> -
> -        /*
> -         * Mask the uppermost physical address bit, which would be reserved as
> -         * long as the supported physical address width is less than 52.
> -         */
> -        mask = 1ull << 51;
> -
> -        /* Set the present bit. */
> -        mask |= 1ull;
> -
> -        /*
> -         * If reserved bit is not supported, clear the present bit to disable
> -         * mmio page fault.
> -         */
> -        if (IS_ENABLED(CONFIG_X86_64) && maxphyaddr == 52)
> -                mask &= ~1ull;
> -
> -        kvm_mmu_set_mmio_spte_mask(mask, mask);
> -}
> -
>  #ifdef CONFIG_X86_64
>  static void pvclock_gtod_update_fn(struct work_struct *work)
>  {
> --
> 2.13.6
>
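As for the "<has MKTME>" check in the kvm_get_shadow_phys_bits() sketch
above, I haven't tested any of this, but something along the lines of the
below should do the trick. Note the TME_ACTIVATE MSR and bit definitions
currently live as local #defines in detect_tme() in
arch/x86/kernel/cpu/intel.c, so they'd need to be duplicated or moved
somewhere shareable:

#define MSR_IA32_TME_ACTIVATE           0x982
#define TME_ACTIVATE_LOCKED(x)          ((x) & 0x1)
#define TME_ACTIVATE_ENABLED(x)         ((x) & 0x2)
#define TME_ACTIVATE_KEYID_BITS(x)      (((x) >> 32) & 0xf)     /* Bits 35:32 */

static bool kvm_mktme_enabled(void)
{
        u64 tme_activate;

        if (!boot_cpu_has(X86_FEATURE_TME))
                return false;

        rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);

        /*
         * Key ID bits are stolen from the physical address space only if
         * MKTME is locked and enabled with a non-zero number of key ID bits.
         */
        return TME_ACTIVATE_LOCKED(tme_activate) &&
               TME_ACTIVATE_ENABLED(tme_activate) &&
               TME_ACTIVATE_KEYID_BITS(tme_activate);
}

Checking only the feature bit would also be ok, since reading
CPUID.MAX_PA_WIDTH directly is harmless when no key ID bits are stolen.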