Re: [RFC PATCH v5 033/104] KVM: x86: Add infrastructure for stolen GPA bits

Isaku Yamahata <isaku.yamahata@xxxxxxxxx> · Thu, 31 Mar 2022 19:34:26 -0700

Added Peng Chao.

On Fri, Apr 01, 2022 at 12:16:41AM +1300,
Kai Huang <kai.huang@xxxxxxxxx> wrote:

> On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@xxxxxxxxx wrote:
> > From: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
> > 
> > Add support in KVM's MMU for aliasing multiple GPAs (from a hardware
> > perspective) to a single GPA (from a memslot perspective). GPA aliasing
> > will be used to repurpose GPA bits as attribute bits, e.g. to expose an
> > execute-only permission bit to the guest. To keep the implementation
> > simple (relatively speaking), GPA aliasing is only supported via TDP.
> > 
> > Today KVM assumes two things that are broken by GPA aliasing.
> >   1. GPAs coming from hardware can be simply shifted to get the GFNs.
> >   2. GPA bits 51:MAXPHYADDR are reserved to zero.
> > 
> > With GPA aliasing, translating a GPA to GFN requires masking off the
> > repurposed bit, and a repurposed bit may reside in 51:MAXPHYADDR.
> > 
> > To support GPA aliasing, introduce the concept of per-VM GPA stolen bits,
> > that is, bits stolen from the GPA to act as new virtualized attribute
> > bits. A bit in the mask will cause the MMU code to create aliases of the
> > GPA. It can also be used to find the GFN out of a GPA coming from a tdp
> > fault.
> > 
> > To handle case (1) from above, retain any stolen bits when passing a GPA
> > in KVM's MMU code, but strip them when converting to a GFN so that the
> > GFN contains only the "real" GFN, i.e. never has repurposed bits set.
> > 
> > GFNs (without stolen bits) continue to be used to:
> >   - Specify physical memory by userspace via memslots
> >   - Map GPAs to TDP PTEs via RMAP
> >   - Specify dirty tracking and write protection
> >   - Look up MTRR types
> >   - Inject async page faults
> > 
> > Since there are now multiple aliases for the same aliased GPA, when
> > userspace memory backing the memslots is paged out, both aliases need to be
> > modified. Fortunately, this happens automatically. Since rmap supports
> > multiple mappings for the same GFN for PTE shadowing based paging, by
> > adding/removing each alias PTE with its GFN, kvm_handle_hva() based
> > operations will be applied to both aliases.
> > 
> > In the case of the rmap being removed in the future, the needed
> > information could be recovered by iterating over the stolen bits and
> > walking the TDP page tables.
> > 
> > For TLB flushes that are address based, make sure to flush both aliases
> > in the case of stolen bits.
> > 
> > Only support stolen bits in 64 bit guest paging modes (long, PAE).
> > Features that use this infrastructure should restrict the stolen bits to
> > exclude the other paging modes. Don't support stolen bits for shadow EPT.
> > 
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 ++
> >  arch/x86/kvm/mmu.h              | 51 +++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/mmu/mmu.c          | 19 ++++++++++--
> >  arch/x86/kvm/mmu/paging_tmpl.h  | 25 +++++++++-------
> >  4 files changed, 84 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 208b29b0e637..d8b78d6abc10 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1235,7 +1235,9 @@ struct kvm_arch {
> >  	spinlock_t hv_root_tdp_lock;
> >  #endif
> >  
> > +#ifdef CONFIG_KVM_MMU_PRIVATE
> >  	gfn_t gfn_shared_mask;
> > +#endif
> >  };
> >  
> >  struct kvm_vm_stat {
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index e9fbb2c8bbe2..3fb530359f81 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -365,4 +365,55 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
> >  		return gpa;
> >  	return translate_nested_gpa(vcpu, gpa, access, exception);
> >  }
> > +
> > +static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
> > +{
> > +#ifdef CONFIG_KVM_MMU_PRIVATE
> > +	return kvm->arch.gfn_shared_mask;
> > +#else
> > +	return 0;
> > +#endif
> > +}
> > +
> > +static inline gpa_t kvm_gpa_stolen_mask(struct kvm *kvm)
> > +{
> > +	return gfn_to_gpa(kvm_gfn_stolen_mask(kvm));
> > +}
> > +
> > +static inline gpa_t kvm_gpa_unalias(struct kvm *kvm, gpa_t gpa)
> > +{
> > +	return gpa & ~kvm_gpa_stolen_mask(kvm);
> > +}
> > +
> > +static inline gfn_t kvm_gfn_unalias(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	return gfn & ~kvm_gfn_stolen_mask(kvm);
> > +}
> > +
> > +static inline gfn_t kvm_gfn_shared(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	return gfn | kvm_gfn_stolen_mask(kvm);
> > +}
> > +
> > +static inline gfn_t kvm_gfn_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	return gfn & ~kvm_gfn_stolen_mask(kvm);
> > +}
> > +
> > +static inline gpa_t kvm_gpa_private(struct kvm *kvm, gpa_t gpa)
> > +{
> > +	return gpa & ~kvm_gpa_stolen_mask(kvm);
> > +}
> > +
> > +static inline bool kvm_is_private_gfn(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	gfn_t mask = kvm_gfn_stolen_mask(kvm);
> > +
> > +	return mask && !(gfn & mask);
> > +}
> > +
> > +static inline bool kvm_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> > +{
> > +	return kvm_is_private_gfn(kvm, gpa_to_gfn(gpa));
> > +}
> 
> The patch title and commit message say nothing about private/shared, but only
> mention stolen bits in general.  It's weird to introduce those *private* related
> helpers here.
> 
> I think you can just ditch the concept of stolen bit infrastructure, but just
> adopt what TDX needs.

Sure, this patch has changed heavily from the original patch by now.  One
suggestion is that private/shared is a characteristic of the KVM page fault,
not of the gpa/gfn.  It's TDX specific.

- Add a helper function to check if KVM MMU is TD or VM. Right now
  kvm_gfn_stolen_mask() is used.  Probably kvm_mmu_has_private_bit().
  (any better name?)
- Let's keep address conversion functions: address => unalias/shared/private
- Add struct kvm_page_fault.is_private, then see how far
  kvm_is_private_{gpa, gfn}() can be removed (or reduced).
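
To make the three items above concrete, here is a rough standalone sketch
(not the actual patch).  The *_sketch types and fault_init_sketch() are
hypothetical stand-ins for struct kvm and struct kvm_page_fault, the 4K page
shift is hard-coded, and kvm_mmu_has_private_bit() is only the tentative name
from above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;
typedef uint64_t gpa_t;

/* Hypothetical stand-in for the per-VM state; only the mask matters here. */
struct vm_sketch {
	gfn_t gfn_shared_mask;	/* 0 for a normal VM, the shared bit for a TD */
};

/* Hypothetical stand-in for struct kvm_page_fault with the proposed field. */
struct page_fault_sketch {
	gpa_t addr;		/* faulting GPA, shared bit retained */
	gfn_t gfn;		/* GFN with the shared bit stripped */
	bool is_private;	/* decided once, at fault time */
};

/* Tentative helper: is this MMU a TD (has a shared bit) or a normal VM? */
static inline bool kvm_mmu_has_private_bit(const struct vm_sketch *vm)
{
	return vm->gfn_shared_mask != 0;
}

/* Address conversion helpers that would be kept. */
static inline gfn_t gfn_unalias_sketch(const struct vm_sketch *vm, gfn_t gfn)
{
	return gfn & ~vm->gfn_shared_mask;
}

static inline gfn_t gfn_shared_sketch(const struct vm_sketch *vm, gfn_t gfn)
{
	return gfn | vm->gfn_shared_mask;
}

static inline gfn_t gfn_private_sketch(const struct vm_sketch *vm, gfn_t gfn)
{
	return gfn & ~vm->gfn_shared_mask;
}

/* Record private/shared in the fault itself instead of re-deriving it. */
static inline void fault_init_sketch(const struct vm_sketch *vm,
				     struct page_fault_sketch *fault,
				     gpa_t gpa)
{
	gfn_t raw_gfn = gpa >> 12;	/* assumes 4K pages */

	fault->addr = gpa;
	fault->is_private = kvm_mmu_has_private_bit(vm) &&
			    !(raw_gfn & vm->gfn_shared_mask);
	fault->gfn = gfn_unalias_sketch(vm, raw_gfn);
}
```

With that, callers would test fault->is_private rather than calling
kvm_is_private_{gpa, gfn}() on an address.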

> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 8e24f73bf60b..b68191aa39bf 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -276,11 +276,24 @@ static inline bool kvm_available_flush_tlb_with_range(void)
> >  static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
> >  		struct kvm_tlb_range *range)
> >  {
> > -	int ret = -ENOTSUPP;
> > +	int ret = -EOPNOTSUPP;
> 
> Change doesn't belong to this patch.

Will fix it.

> > +	u64 gfn_stolen_mask;
> >  
> > -	if (range && kvm_x86_ops.tlb_remote_flush_with_range)
> > +	/*
> > +	 * Fall back to the big hammer flush if there is more than one
> > +	 * GPA alias that needs to be flushed.
> > +	 */
> > +	gfn_stolen_mask = kvm_gfn_stolen_mask(kvm);
> > +	if (hweight64(gfn_stolen_mask) > 1)
> > +		goto generic_flush;
> > +
> > +	if (range && kvm_available_flush_tlb_with_range()) {
> > +		/* Callback should flush both private GFN and shared GFN. */
> > +		range->start_gfn = kvm_gfn_unalias(kvm, range->start_gfn);
> 
> This seems wrong.  It seems the intention of this function is to flush TLB for
> all aliases for a given GFN range.  Here it seems you are unconditionally change
> to range to always exclude the stolen bits.

Ooh, right. This alias knowledge belongs to TDX.  The unalias should be
dropped here and moved into tdx.c.  I'll fix it.

> >  		ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
> > +	}
> 
> And you always fall through to do big hammer flush, which is obviously not
> intended.

Please note the "if (ret)" check.  If the ranged flush succeeded, the big
hammer flush is skipped.
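
The control flow can be modelled in isolation as below.  This is only a
sketch: flush_with_range_sketch(), its parameters, and the full_flushes
counter are made up for illustration, and ranged_ret stands in for whatever
the tlb_remote_flush_with_range callback returns.

```c
#include <errno.h>
#include <stdbool.h>

/* Counts how often the big hammer flush runs (illustration only). */
static int full_flushes;

static void flush_all_sketch(void)
{
	full_flushes++;
}

/*
 * multi_alias:   more than one stolen bit, i.e. hweight64(mask) > 1
 * have_range_cb: a range was given and the ranged callback is available
 * ranged_ret:    value the ranged callback would return (0 on success)
 */
static void flush_with_range_sketch(bool multi_alias, bool have_range_cb,
				    int ranged_ret)
{
	int ret = -EOPNOTSUPP;

	if (multi_alias)
		goto generic_flush;

	if (have_range_cb)
		ret = ranged_ret;

generic_flush:
	/* Reached on every path, but the flush only runs when ret != 0. */
	if (ret)
		flush_all_sketch();
}
```

The label is reached on every path, but the full flush only runs when ret is
still non-zero: no callback, a failed callback, or the multi-alias case.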

> > +generic_flush:
> >  	if (ret)
> >  		kvm_flush_remote_tlbs(kvm);
> >  }
> > @@ -4010,7 +4023,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  	unsigned long mmu_seq;
> >  	int r;
> >  
> > -	fault->gfn = fault->addr >> PAGE_SHIFT;
> > +	fault->gfn = kvm_gfn_unalias(vcpu->kvm, gpa_to_gfn(fault->addr));
> >  	fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
> >  
> >  	if (page_fault_handle_page_track(vcpu, fault))
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index 5b5bdac97c7b..70aec31dee06 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -25,7 +25,8 @@
> >  	#define guest_walker guest_walker64
> >  	#define FNAME(name) paging##64_##name
> >  	#define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
> > -	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> > +	#define PT_LVL_ADDR_MASK(vcpu, lvl) (~kvm_gpa_stolen_mask(vcpu->kvm) & \
> > +					     PT64_LVL_ADDR_MASK(lvl))
> >  	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> >  	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> >  	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> > @@ -44,7 +45,7 @@
> >  	#define guest_walker guest_walker32
> >  	#define FNAME(name) paging##32_##name
> >  	#define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
> > -	#define PT_LVL_ADDR_MASK(lvl) PT32_LVL_ADDR_MASK(lvl)
> > +	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT32_LVL_ADDR_MASK(lvl)
> >  	#define PT_LVL_OFFSET_MASK(lvl) PT32_LVL_OFFSET_MASK(lvl)
> >  	#define PT_INDEX(addr, level) PT32_INDEX(addr, level)
> >  	#define PT_LEVEL_BITS PT32_LEVEL_BITS
> > @@ -58,7 +59,7 @@
> >  	#define guest_walker guest_walkerEPT
> >  	#define FNAME(name) ept_##name
> >  	#define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
> > -	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> > +	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT64_LVL_ADDR_MASK(lvl)
> >  	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> >  	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> >  	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> > @@ -75,7 +76,7 @@
> >  #define PT_GUEST_ACCESSED_MASK (1 << PT_GUEST_ACCESSED_SHIFT)
> >  
> >  #define gpte_to_gfn_lvl FNAME(gpte_to_gfn_lvl)
> > -#define gpte_to_gfn(pte) gpte_to_gfn_lvl((pte), PG_LEVEL_4K)
> > +#define gpte_to_gfn(vcpu, pte) gpte_to_gfn_lvl(vcpu, pte, PG_LEVEL_4K)
> >  
> >  /*
> >   * The guest_walker structure emulates the behavior of the hardware page
> > @@ -96,9 +97,9 @@ struct guest_walker {
> >  	struct x86_exception fault;
> >  };
> >  
> > -static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
> > +static gfn_t gpte_to_gfn_lvl(struct kvm_vcpu *vcpu, pt_element_t gpte, int lvl)
> >  {
> > -	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
> > +	return (gpte & PT_LVL_ADDR_MASK(vcpu, lvl)) >> PAGE_SHIFT;
> >  }
> >  
> >  static inline void FNAME(protect_clean_gpte)(struct kvm_mmu *mmu, unsigned *access,
> > @@ -395,7 +396,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
> >  		--walker->level;
> >  
> >  		index = PT_INDEX(addr, walker->level);
> > -		table_gfn = gpte_to_gfn(pte);
> > +		table_gfn = gpte_to_gfn(vcpu, pte);
> >  		offset    = index * sizeof(pt_element_t);
> >  		pte_gpa   = gfn_to_gpa(table_gfn) + offset;
> >  
> > @@ -460,7 +461,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
> >  	if (unlikely(errcode))
> >  		goto error;
> >  
> > -	gfn = gpte_to_gfn_lvl(pte, walker->level);
> > +	gfn = gpte_to_gfn_lvl(vcpu, pte, walker->level);
> >  	gfn += (addr & PT_LVL_OFFSET_MASK(walker->level)) >> PAGE_SHIFT;
> >  
> >  	if (PTTYPE == 32 && walker->level > PG_LEVEL_4K && is_cpuid_PSE36())
> > @@ -555,12 +556,14 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >  	gfn_t gfn;
> >  	kvm_pfn_t pfn;
> >  
> > +	WARN_ON(gpte & kvm_gpa_stolen_mask(vcpu->kvm));
> > +
> >  	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
> >  		return false;
> >  
> >  	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
> >  
> > -	gfn = gpte_to_gfn(gpte);
> > +	gfn = gpte_to_gfn(vcpu, gpte);
> >  	pte_access = sp->role.access & FNAME(gpte_access)(gpte);
> >  	FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
> >  
> > @@ -656,6 +659,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  	WARN_ON_ONCE(gw->gfn != base_gfn);
> >  	direct_access = gw->pte_access;
> >  
> > +	WARN_ON(fault->addr & kvm_gpa_stolen_mask(vcpu->kvm));
> > +
> >  	top_level = vcpu->arch.mmu->root_level;
> >  	if (top_level == PT32E_ROOT_LEVEL)
> >  		top_level = PT32_ROOT_LEVEL;
> > @@ -1080,7 +1085,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> >  			continue;
> >  		}
> >  
> > -		gfn = gpte_to_gfn(gpte);
> > +		gfn = gpte_to_gfn(vcpu, gpte);
> >  		pte_access = sp->role.access;
> >  		pte_access &= FNAME(gpte_access)(gpte);
> >  		FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
> 
> In commit message you mentioned "Don't support stolen bits for shadow EPT" (you
> actually mean shadow MMU I suppose), yet there's bunch of code change to shadow
> MMU.

Right, those changes are not needed. I'll drop them.
-- 
Isaku Yamahata <isaku.yamahata@xxxxxxxxx>