Re: [PATCH v10 047/108] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

"Huang, Kai" <kai.huang@xxxxxxxxx> · Wed, 14 Dec 2022 11:17:32 +0000

On Sat, 2022-10-29 at 23:22 -0700, isaku.yamahata@xxxxxxxxx wrote:
> From: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> 
> TDX supports only write-back(WB) memory type for private memory
> architecturally so that (virtualized) memory type change doesn't make sense
> for private memory.  Also currently, page migration isn't supported for TDX
> yet. (TDX architecturally supports page migration. it's KVM and kernel
> implementation issue.)
> 
> Regarding memory type change (mtrr virtualization and lapic page mapping
> change), pages are zapped by kvm_zap_gfn_range().  On the next KVM page
> fault, the SPTE entry with a new memory type for the page is populated.
> Regarding page migration, pages are zapped by the mmu notifier. On the next
> KVM page fault, the new migrated page is populated.  Don't zap private
> pages on unmapping for those two cases.
> 
> When deleting/moving a KVM memory slot, zap private pages. Typically
> tearing down VM.  Don't invalidate private page tables. i.e. zap only leaf
> SPTEs for KVM mmu that has a shared bit mask. The existing
> kvm_tdp_mmu_invalidate_all_roots() depends on role.invalid with read-lock
> of mmu_lock so that other vcpu can operate on KVM mmu concurrently.  It
> marks the root page table invalid and zaps SPTEs of the root page
> tables. The TDX module doesn't allow to unlink a protected root page table
> from the hardware and then allocate a new one for it. i.e. replacing a
> protected root page table.  Instead, zap only leaf SPTEs for KVM mmu with a
> shared bit mask set.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> ---
>  arch/x86/kvm/mmu/mmu.c     | 85 ++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu/tdp_mmu.c | 24 ++++++++---
>  arch/x86/kvm/mmu/tdp_mmu.h |  5 ++-
>  3 files changed, 103 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index faf69774c7ce..0237e143299c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1577,8 +1577,38 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  	if (kvm_memslots_have_rmaps(kvm))
>  		flush = kvm_handle_gfn_range(kvm, range, kvm_zap_rmap);
>  
> -	if (is_tdp_mmu_enabled(kvm))
> -		flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush);
> +	if (is_tdp_mmu_enabled(kvm)) {
> +		bool zap_private;
> +
> +		if (kvm_slot_can_be_private(range->slot)) {
> +			if (range->flags & KVM_GFN_RANGE_FLAGS_RESTRICTED_MEM)
> +				/*
> +				 * For private slot, the callback is triggered
> +				 * via falloc.  Mode can be allocation or punch
				       ^
				       fallocate(), please?

> +				 * hole.  Because the private-shared conversion
> +				 * is done via
> +				 * KVM_MEMORY_ENCRYPT_REG/UNREG_REGION, we can
> +				 * ignore the request from restrictedmem.
> +				 */
> +				return flush;

Sorry, why does "private-shared conversion is done via KVM_MEMORY_ENCRYPT_REG"
result in "we can ignore the request from restrictedmem"?

If we punch a hole, the pages are de-allocated, correct?

> +			else if (range->flags & KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR) {
> +				if (range->attr == KVM_MEM_ATTR_SHARED)
> +					zap_private = true;
> +				else {
> +					WARN_ON_ONCE(range->attr != KVM_MEM_ATTR_PRIVATE);
> +					zap_private = false;
> +				}
> +			} else
> +				/*
> +				 * kvm_unmap_gfn_range() is called via mmu
> +				 * notifier.  For now page migration for private
> +				 * page isn't supported yet, don't zap private
> +				 * pages.
> +				 */
> +				zap_private = false;

Page migration is not the only reason KVM receives an MMU notifier callback.
Just say something like "for now all private pages are pinned during the VM's
lifetime".
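
To check that I read the branches correctly, here is a minimal standalone
userspace model of the decision this hunk makes.  It is only a sketch: the enum
and function names are hypothetical stand-ins for the range->flags and
range->attr checks, and it assumes that zap_private == false still zaps the
shared mappings.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the range->flags cases in the hunk. */
enum range_source {
	FROM_MMU_NOTIFIER,	/* neither flag set */
	FROM_RESTRICTEDMEM,	/* KVM_GFN_RANGE_FLAGS_RESTRICTED_MEM */
	FROM_SET_MEM_ATTR,	/* KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR */
};

/* Hypothetical stand-ins for KVM_MEM_ATTR_SHARED / KVM_MEM_ATTR_PRIVATE. */
enum mem_attr { ATTR_SHARED, ATTR_PRIVATE };

enum unmap_action {
	SKIP_TDP_UNMAP,		/* early "return flush" */
	ZAP_SHARED_ONLY,	/* zap_private = false (assumed meaning) */
	ZAP_SHARED_AND_PRIVATE,	/* zap_private = true */
};

static enum unmap_action unmap_decision(bool slot_can_be_private,
					enum range_source src,
					enum mem_attr attr)
{
	/*
	 * Assumption: as far as I can tell the hunk does not assign
	 * zap_private on this path; the model treats it as false.
	 */
	if (!slot_can_be_private)
		return ZAP_SHARED_ONLY;

	switch (src) {
	case FROM_RESTRICTEDMEM:
		/* fallocate() allocation or hole punch: request is ignored. */
		return SKIP_TDP_UNMAP;
	case FROM_SET_MEM_ATTR:
		/* Converting to shared drops the private mapping. */
		return attr == ATTR_SHARED ? ZAP_SHARED_AND_PRIVATE
					   : ZAP_SHARED_ONLY;
	case FROM_MMU_NOTIFIER:
	default:
		/* Private pages stay mapped on plain MMU notifier events. */
		return ZAP_SHARED_ONLY;
	}
}

/* Usage example: one check per branch of the hunk. */
static void check_unmap_decisions(void)
{
	assert(unmap_decision(true, FROM_RESTRICTEDMEM, ATTR_PRIVATE) == SKIP_TDP_UNMAP);
	assert(unmap_decision(true, FROM_SET_MEM_ATTR, ATTR_SHARED) == ZAP_SHARED_AND_PRIVATE);
	assert(unmap_decision(true, FROM_SET_MEM_ATTR, ATTR_PRIVATE) == ZAP_SHARED_ONLY);
	assert(unmap_decision(true, FROM_MMU_NOTIFIER, ATTR_PRIVATE) == ZAP_SHARED_ONLY);
}
```

If the model matches the intent, the only path that zaps private pages from
kvm_unmap_gfn_range() is a conversion to shared.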

> +		}
> +		flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush, zap_private);
> +	}
>  
>  	return flush;
>  }
> @@ -6066,11 +6096,48 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
>  	return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
>  }
>  
> +static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> +{
> +	bool flush = false;
> +
> +	write_lock(&kvm->mmu_lock);
> +
> +	/*
> +	 * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> +	 * case scenario we'll have unused shadow pages lying around until they
> +	 * are recycled due to age or when the VM is destroyed.
> +	 */
> +	if (is_tdp_mmu_enabled(kvm)) {
> +		struct kvm_gfn_range range = {
> +		      .slot = slot,
> +		      .start = slot->base_gfn,
> +		      .end = slot->base_gfn + slot->npages,
> +		      .may_block = false,
> +		};
> +
> +		/*
> +		 * this handles both private gfn and shared gfn.
> +		 * All private page should be zapped on memslot deletion.
> +		 */
> +		flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush, true);
> +	} else {
> +		flush = slot_handle_level(kvm, slot, __kvm_zap_rmap, PG_LEVEL_4K,
> +					  KVM_MAX_HUGEPAGE_LEVEL, true);
> +	}
> +	if (flush)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	write_unlock(&kvm->mmu_lock);
> +}
> +
>  static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
>  			struct kvm_memory_slot *slot,
>  			struct kvm_page_track_notifier_node *node)
>  {
> -	kvm_mmu_zap_all_fast(kvm);
> +	if (kvm_gfn_shared_mask(kvm))
> +		kvm_mmu_zap_memslot(kvm, slot);
> +	else
> +		kvm_mmu_zap_all_fast(kvm);
>  }

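A comment would be nice here, explaining why a VM with a shared bit mask zaps
only the memslot instead of calling kvm_mmu_zap_all_fast().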

>  
>  int kvm_mmu_init_vm(struct kvm *kvm)
> @@ -6173,8 +6240,18 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  
>  	if (is_tdp_mmu_enabled(kvm)) {
>  		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> +			/*
> +			 * zap_private = true. Zap both private/shared pages.
> +			 *
> +			 * kvm_zap_gfn_range() is used when PAT memory type was

Is it PAT or MTRR, or both (in which case just say "memory type")?

> +			 * changed.  Later on the next kvm page fault, populate
> +			 * it with updated spte entry.
> +			 * Because only WB is supported for private pages, don't
> +			 * care of private pages.
> +			 */

Then why bother zapping private pages?  If I read it correctly, the changelog
says not to zap private pages in this case.

>  			flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
> -						      gfn_end, true, flush);
> +						      gfn_end, true, flush,
> +						      true);
>  	}
>  

Btw, as you mentioned in the changelog, private memory always has the WB memory
type, so its memory type cannot be virtualized.  Would it be better to modify
update_mtrr() to just return early if the gfn range is purely private?
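
Roughly what I have in mind, as a standalone userspace sketch rather than real
KVM code.  struct attr_table, gfn_range_is_purely_private() and
model_update_mtrr() are hypothetical names invented for illustration; how KVM
would actually look up per-gfn attributes is not decided here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical per-gfn attribute table; is_private[i] covers gfn base + i. */
struct attr_table {
	gfn_t base;
	size_t npages;
	const bool *is_private;
};

/* True only if every gfn in [start, end) is known to be private. */
static bool gfn_range_is_purely_private(const struct attr_table *t,
					gfn_t start, gfn_t end)
{
	for (gfn_t gfn = start; gfn < end; gfn++) {
		/* A gfn outside the table is unknown: be conservative. */
		if (gfn < t->base || gfn - t->base >= t->npages)
			return false;
		if (!t->is_private[gfn - t->base])
			return false;
	}
	return true;
}

/*
 * Model of the suggested early return: reports whether a memory type
 * change over [start, end) would still need to zap anything.
 */
static bool model_update_mtrr(const struct attr_table *t,
			      gfn_t start, gfn_t end)
{
	/* Private memory is always WB, so there is nothing to re-fault. */
	if (gfn_range_is_purely_private(t, start, end))
		return false;

	return true;	/* the real code would call kvm_zap_gfn_range() here */
}
```

With this shape a mixed range still falls through to the zap, so only the
purely private case is short-circuited.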

IMHO the handling of MTRR/PAT virtualization for TDX guests deserves dedicated
patch(es) that put it all together so it's easier to review.  Right now the
relevant parts are spread across multiple independent patches (MSR handling,
vt_get_mt_mask(), etc).