On Thu, Apr 04, 2024 at 02:50:31PM -0400, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> During guest run-time, kvm_arch_gmem_prepare() is issued as needed to
> prepare newly-allocated gmem pages prior to mapping them into the guest.
> In the case of SEV-SNP, this mainly involves setting the pages to
> private in the RMP table.
>
> However, for the GPA ranges comprising the initial guest payload, which
> are encrypted/measured prior to starting the guest, the gmem pages need
> to be accessed prior to setting them to private in the RMP table so they
> can be initialized with the userspace-provided data. Additionally, an
> SNP firmware call is needed afterward to encrypt them in-place and
> measure the contents into the guest's launch digest.
>
> While it is possible to bypass the kvm_arch_gmem_prepare() hooks so that
> this handling can be done in an open-coded/vendor-specific manner, this
> may expose more gmem-internal state/dependencies to external callers
> than necessary. Try to avoid this by implementing an interface that
> tries to handle as much of the common functionality inside gmem as
> possible, while also making it generic enough to potentially be
> usable/extensible for TDX as well.

I explored how TDX would use this hook. The outcome is that TDX does not
use it, and instead uses kvm_tdp_mmu_get_walk() with a twist; the patch
is below.

Because SEV-SNP manages the RMP, which is not tied directly to the NPT,
SEV-SNP can ignore the TDP MMU page tables when updating the RMP. TDX,
on the other hand, essentially updates the Secure EPT when it adds a
page to the guest with TDH.MEM.PAGE.ADD(), so it needs to protect the
KVM TDP MMU page tables with mmu_lock, not the guest_memfd file mapping
with its invalidate_lock. The hook therefore doesn't apply well to TDX.

The resulting KVM_TDX_INIT_MEM_REGION logic is as follows (a rough
caller-side sketch is appended after the patch):

  get_user_pages_fast(source addr)
  read_lock(mmu_lock)
  kvm_tdp_mmu_get_walk_private_pfn(vcpu, gpa, &pfn);
  if the page table doesn't map gpa, error.
  TDH.MEM.PAGE.ADD()
  TDH.MR.EXTEND()
  read_unlock(mmu_lock)
  put_page()

From 7d4024049b51969a2431805c2117992fc7ec0981 Mon Sep 17 00:00:00 2001
Message-ID: <7d4024049b51969a2431805c2117992fc7ec0981.1713913379.git.isaku.yamahata@xxxxxxxxx>
In-Reply-To: <cover.1713913379.git.isaku.yamahata@xxxxxxxxx>
References: <cover.1713913379.git.isaku.yamahata@xxxxxxxxx>
From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
Date: Tue, 23 Apr 2024 11:33:44 -0700
Subject: [PATCH] KVM: x86/tdp_mmu: Add a helper function to walk down the
 TDP MMU

KVM_TDX_INIT_MEM_REGION needs to check if the given GFN is already
populated. Add wrapping logic to kvm_tdp_mmu_get_walk() to export it.

Alternatives are as follows; the approach of this patch was chosen as
the least intrusive change.
- Refactor the KVM page fault handler into a populating part and an
  unlock function: the page fault handler populates the mapping while
  keeping mmu_lock held, TDH.MEM.PAGE.ADD() is issued, then the lock is
  released.
- Add a callback function to struct kvm_page_fault and call it after the
  page fault handler, before unlocking mmu_lock and releasing the PFN.
Based on the feedback of
https://lore.kernel.org/kvm/ZfBkle1eZFfjPI8l@xxxxxxxxxx/
https://lore.kernel.org/kvm/Zh8DHbb8FzoVErgX@xxxxxxxxxx/

Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
---
 arch/x86/kvm/mmu.h         |  3 +++
 arch/x86/kvm/mmu/tdp_mmu.c | 44 ++++++++++++++++++++++++++++++++------
 2 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 712e9408f634..4f61f4b9fd64 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -287,6 +287,9 @@ extern bool tdp_mmu_enabled;
 #define tdp_mmu_enabled false
 #endif
 
+int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa,
+                                     kvm_pfn_t *pfn);
+
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
         return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3592ae4e485f..bafcd8aeb3b3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -2035,14 +2035,25 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
  *
  * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
  */
-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
-                         int *root_level)
+static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+                                  bool is_private)
 {
         struct tdp_iter iter;
         struct kvm_mmu *mmu = vcpu->arch.mmu;
         gfn_t gfn = addr >> PAGE_SHIFT;
         int leaf = -1;
 
+        tdp_mmu_for_each_pte(iter, mmu, is_private, gfn, gfn + 1) {
+                leaf = iter.level;
+                sptes[leaf] = iter.old_spte;
+        }
+
+        return leaf;
+}
+
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+                         int *root_level)
+{
         *root_level = vcpu->arch.mmu->root_role.level;
 
         /*
@@ -2050,15 +2061,34 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
          * instructions in protected guest memory can't be parsed by VMM.
          */
         if (WARN_ON_ONCE(kvm_gfn_shared_mask(vcpu->kvm)))
-                return leaf;
+                return -1;
 
-        tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
-                leaf = iter.level;
-                sptes[leaf] = iter.old_spte;
+        return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, false);
+}
+
+int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa,
+                                     kvm_pfn_t *pfn)
+{
+        u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
+        int leaf;
+
+        lockdep_assert_held(&vcpu->kvm->mmu_lock);
+
+        kvm_tdp_mmu_walk_lockless_begin();
+        leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, true);
+        kvm_tdp_mmu_walk_lockless_end();
+        if (leaf < 0)
+                return -ENOENT;
+
+        spte = sptes[leaf];
+        if (is_shadow_present_pte(spte) && is_last_spte(spte, leaf)) {
+                *pfn = spte_to_pfn(spte);
+                return leaf;
         }
 
-        return leaf;
+        return -ENOENT;
 }
+EXPORT_SYMBOL_GPL(kvm_tdp_mmu_get_walk_private_pfn);
 
 /*
  * Returns the last level spte pointer of the shadow page walk for the given
-- 
2.43.2

-- 
Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
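
For illustration only, here is a minimal caller-side sketch of the
KVM_TDX_INIT_MEM_REGION flow outlined above, using the new helper. The
names tdx_init_mem_region(), tdh_mem_page_add() and tdh_mr_extend() are
placeholders assumed for this sketch, not code from the actual TDX
series, and error handling is reduced to the essentials.

/*
 * Hypothetical sketch only: how the KVM_TDX_INIT_MEM_REGION flow above
 * might call kvm_tdp_mmu_get_walk_private_pfn().
 */
#include <linux/kvm_host.h>
#include <linux/mm.h>

/* Placeholder prototypes; the real SEAMCALL wrappers look different. */
int tdh_mem_page_add(struct kvm *kvm, gpa_t gpa, kvm_pfn_t pfn, kvm_pfn_t src_pfn);
int tdh_mr_extend(struct kvm *kvm, gpa_t gpa);

static int tdx_init_mem_region(struct kvm_vcpu *vcpu, gpa_t gpa,
                               void __user *source, bool extend)
{
        struct kvm *kvm = vcpu->kvm;
        struct page *src_page;
        kvm_pfn_t pfn;
        int ret, level;

        /* Pin the userspace page that holds the initial contents. */
        ret = get_user_pages_fast((unsigned long)source, 1, 0, &src_page);
        if (ret != 1)
                return ret < 0 ? ret : -ENOMEM;

        /* mmu_lock is an rwlock on x86; the walk only needs it for read. */
        read_lock(&kvm->mmu_lock);

        /* Error out if the TDP MMU doesn't already map this private GPA. */
        level = kvm_tdp_mmu_get_walk_private_pfn(vcpu, gpa, &pfn);
        if (level < 0) {
                ret = level;
                goto out;
        }

        /* Add the page to the TD and, if requested, extend the measurement. */
        ret = tdh_mem_page_add(kvm, gpa, pfn, page_to_pfn(src_page));
        if (!ret && extend)
                ret = tdh_mr_extend(kvm, gpa);

out:
        read_unlock(&kvm->mmu_lock);
        put_page(src_page);
        return ret;
}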