On 2/26/2024 4:26 PM, isaku.yamahata@xxxxxxxxx wrote:
From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
Implement the TDP MMU hooks for the TDX backend: TLB flush, TLB shootdown,
propagating private EPT entry changes to the Secure EPT, and freeing Secure
EPT pages. The TLB flush handles both shared EPT and private EPT: it flushes
the shared EPT the same way VMX does, and additionally waits for the TDX TLB
shootdown. The hook that frees a Secure EPT page unlinks the page from the
Secure EPT so that the page can be returned to the OS.
Propagate entry changes to the Secure EPT. The possible entry changes are
present -> non-present (zapping) and non-present -> present (population). On
population, just link the Secure EPT page or the private guest page into the
Secure EPT with a TDX SEAMCALL. Because the TDP MMU allows concurrent
zapping/population, zapping requires a synchronous TLB shootdown with the
frozen EPT entry: zap the Secure EPT entry, increment the TLB counter, send
IPIs to remote vcpus to trigger a TLB flush, and then unlink the private
guest page from the Secure EPT, as sketched below. For simplicity, batched
zapping under the exclusive (write) mmu_lock is handled as concurrent
zapping. Although inefficient, it can be optimized in the future.
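
To double-check the ordering described above, here is a minimal sketch of a
caller. tdx_sept_zap_private_spte() and tdx_track() are from this patch;
tdx_sept_drop_private_spte() is a hypothetical name for the final unlink
step, and the retry policy for -EAGAIN is my assumption, not the patch's
code:

static int zap_private_page(struct kvm *kvm, gfn_t gfn, enum pg_level level)
{
	int ret;

	/* 1. Block new TLB translations for the range (the entry is frozen). */
	do {
		ret = tdx_sept_zap_private_spte(kvm, gfn, level);
	} while (ret == -EAGAIN);
	if (ret)
		return ret;

	/* 2. Increment the TLB counter, send IPIs, issue TDH.MEM.TRACK(). */
	tdx_track(kvm);

	/* 3. No stale TLB entry can remain; unlink the private guest page. */
	return tdx_sept_drop_private_spte(kvm, gfn, level);
}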
For the MMIO SPTE, the SPTE value changes as follows.
initial value (suppress VE bit is set)
-> Guest issues MMIO and triggers EPT violation
-> KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
-> Guest retries the MMIO access; it triggers a #VE exception in the guest TD
-> Guest VE handler issues TDG.VP.VMCALL<MMIO>
-> KVM handles MMIO
-> Guest VE handler resumes its execution after MMIO instruction
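
For reference, a minimal sketch of the suppress-VE bit transition above.
VMX_EPT_SUPPRESS_VE_BIT is bit 63 of an EPT entry; the helper names and the
bare mmio_value parameter are illustrative, not the patch's code:

#define MMIO_SUPPRESS_VE	BIT_ULL(63)	/* VMX_EPT_SUPPRESS_VE_BIT */

/*
 * Initial non-present SPTE: suppress-VE is set, so a guest access causes
 * an EPT violation to KVM instead of a #VE inside the guest TD.
 */
static u64 mmio_spte_initial(void)
{
	return MMIO_SUPPRESS_VE;
}

/*
 * Once KVM installs the MMIO SPTE, suppress-VE is cleared, so the next
 * guest access raises #VE and the guest's #VE handler issues
 * TDG.VP.VMCALL<MMIO>.
 */
static u64 mmio_spte_installed(u64 mmio_value)
{
	return mmio_value & ~MMIO_SUPPRESS_VE;
}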
Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
---
v19:
- Compile fix when CONFIG_HYPERV != y.
It's needed to catch up with the following patch.
https://lore.kernel.org/all/20231018192325.1893896-1-seanjc@xxxxxxxxxx/
- Add comments on TLB shootdown to explain the sequence.
- Use gmem_max_level callback, delete tdp_max_page_level.
v18:
- rename tdx_sept_page_aug() -> tdx_mem_page_aug()
- checkpatch: space => tab
v15 -> v16:
- Add the handling of TD_ATTR_SEPT_VE_DISABLE case.
v14 -> v15:
- Implemented tdx_flush_tlb_current()
- Removed unnecessary invept in tdx_flush_tlb(). It was carried over
from the very old code base.
Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
---
arch/x86/kvm/mmu/spte.c | 3 +-
arch/x86/kvm/vmx/main.c | 91 ++++++++-
arch/x86/kvm/vmx/tdx.c | 372 +++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 2 +-
arch/x86/kvm/vmx/tdx_ops.h | 6 +
arch/x86/kvm/vmx/x86_ops.h | 13 ++
6 files changed, 481 insertions(+), 6 deletions(-)
[...]
+static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
+ struct tdx_module_args out;
+ u64 err;
+
+ /* This can be called when destroying the guest TD after freeing the HKID. */
+ if (unlikely(!is_hkid_assigned(kvm_tdx)))
+ return 0;
+
+ /* Large pages aren't supported yet. */
+ WARN_ON_ONCE(level != PG_LEVEL_4K);
+ err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
+ if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+ return -EAGAIN;
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &out);
+ return -EIO;
+ }
+ return 0;
+}
+
+/*
+ * TLB shoot down procedure:
+ * There is a global epoch counter and each vcpu has a local epoch counter.
+ * - TDH.MEM.RANGE.BLOCK(TDR, level, range) on one vcpu
+ * This blocks the subsequent creation of TLB translations on that range.
+ * This corresponds to clearing the present bits (all RWX) in the EPT entry.
+ * - TDH.MEM.TRACK(TDR): advances the global epoch counter.
+ * - IPI to remote vcpus
+ * - TDExit and re-entry with TDH.VP.ENTER on remote vcpus
+ * - On re-entry, the TDX module compares the local epoch counter with the
+ * global epoch counter. If the local epoch counter is older than the global
+ * epoch counter, the TDX module updates the local epoch counter and flushes
+ * the TLB.
+ */
+static void tdx_track(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ u64 err;
+
+ KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm);
+ /* If the TD isn't finalized, no vcpu has run yet. */
+ if (unlikely(!is_td_finalized(kvm_tdx)))
+ return;
+
+ /*
+ * tdx_flush_tlb() waits, via the counter, for this function to issue
+ * TDH.MEM.TRACK(). A counter is used instead of a bool because multiple
+ * TDH.MEM.TRACK() calls can be issued concurrently by multiple vcpus.
In which case can TDH_MEM_TRACK() be issued concurrently by multiple vcpus?
For now, zapping holds the write lock.
Promotion/demotion may issue TDH_MEM_TRACK() concurrently, but they are
not supported yet.
+ *
+ * Optimization: the TLB shootdown procedure described in the TDX
+ * specification is: TDH.MEM.TRACK(), send IPIs to remote vcpus, confirm
+ * that all remote vcpus have exited to the VMM, and then resume vcpu
+ * execution, both local and remote. Twist the sequence as follows to
+ * reduce the IPI overhead.
+ *
+ * local remote
+ * ----- ------
+ * increment tdh_mem_track
+ *
+ * request KVM_REQ_TLB_FLUSH
+ * send IPI
+ *
+ * TDEXIT to KVM due to IPI
+ *
+ * IPI handler calls tdx_flush_tlb()
+ * to process KVM_REQ_TLB_FLUSH.
+ * spin wait for tdh_mem_track == 0
+ *
+ * TDH.MEM.TRACK()
+ *
+ * decrement tdh_mem_track
+ *
+ * complete KVM_REQ_TLB_FLUSH
+ *
+ * TDH.VP.ENTER to flush tlbs TDH.VP.ENTER to flush tlbs
+ */
+ atomic_inc(&kvm_tdx->tdh_mem_track);
+ /*
+ * KVM_REQ_TLB_FLUSH waits for the empty IPI handler, ack_flush(), with
+ * KVM_REQUEST_WAIT.
+ */
+ kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
+
+ do {
+ err = tdh_mem_track(kvm_tdx->tdr_pa);
+ } while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
+
+ /* Release remote vcpus waiting for TDH.MEM.TRACK() in tdx_flush_tlb(). */
+ atomic_dec(&kvm_tdx->tdh_mem_track);
+
+ if (KVM_BUG_ON(err, kvm))
+ pr_tdx_error(TDH_MEM_TRACK, err, NULL);
+}
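
The remote side of this handshake, tdx_flush_tlb(), isn't in this hunk. Based
on the comment above, it presumably spins on the counter roughly like this
sketch (the exact body is my assumption, not the patch's code):

static void tdx_flush_tlb_sketch(struct kvm_vcpu *vcpu)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);

	/*
	 * While processing KVM_REQ_TLB_FLUSH: wait until tdx_track() has
	 * issued TDH.MEM.TRACK(), so that the subsequent TDH.VP.ENTER
	 * observes the advanced epoch and flushes the TLB.
	 */
	while (atomic_read(&kvm_tdx->tdh_mem_track))
		cpu_relax();
}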
+
+static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, void *private_spt)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+ /*
+ * The HKID assigned to this TD was already freed and the cache was
+ * already flushed. We don't have to flush it again.
+ */
+ if (!is_hkid_assigned(kvm_tdx))
+ return tdx_reclaim_page(__pa(private_spt));
+
+ /*
+ * free_private_spt() is (obviously) called when a shadow page is being
+ * zapped. KVM doesn't (yet) zap private SPs while the TD is active.
+ * Note: this function is for private shadow pages, not for private
+ * guest pages. A private guest page can be zapped while the TD is
+ * active: shared <-> private conversion and slot move/deletion.
+ */
+ KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm);
At this point, is_hkid_assigned(kvm_tdx) is always true because of the early
return above, so this KVM_BUG_ON() always fires.
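
If zapping a private shadow page with the HKID still assigned really is
unsupported, maybe restructure it along these lines so the intent is explicit
(just a suggestion, untested):

	if (!is_hkid_assigned(kvm_tdx))
		return tdx_reclaim_page(__pa(private_spt));

	/*
	 * KVM doesn't (yet) zap private shadow pages while the TD is
	 * active; reaching here with the HKID still assigned is a bug.
	 */
	KVM_BUG_ON(1, kvm);
	return -EIO;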