On Tue, Apr 02, 2024 at 05:13:23PM +0800, Binbin Wu <binbin.wu@xxxxxxxxxxxxxxx> wrote:
> 
> 
> On 2/26/2024 4:26 PM, isaku.yamahata@xxxxxxxxx wrote:
> > From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> >
> > Implement hooks of TDP MMU for TDX backend. TLB flush, TLB shootdown,
> > propagating the change private EPT entry to Secure EPT and freeing Secure
> > EPT page. TLB flush handles both shared EPT and private EPT. It flushes
> > shared EPT same as VMX. It also waits for the TDX TLB shootdown. For the
> > hook to free Secure EPT page, unlinks the Secure EPT page from the Secure
> > EPT so that the page can be freed to OS.
> >
> > Propagate the entry change to Secure EPT. The possible entry changes are
> > present -> non-present(zapping) and non-present -> present(population). On
> > population just link the Secure EPT page or the private guest page to the
> > Secure EPT by TDX SEAMCALL. Because TDP MMU allows concurrent
> > zapping/population, zapping requires synchronous TLB shoot down with the
> > frozen EPT entry. It zaps the secure entry, increments TLB counter, sends
> > IPI to remote vcpus to trigger TLB flush, and then unlinks the private
> > guest page from the Secure EPT. For simplicity, batched zapping with
> > exclude lock is handled as concurrent zapping. Although it's inefficient,
> > it can be optimized in the future.
> >
> > For MMIO SPTE, the spte value changes as follows.
> > initial value (suppress VE bit is set)
> > -> Guest issues MMIO and triggers EPT violation
> > -> KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
> > -> Guest MMIO resumes. It triggers VE exception in guest TD
> > -> Guest VE handler issues TDG.VP.VMCALL<MMIO>
> > -> KVM handles MMIO
> > -> Guest VE handler resumes its execution after MMIO instruction
> >
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> >
> > ---
> > v19:
> > - Compile fix when CONFIG_HYPERV != y.
> >   It's due to the following patch. Catch it up.
> >   https://lore.kernel.org/all/20231018192325.1893896-1-seanjc@xxxxxxxxxx/
> > - Add comments on tlb shootdown to explan the sequence.
> > - Use gmem_max_level callback, delete tdp_max_page_level.
> >
> > v18:
> > - rename tdx_sept_page_aug() -> tdx_mem_page_aug()
> > - checkpatch: space => tab
> >
> > v15 -> v16:
> > - Add the handling of TD_ATTR_SEPT_VE_DISABLE case.
> >
> > v14 -> v15:
> > - Implemented tdx_flush_tlb_current()
> > - Removed unnecessary invept in tdx_flush_tlb(). It was carry over
> >   from the very old code base.
> >
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu/spte.c    |   3 +-
> >  arch/x86/kvm/vmx/main.c    |  91 ++++++++-
> >  arch/x86/kvm/vmx/tdx.c     | 372 +++++++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/vmx/tdx.h     |   2 +-
> >  arch/x86/kvm/vmx/tdx_ops.h |   6 +
> >  arch/x86/kvm/vmx/x86_ops.h |  13 ++
> >  6 files changed, 481 insertions(+), 6 deletions(-)
> >
> 
[...]
> 
> > +static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> > +				     enum pg_level level)
> > +{
> > +	int tdx_level = pg_level_to_tdx_sept_level(level);
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +	gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
> > +	struct tdx_module_args out;
> > +	u64 err;
> > +
> > +	/* This can be called when destructing guest TD after freeing HKID. */
> > +	if (unlikely(!is_hkid_assigned(kvm_tdx)))
> > +		return 0;
> > +
> > +	/* For now large page isn't supported yet. */
> > +	WARN_ON_ONCE(level != PG_LEVEL_4K);
> > +	err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
> > +	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
> > +		return -EAGAIN;
> > +	if (KVM_BUG_ON(err, kvm)) {
> > +		pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &out);
> > +		return -EIO;
> > +	}
> > +	return 0;
> > +}
> > +
> > +/*
> > + * TLB shoot down procedure:
> > + * There is a global epoch counter and each vcpu has local epoch counter.
> > + * - TDH.MEM.RANGE.BLOCK(TDR. level, range) on one vcpu
> > + *   This blocks the subsequenct creation of TLB translation on that range.
> > + *   This corresponds to clear the present bit(all RXW) in EPT entry
> > + * - TDH.MEM.TRACK(TDR): advances the epoch counter which is global.
> > + * - IPI to remote vcpus
> > + * - TDExit and re-entry with TDH.VP.ENTER on remote vcpus
> > + * - On re-entry, TDX module compares the local epoch counter with the global
> > + *   epoch counter.  If the local epoch counter is older than the global epoch
> > + *   counter, update the local epoch counter and flushes TLB.
> > + */
> > +static void tdx_track(struct kvm *kvm)
> > +{
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +	u64 err;
> > +
> > +	KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm);
> > +	/* If TD isn't finalized, it's before any vcpu running. */
> > +	if (unlikely(!is_td_finalized(kvm_tdx)))
> > +		return;
> > +
> > +	/*
> > +	 * tdx_flush_tlb() waits for this function to issue TDH.MEM.TRACK() by
> > +	 * the counter.  The counter is used instead of bool because multiple
> > +	 * TDH_MEM_TRACK() can be issued concurrently by multiple vcpus.
> Which case will have concurrent issues of TDH_MEM_TRACK() by multiple vcpus?
> For now, zapping is holding write lock.
> Promotion/demotion may have concurrent issues of TDH_MEM_TRACK(), but it's
> not supported yet.

You're right. Large page support will use it. With the assumption that only a
single vcpu issues the TLB flush, the alternative is a boolean + memory
barrier. I prefer to keep atomic_t and drop this comment, rather than use a
boolean + memory barrier, because we would eventually have to switch back to
atomic_t anyway.
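
To illustrate the remote side of that handshake (a sketch only, not part of
the actual hunk; the helper name below is made up), the wait in the
KVM_REQ_TLB_FLUSH path amounts to spinning until this function has issued
TDH.MEM.TRACK():

static void tdx_wait_for_track(struct kvm_vcpu *vcpu)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);

	/*
	 * tdx_track() increments tdh_mem_track before kicking vcpus out and
	 * decrements it once TDH.MEM.TRACK() has succeeded, so waiting for
	 * zero guarantees the next TDH.VP.ENTER flushes against the new epoch.
	 */
	while (atomic_read(&kvm_tdx->tdh_mem_track))
		cpu_relax();
}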

> > +	 *
> > +	 * optimization: The TLB shoot down procedure described in The TDX
> > +	 * specification is, TDH.MEM.TRACK(), send IPI to remote vcpus, confirm
> > +	 * all remote vcpus exit to VMM, and execute vcpu, both local and
> > +	 * remote.  Twist the sequence to reduce IPI overhead as follows.
> > +	 *
> > +	 *           local                   remote
> > +	 *           -----                   ------
> > +	 *    increment tdh_mem_track
> > +	 *
> > +	 *                                   request KVM_REQ_TLB_FLUSH
> > +	 *                                   send IPI
> > +	 *
> > +	 *                                   TDEXIT to KVM due to IPI
> > +	 *
> > +	 *                                   IPI handler calls tdx_flush_tlb()
> > +	 *                                   to process KVM_REQ_TLB_FLUSH.
> > +	 *                                   spin wait for tdh_mem_track == 0
> > +	 *
> > +	 *    TDH.MEM.TRACK()
> > +	 *
> > +	 *    decrement tdh_mem_track
> > +	 *
> > +	 *                                   complete KVM_REQ_TLB_FLUSH
> > +	 *
> > +	 * TDH.VP.ENTER to flush tlbs        TDH.VP.ENTER to flush tlbs
> > +	 */
> > +	atomic_inc(&kvm_tdx->tdh_mem_track);
> > +	/*
> > +	 * KVM_REQ_TLB_FLUSH waits for the empty IPI handler, ack_flush(), with
> > +	 * KVM_REQUEST_WAIT.
> > +	 */
> > +	kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
> > +
> > +	do {
> > +		err = tdh_mem_track(kvm_tdx->tdr_pa);
> > +	} while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
> > +
> > +	/* Release remote vcpu waiting for TDH.MEM.TRACK in tdx_flush_tlb(). */
> > +	atomic_dec(&kvm_tdx->tdh_mem_track);
> > +
> > +	if (KVM_BUG_ON(err, kvm))
> > +		pr_tdx_error(TDH_MEM_TRACK, err, NULL);
> > +
> > +}
> > +
> > +static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
> > +				     enum pg_level level, void *private_spt)
> > +{
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +
> > +	/*
> > +	 * The HKID assigned to this TD was already freed and cache was
> > +	 * already flushed.  We don't have to flush again.
> > +	 */
> > +	if (!is_hkid_assigned(kvm_tdx))
> > +		return tdx_reclaim_page(__pa(private_spt));
> > +
> > +	/*
> > +	 * free_private_spt() is (obviously) called when a shadow page is being
> > +	 * zapped.  KVM doesn't (yet) zap private SPs while the TD is active.
> > +	 * Note: This function is for private shadow page.  Not for private
> > +	 * guest page.  private guest page can be zapped during TD is active.
> > +	 * shared <-> private conversion and slot move/deletion.
> > +	 */
> > +	KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm);
> At this point, is_hkid_assigned(kvm_tdx) is always true.

Yes, will drop this KVM_BUG_ON().
-- 
Isaku Yamahata <isaku.yamahata@xxxxxxxxx>