On Tue, Apr 02, 2024 at 05:13:23PM +0800, Binbin Wu <binbin.wu@xxxxxxxxxxxxxxx> wrote:
> 
> 
> On 2/26/2024 4:26 PM, isaku.yamahata@xxxxxxxxx wrote:
> > From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> >
> > Implement hooks of TDP MMU for TDX backend. TLB flush, TLB shootdown,
> > propagating the change private EPT entry to Secure EPT and freeing Secure
> > EPT page. TLB flush handles both shared EPT and private EPT. It flushes
> > shared EPT same as VMX. It also waits for the TDX TLB shootdown. For the
> > hook to free Secure EPT page, unlinks the Secure EPT page from the Secure
> > EPT so that the page can be freed to OS.
> >
> > Propagate the entry change to Secure EPT. The possible entry changes are
> > present -> non-present(zapping) and non-present -> present(population). On
> > population just link the Secure EPT page or the private guest page to the
> > Secure EPT by TDX SEAMCALL. Because TDP MMU allows concurrent
> > zapping/population, zapping requires synchronous TLB shoot down with the
> > frozen EPT entry. It zaps the secure entry, increments TLB counter, sends
> > IPI to remote vcpus to trigger TLB flush, and then unlinks the private
> > guest page from the Secure EPT. For simplicity, batched zapping with
> > exclude lock is handled as concurrent zapping. Although it's inefficient,
> > it can be optimized in the future.
> >
> > For MMIO SPTE, the spte value changes as follows.
> > initial value (suppress VE bit is set)
> > -> Guest issues MMIO and triggers EPT violation
> > -> KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
> > -> Guest MMIO resumes. It triggers VE exception in guest TD
> > -> Guest VE handler issues TDG.VP.VMCALL<MMIO>
> > -> KVM handles MMIO
> > -> Guest VE handler resumes its execution after MMIO instruction
> >
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> >
> > ---
> > v19:
> > - Compile fix when CONFIG_HYPERV != y.
> >   It's due to the following patch. Catch it up.
> >   https://lore.kernel.org/all/20231018192325.1893896-1-seanjc@xxxxxxxxxx/
> > - Add comments on tlb shootdown to explan the sequence.
> > - Use gmem_max_level callback, delete tdp_max_page_level.
> >
> > v18:
> > - rename tdx_sept_page_aug() -> tdx_mem_page_aug()
> > - checkpatch: space => tab
> >
> > v15 -> v16:
> > - Add the handling of TD_ATTR_SEPT_VE_DISABLE case.
> >
> > v14 -> v15:
> > - Implemented tdx_flush_tlb_current()
> > - Removed unnecessary invept in tdx_flush_tlb(). It was carry over
> >   from the very old code base.
> >
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > ---
> >  arch/x86/kvm/mmu/spte.c    |   3 +-
> >  arch/x86/kvm/vmx/main.c    |  91 ++++++++-
> >  arch/x86/kvm/vmx/tdx.c     | 372 +++++++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/vmx/tdx.h     |   2 +-
> >  arch/x86/kvm/vmx/tdx_ops.h |   6 +
> >  arch/x86/kvm/vmx/x86_ops.h |  13 ++
> >  6 files changed, 481 insertions(+), 6 deletions(-)
> >
> 
[...]
> 
> > +static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> > +				     enum pg_level level)
> > +{
> > +	int tdx_level = pg_level_to_tdx_sept_level(level);
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +	gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
> > +	struct tdx_module_args out;
> > +	u64 err;
> > +
> > +	/* This can be called when destructing guest TD after freeing HKID. */
> > +	if (unlikely(!is_hkid_assigned(kvm_tdx)))
> > +		return 0;
> > +
> > +	/* For now large page isn't supported yet. */
> > +	WARN_ON_ONCE(level != PG_LEVEL_4K);
> > +	err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
> > +	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
> > +		return -EAGAIN;
> > +	if (KVM_BUG_ON(err, kvm)) {
> > +		pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &out);
> > +		return -EIO;
> > +	}
> > +	return 0;
> > +}
> > +
> > +/*
> > + * TLB shoot down procedure:
> > + * There is a global epoch counter and each vcpu has local epoch counter.
> > + * - TDH.MEM.RANGE.BLOCK(TDR. level, range) on one vcpu
> > + *   This blocks the subsequenct creation of TLB translation on that range.
> > + *   This corresponds to clear the present bit(all RXW) in EPT entry
> > + * - TDH.MEM.TRACK(TDR): advances the epoch counter which is global.
> > + * - IPI to remote vcpus
> > + * - TDExit and re-entry with TDH.VP.ENTER on remote vcpus
> > + * - On re-entry, TDX module compares the local epoch counter with the global
> > + *   epoch counter.  If the local epoch counter is older than the global epoch
> > + *   counter, update the local epoch counter and flushes TLB.
> > + */
> > +static void tdx_track(struct kvm *kvm)
> > +{
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +	u64 err;
> > +
> > +	KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm);
> > +	/* If TD isn't finalized, it's before any vcpu running. */
> > +	if (unlikely(!is_td_finalized(kvm_tdx)))
> > +		return;
> > +
> > +	/*
> > +	 * tdx_flush_tlb() waits for this function to issue TDH.MEM.TRACK() by
> > +	 * the counter.  The counter is used instead of bool because multiple
> > +	 * TDH_MEM_TRACK() can be issued concurrently by multiple vcpus.
> Which case will have concurrent issues of TDH_MEM_TRACK() by multiple vcpus?
> For now, zapping is holding write lock.
> Promotion/demotion may have concurrent issues of TDH_MEM_TRACK(), but it's
> not supported yet.

You're right. Large page support will use it. With the assumption that only a
single vcpu issues the TLB flush, the alternative is a boolean + memory
barrier. I prefer to keep atomic_t and drop this comment, rather than use a
boolean + memory barrier, because we would eventually have to switch back to
atomic_t anyway.
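
To illustrate the remote side of that handshake (a sketch only, not part of
the actual hunk; the helper name below is made up), the wait in the
KVM_REQ_TLB_FLUSH path amounts to spinning until this function has issued
TDH.MEM.TRACK():

static void tdx_wait_for_track(struct kvm_vcpu *vcpu)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);

	/*
	 * tdx_track() increments tdh_mem_track before kicking vcpus out and
	 * decrements it once TDH.MEM.TRACK() has succeeded, so waiting for
	 * zero guarantees the next TDH.VP.ENTER flushes against the new epoch.
	 */
	while (atomic_read(&kvm_tdx->tdh_mem_track))
		cpu_relax();
}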

> > +	 *
> > +	 * optimization: The TLB shoot down procedure described in The TDX
> > +	 * specification is, TDH.MEM.TRACK(), send IPI to remote vcpus, confirm
> > +	 * all remote vcpus exit to VMM, and execute vcpu, both local and
> > +	 * remote.  Twist the sequence to reduce IPI overhead as follows.
> > +	 *
> > +	 *           local                   remote
> > +	 *           -----                   ------
> > +	 *    increment tdh_mem_track
> > +	 *
> > +	 *                                   request KVM_REQ_TLB_FLUSH
> > +	 *                                   send IPI
> > +	 *
> > +	 *                                   TDEXIT to KVM due to IPI
> > +	 *
> > +	 *                                   IPI handler calls tdx_flush_tlb()
> > +	 *                                   to process KVM_REQ_TLB_FLUSH.
> > +	 *                                   spin wait for tdh_mem_track == 0
> > +	 *
> > +	 *    TDH.MEM.TRACK()
> > +	 *
> > +	 *    decrement tdh_mem_track
> > +	 *
> > +	 *                                   complete KVM_REQ_TLB_FLUSH
> > +	 *
> > +	 * TDH.VP.ENTER to flush tlbs        TDH.VP.ENTER to flush tlbs
> > +	 */
> > +	atomic_inc(&kvm_tdx->tdh_mem_track);
> > +	/*
> > +	 * KVM_REQ_TLB_FLUSH waits for the empty IPI handler, ack_flush(), with
> > +	 * KVM_REQUEST_WAIT.
> > +	 */
> > +	kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
> > +
> > +	do {
> > +		err = tdh_mem_track(kvm_tdx->tdr_pa);
> > +	} while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
> > +
> > +	/* Release remote vcpu waiting for TDH.MEM.TRACK in tdx_flush_tlb(). */
> > +	atomic_dec(&kvm_tdx->tdh_mem_track);
> > +
> > +	if (KVM_BUG_ON(err, kvm))
> > +		pr_tdx_error(TDH_MEM_TRACK, err, NULL);
> > +
> > +}
> > +
> > +static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
> > +				     enum pg_level level, void *private_spt)
> > +{
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +
> > +	/*
> > +	 * The HKID assigned to this TD was already freed and cache was
> > +	 * already flushed.  We don't have to flush again.
> > +	 */
> > +	if (!is_hkid_assigned(kvm_tdx))
> > +		return tdx_reclaim_page(__pa(private_spt));
> > +
> > +	/*
> > +	 * free_private_spt() is (obviously) called when a shadow page is being
> > +	 * zapped.  KVM doesn't (yet) zap private SPs while the TD is active.
> > +	 * Note: This function is for private shadow page.  Not for private
> > +	 * guest page.  private guest page can be zapped during TD is active.
> > +	 * shared <-> private conversion and slot move/deletion.
> > +	 */
> > +	KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm);
> At this point, is_hkid_assigned(kvm_tdx) is always true.

Yes, will drop this KVM_BUG_ON().
-- 
Isaku Yamahata <isaku.yamahata@xxxxxxxxx>