On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote: > On Thu, Sep 12, 2024, Isaku Yamahata wrote: > > KVM MMU behavior > > ================ > > The leaf SPTE state machine is coded in make_spte(). Consider AD bits and > > the present bits for simplicity. The two functionalities and AD bits > > support are related in this context. unsync (manipulate D bit and W bit, > > and handle write protect fault) and access tracking (manipulate A bit and > > present bit, and hand handle page fault). (We don't consider dirty page > > tracking for now as it's future work of TDX KVM.) > > > > * If AD bit is enabled: > > D bit state change for dirty page tracking > > On the first EPT violation without prefetch, > > - D bits are set. > > - Make SPTE writable as TDX supports only RXW (or if write fault). > > (TDX KVM doesn't support write protection at this state. It's future work.) > > > > On the second EPT violation. > > - clear D bits if write fault. > > Heh, I was literally just writing changelogs for patches to address this (I told > Sagi I would do it "this week"; that was four weeks ago). > > This is a bug in make_spte(). Replacing a W=1,D=1 SPTE with a W=1,D=0 SPTE is > nonsensical. And I'm pretty sure it's an outright but for the TDP MMU (see below). > > Right now, the fixes for make_spte() are sitting toward the end of the massive > kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am > fairly confident that series can land in 6.13 (lots and lots of small patches). > > --- > Author: Sean Christopherson <seanjc@xxxxxxxxxx> > AuthorDate: Thu Sep 12 16:23:21 2024 -0700 > Commit: Sean Christopherson <seanjc@xxxxxxxxxx> > CommitDate: Thu Sep 12 16:35:06 2024 -0700 > > KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits > > Do a remote TLB flush if installing a leaf SPTE overwrites an existing > leaf SPTE (with the same target pfn) and clears the Writable bit or the > Dirty bit. KVM isn't _supposed_ to clear Writable or Dirty bits in such > a scenario, but make_spte() has a flaw where it will fail to set the Dirty > if the existing SPTE is writable. > > E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1 > SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the > second vCPU. If the first vCPU (or another vCPU) accesses memory using > the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is > the only SPTE that is dirty at the time of the next relevant clearing of > the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs > because it sees the D=0 SPTE, and thus will complete the clearing of the > dirty logs without performing a TLB flush. But it looks that kvm_flush_remote_tlbs_memslot() will always be invoked no matter clear_dirty_gfn_range() finds a D bit or not. kvm_mmu_slot_apply_flags |kvm_mmu_slot_leaf_clear_dirty | |kvm_tdp_mmu_clear_dirty_slot | |clear_dirty_gfn_range |kvm_flush_remote_tlbs_memslot > Opportunistically harden the TDP MMU against clearing the Writable bit as > well, both to prevent similar bugs for write-protection, but also so that > the logic can eventually be deduplicated into spte.c (mmu_spte_update() in > the shadow MMU has similar logic). > > Fix the bug in the TDP MMU's page fault handler even though make_spte() is > clearly doing odd things, e.g. it marks the page dirty in its slot but > doesn't set the Dirty bit. Precisely because replacing a leaf SPTE with > another leaf SPTE is so rare, the cost of hardening KVM against such bugs > is negligible. The make_spte() will be addressed in a future commit. > > Fixes: bb18842e2111 ("kvm: x86/mmu: Add TDP MMU PF handler") > Cc: David Matlack <dmatlack@xxxxxxxxxx> > Cc: stable@xxxxxxxxxxxxxxx > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 3b996c1fdaab..7c6d1c610f0e 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -1038,7 +1038,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, > else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte)) > return RET_PF_RETRY; > else if (is_shadow_present_pte(iter->old_spte) && > - !is_last_spte(iter->old_spte, iter->level)) > + (!is_last_spte(iter->old_spte, iter->level) || > + (is_mmu_writable_spte(old_spte) && !is_writable_pte(new_spte)) || > + (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)))) Should we also check is_accessed_spte() as what's done in mmu_spte_update()? It is possible for make_spte() to make the second spte !is_dirty_spte(), e.g. the second one is caused by a KVM_PRE_FAULT_MEMORY ioctl. > kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level); > > /* > --- > > > arch/x86/kvm/mmu/spte.h | 6 ++++++ > > arch/x86/kvm/mmu/tdp_mmu.c | 28 +++++++++++++++++++++++++--- > > 2 files changed, 31 insertions(+), 3 deletions(-) > > > > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h > > index a72f0e3bde17..1726f8ec5a50 100644 > > --- a/arch/x86/kvm/mmu/spte.h > > +++ b/arch/x86/kvm/mmu/spte.h > > @@ -214,6 +214,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask; > > */ > > #define FROZEN_SPTE (SHADOW_NONPRESENT_VALUE | 0x5a0ULL) > > > > +#define EXTERNAL_SPTE_IGNORE_CHANGE_MASK \ > > + (shadow_acc_track_mask | \ > > + (SHADOW_ACC_TRACK_SAVED_BITS_MASK << \ > > + SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) | \ > > Just make TDX require A/D bits, there's no reason to care about access tracking. If KVM_PRE_FAULT_MEMORY is allowed for TDX and if cpu_has_vmx_ept_ad_bits() is false in TDX's hardware (not sure if it's possible), access tracking bit is possbile to be changed, as in below scenario: 1. KVM_PRE_FAULT_MEMORY ioctl calls kvm_arch_vcpu_pre_fault_memory() to map a GFN, and make_spte() will call mark_spte_for_access_track() to remove shadow_acc_track_mask (i.e. RWX bits) and move R+X left to SHADOW_ACC_TRACK_SAVED_BITS_SHIFT. 2. If a concurrent page fault occurs on the same GFN on another vCPU, then make_spte() in that vCPU will not see prefetch and the new_spte is with RWX bits and with no bits set in SHADOW_ACC_TRACK_SAVED_MASK. > > > + shadow_dirty_mask | shadow_accessed_mask) > > + > > /* Removed SPTEs must not be misconstrued as shadow present PTEs. */ > > static_assert(!(FROZEN_SPTE & SPTE_MMU_PRESENT_MASK)); > > > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > > index 019b43723d90..cfb82ede8982 100644 > > --- a/arch/x86/kvm/mmu/tdp_mmu.c > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > > @@ -503,8 +503,6 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp > > kvm_pfn_t new_pfn = spte_to_pfn(new_spte); > > int ret = 0; > > > > - KVM_BUG_ON(was_present, kvm); > > - > > lockdep_assert_held(&kvm->mmu_lock); > > /* > > * We need to lock out other updates to the SPTE until the external > > @@ -519,10 +517,34 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp > > * external page table, or leaf. > > */ > > if (is_leaf) { > > - ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn); > > + /* > > + * SPTE is state machine with software available bits used. > > + * Check if the change is interesting to the backend. > > + */ > > + if (!was_present) > > + ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn); > > + else { > > + /* > > + * The external PTEs don't need updates for some bits, > > + * but if others are changed, bug the VM. > > + */ > > + if (KVM_BUG_ON(~EXTERNAL_SPTE_IGNORE_CHANGE_MASK & > > There's no reason to bug the VM. WARN, yes (maybe), but not bug the VM. And I > think this should be short-circuited somewhere further up the chain, i.e. not > just for TDX. > > One idea would be to WARN and skip setting the SPTE in tdp_mmu_map_handle_target_level(). > I.e. WARN and ignore 1=>0 transitions for Writable and Dirty bits, and then drop > the TLB flush (assuming the above patch lands). > > > + (old_spte ^ new_spte), kvm)) { > > Curly braces are unnecessary. > > > + ret = -EIO; > > + } > > + } > > + > > + /* > > + * The backend shouldn't return an error except EAGAIN. > > + * It's hard to debug without those info. > > + */ > > + if (ret && ret != EAGAIN) > > + pr_debug("gfn 0x%llx old_spte 0x%llx new_spte 0x%llx level %d\n", > > + gfn, old_spte, new_spte, level); > > Please no. Not in upstream. Yeah, for development it's fine, but sprinkling > printks all over the MMU eventually just results in stale printks, e.g. see all > of the pgprintk crud that got ripped out a while back. > > > } else { > > void *external_spt = get_external_spt(gfn, new_spte, level); > > > > + KVM_BUG_ON(was_present, kvm); > > KVM_BUG_ON(!external_spt, kvm); > > ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, external_spt); > > } > > > > base-commit: d2c7662a6ea1c325a9ae878b3f1a265264bcd18b > > -- > > 2.45.2 > > >