On Fri, Jan 10, 2025 at 2:47 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > On Tue, Nov 05, 2024, James Houghton wrote: > > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h > > index a24fca3f9e7f..f26d0b60d2dd 100644 > > --- a/arch/x86/kvm/mmu/tdp_iter.h > > +++ b/arch/x86/kvm/mmu/tdp_iter.h > > @@ -39,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte) > > } > > > > /* > > - * SPTEs must be modified atomically if they are shadow-present, leaf > > - * SPTEs, and have volatile bits, i.e. has bits that can be set outside > > - * of mmu_lock. The Writable bit can be set by KVM's fast page fault > > - * handler, and Accessed and Dirty bits can be set by the CPU. > > + * SPTEs must be modified atomically if they have bits that can be set outside > > + * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the > > + * Writable bit can be set by KVM's fast page fault handler, the Accessed and > > + * Dirty bits can be set by the CPU, and the Accessed and W/R/X bits can be > > + * cleared by age_gfn_range(). > > * > > * Note, non-leaf SPTEs do have Accessed bits and those bits are > > * technically volatile, but KVM doesn't consume the Accessed bit of > > @@ -53,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte) > > static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level) > > { > > return is_shadow_present_pte(old_spte) && > > - is_last_spte(old_spte, level) && > > - spte_has_volatile_bits(old_spte); > > + is_last_spte(old_spte, level); > > I don't like this change on multiple fronts. First and foremost, it results in > spte_has_volatile_bits() being wrong for the TDP MMU. Second, the same logic > applies to the shadow MMU; the rmap lock prevents a use-after-free of the page > that owns the SPTE, but the zapping of the SPTE happens before the writer grabs > the rmap lock. Thanks Sean, yes I forgot about the shadow MMU case. > Lastly, I'm very, very tempted to say we should omit Accessed state from > spte_has_volatile_bits() and rename it to something like spte_needs_atomic_write(). > KVM x86 no longer flushes TLBs on aging, so we're already committed to incorrectly > thinking a page is old in rare cases, for the sake of performance. The odds of > KVM clobbering the Accessed bit are probably smaller than the odds of missing an > Accessed update due to a stale TLB entry. > > Note, only the shadow_accessed_mask check can be removed. KVM needs to ensure > access-tracked SPTEs are zapped properly, and dirty logging can't have false > negatives. I've dropped the change to kvm_tdp_mmu_spte_need_atomic_write() and instead applied this diff. --- a/arch/x86/kvm/mmu/spte.c +++ b/arch/x86/kvm/mmu/spte.c @@ -142,8 +142,14 @@ bool spte_has_volatile_bits(u64 spte) return true; if (spte_ad_enabled(spte)) { - if (!(spte & shadow_accessed_mask) || - (is_writable_pte(spte) && !(spte & shadow_dirty_mask))) + /* + * Do not check the Accessed bit. It can be set (by the CPU) + * and cleared (by kvm_tdp_mmu_age_spte()) without holding + * the mmu_lock, but when clearing the Accessed bit, we do + * not invalidate the TLB, so we can already miss Accessed bit + * updates. + */ + if (is_writable_pte(spte) && !(spte & shadow_dirty_mask)) return true; } I've also included a new patch to rename spte_has_volatile_bits() to spte_needs_atomic_write() like you suggested. I merely renamed it in all locations, including documentation; I haven't reworded the documentation's use of the word "volatile." > > > } > > > > static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte, > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > > index 4508d868f1cd..f5b4f1060fff 100644 > > --- a/arch/x86/kvm/mmu/tdp_mmu.c > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > > @@ -178,6 +178,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, > > ((_only_valid) && (_root)->role.invalid))) { \ > > } else > > > > +/* > > + * Iterate over all TDP MMU roots in an RCU read-side critical section. > > Heh, that's pretty darn obvious. It would be far more helpful if the comment > explained the usage rules, e.g. what is safe (at a high level). How's this? +/* + * Iterate over all TDP MMU roots in an RCU read-side critical section. + * It is safe to iterate over the SPTEs under the root, but their values will + * be unstable, so all writes must be atomic. As this routine is meant to be + * used without holding the mmu_lock at all, any bits that are flipped must + * be reflected in kvm_tdp_mmu_spte_need_atomic_write(). + */ > > + */ > > +#define for_each_valid_tdp_mmu_root_rcu(_kvm, _root, _as_id) \ > > + list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link) \ > > + if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \ > > + (_root)->role.invalid) { \ > > + } else > > + > > #define for_each_tdp_mmu_root(_kvm, _root, _as_id) \ > > __for_each_tdp_mmu_root(_kvm, _root, _as_id, false) > > > > @@ -1168,16 +1177,16 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter) > > u64 new_spte; > > > > if (spte_ad_enabled(iter->old_spte)) { > > - iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep, > > - iter->old_spte, > > - shadow_accessed_mask, > > - iter->level); > > + iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep, > > + shadow_accessed_mask); > > Align, and let this poke past 80: > > iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep, > shadow_accessed_mask); Done. > > new_spte = iter->old_spte & ~shadow_accessed_mask; > > } else { > > new_spte = mark_spte_for_access_track(iter->old_spte); > > - iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep, > > - iter->old_spte, new_spte, > > - iter->level); > > + /* > > + * It is safe for the following cmpxchg to fail. Leave the > > + * Accessed bit set, as the spte is most likely young anyway. > > + */ > > + (void)__tdp_mmu_set_spte_atomic(iter, new_spte); > > Just a reminder that this needs to be: > > if (__tdp_mmu_set_spte_atomic(iter, new_spte)) > return; > Already applied, thanks! > > } > > > > trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level, > > -- > > 2.47.0.199.ga7371fff76-goog > >