On Mon, Sep 30, 2024 at 08:54:38AM -0700, Sean Christopherson wrote:
> On Sun, Sep 29, 2024, Yan Zhao wrote:
> > On Fri, Sep 27, 2024 at 07:32:10AM -0700, Sean Christopherson wrote:
> > > On Fri, Sep 27, 2024, Sean Christopherson wrote:
> > > > On Thu, Sep 26, 2024, Yan Zhao wrote:
> > > > > On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote:
> > > > > > On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> > > > > > Right now, the fixes for make_spte() are sitting toward the end of the massive
> > > > > > kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
> > > > > > fairly confident that series can land in 6.13 (lots and lots of small patches).
> > > > > >
> > > > > > ---
> > > > > > Author:     Sean Christopherson <seanjc@xxxxxxxxxx>
> > > > > > AuthorDate: Thu Sep 12 16:23:21 2024 -0700
> > > > > > Commit:     Sean Christopherson <seanjc@xxxxxxxxxx>
> > > > > > CommitDate: Thu Sep 12 16:35:06 2024 -0700
> > > > > >
> > > > > >     KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
> > > > > >
> > > > > >     Do a remote TLB flush if installing a leaf SPTE overwrites an existing
> > > > > >     leaf SPTE (with the same target pfn) and clears the Writable bit or the
> > > > > >     Dirty bit.  KVM isn't _supposed_ to clear Writable or Dirty bits in such
> > > > > >     a scenario, but make_spte() has a flaw where it will fail to set the Dirty
> > > > > >     bit if the existing SPTE is writable.
> > > > > >
> > > > > >     E.g. if two vCPUs race to handle faults, KVM will install a W=1,D=1
> > > > > >     SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
> > > > > >     second vCPU.  If the first vCPU (or another vCPU) accesses memory using
> > > > > >     the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
> > > > > >     the only SPTE that is dirty at the time of the next relevant clearing of
> > > > > >     the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
> > > > > >     because it sees the D=0 SPTE, and thus will complete the clearing of the
> > > > > >     dirty logs without performing a TLB flush.
> > > > > But it looks like kvm_flush_remote_tlbs_memslot() will always be invoked, no
> > > > > matter whether clear_dirty_gfn_range() finds a D bit or not.
> > > >
> > > > Oh, right, I forgot about that.  I'll tweak the changelog to call that out before
> > > > posting.  Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm:
> > > > x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if
> > > > anything should be backported it's that commit.
> > >
> > > Actually, a better idea.  I think it makes sense to fully commit to not flushing
> > > when overwriting SPTEs, and instead rely on the dirty logging logic to do a remote
> > > TLB flush.
> > >
> > > E.g. on top of this change in the mega-series is a cleanup to unify the TDP MMU
> > > and shadow MMU logic for clearing Writable and Dirty bits, with this comment
> > > (which is a massaged version of an existing comment for mmu_spte_update()):
> > >
> > > /*
> > >  * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote
> > >  * TLBs must be flushed.  Otherwise write-protecting the gfn may find a read-
> > >  * only SPTE, even though the writable SPTE might be cached in a CPU's TLB.
> > >  *
> > >  * Remote TLBs also need to be flushed if the Dirty bit is cleared, as false
> > >  * negatives are not acceptable, e.g. if KVM is using D-bit based PML on VMX.
> > >  *
> > >  * Don't flush if the Accessed bit is cleared, as access tracking tolerates
> > >  * false negatives, and the one path that does care about TLB flushes,
> > >  * kvm_mmu_notifier_clear_flush_young(), uses mmu_spte_update_no_track().
> > >  */
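(Restating the flush rule above as code, just to check my understanding; this
is a hypothetical helper, not code from the series, assuming the existing
predicates in arch/x86/kvm/mmu/spte.h:

static bool flush_on_spte_overwrite(u64 old_spte, u64 new_spte)
{
	/*
	 * Removing write access must flush: the writable translation may
	 * still be cached in a remote CPU's TLB.
	 */
	if (is_mmu_writable_spte(old_spte) && !is_mmu_writable_spte(new_spte))
		return true;

	/*
	 * Clearing the Dirty bit must flush, as false negatives are not
	 * acceptable, e.g. with D-bit based PML.
	 */
	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))
		return true;

	/* Clearing only the Accessed bit tolerates false negatives. */
	return false;
}
)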
> > I have a question about why access tracking tolerates false negatives on the
> > path kvm_mmu_notifier_clear_flush_young().
> >
> > kvm_mmu_notifier_clear_flush_young() invokes kvm_flush_remote_tlbs()
> > only when kvm_age_gfn() returns true.  But age_gfn_range()/kvm_age_rmap() will
> > return false if the old spte is !is_accessed_spte().
> >
> > So, if the Accessed bit is cleared in make_spte(), is a TLB flush required to
> > make up for the one that will then be skipped in
> > kvm_mmu_notifier_clear_flush_young()?
>
> Because access tracking in general is tolerant of stale results due to lack of
> TLB flushes.  E.g. on many architectures, the primary MMU has omitted TLB flushes
> for years (10+ years on x86, commit b13b1d2d8692).  The basic argument is that if
> there is enough memory pressure to trigger reclaim, then there will be enough TLB
> pressure to ensure that omitting the TLB flush doesn't result in a large number
> of "bad" reclaims[1].  And conversely, if there isn't much TLB pressure, then the
> kernel shouldn't be reclaiming.
>
> For KVM, I want to completely eliminate the TLB flush[2] for all architectures
> where it's architecturally legal, because for KVM the flushes are often even
> more expensive than they are for the primary MMU, e.g. due to lack of batching
> and the cost of VM-Exit => VM-Enter (for architectures without broadcast flushes).
>
> [1] https://lore.kernel.org/all/CAOUHufYCmYNngmS=rOSAQRB0N9ai+mA0aDrB9RopBvPHEK42Ng@xxxxxxxxxxxxxx
> [2] https://lore.kernel.org/all/Zmnbb-Xlyz4VXNHI@xxxxxxxxxx

It makes sense.  Thanks for the explanation and the provided links!

Thinking more about prefetched SPTEs: though the A bit tolerates false
negatives, do you think we could still have a small optimization that keeps
the A bit on a prefetched SPTE if the old_spte already has it set?  That way,
if a prefault happens right after a real fault, the A bit would not be
cleared, relying on the fact that KVM does not change PFNs without first
zapping the old SPTE.  (But I'm not sure if you have already covered this
optimization in the mega-series.)

--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -163,6 +163,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	int level = sp->role.level;
 	u64 spte = SPTE_MMU_PRESENT_MASK;
 	bool wrprot = false;
+	bool remove_accessed = prefetch && (!is_shadow_present_pte(old_spte) ||
+		!is_last_spte(old_spte, level) || !is_accessed_spte(old_spte));
 
 	/*
 	 * For the EPT case, shadow_present_mask has no RWX bits set if
@@ -178,7 +180,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		spte |= SPTE_TDP_AD_WRPROT_ONLY;
 
 	spte |= shadow_present_mask;
-	if (!prefetch)
+	if (!remove_accessed)
 		spte |= spte_shadow_accessed_mask(spte);
 
 	/*
@@ -259,7 +261,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		spte |= spte_shadow_dirty_mask(spte);
 
 out:
-	if (prefetch)
+	if (remove_accessed)
 		spte = mark_spte_for_access_track(spte);
 
 	WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
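(If it would help readability, the same condition could also be phrased
positively via a helper; the name and placement are hypothetical, assuming
the existing predicates in arch/x86/kvm/mmu/spte.h:

static bool prefetch_hits_accessed_leaf(u64 old_spte, int level)
{
	/*
	 * True if the prefetch overwrites a shadow-present leaf SPTE that
	 * already has the Accessed bit set, in which case the A bit can be
	 * kept, since KVM doesn't change the PFN without first zapping the
	 * old SPTE.
	 */
	return is_shadow_present_pte(old_spte) &&
	       is_last_spte(old_spte, level) &&
	       is_accessed_spte(old_spte);
}

so that remove_accessed == prefetch && !prefetch_hits_accessed_leaf(old_spte, level).)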