On Fri, Sep 27, 2024 at 07:32:10AM -0700, Sean Christopherson wrote: > On Fri, Sep 27, 2024, Sean Christopherson wrote: > > On Thu, Sep 26, 2024, Yan Zhao wrote: > > > On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote: > > > > On Thu, Sep 12, 2024, Isaku Yamahata wrote: > > > > Right now, the fixes for make_spte() are sitting toward the end of the massive > > > > kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am > > > > fairly confident that series can land in 6.13 (lots and lots of small patches). > > > > > > > > --- > > > > Author: Sean Christopherson <seanjc@xxxxxxxxxx> > > > > AuthorDate: Thu Sep 12 16:23:21 2024 -0700 > > > > Commit: Sean Christopherson <seanjc@xxxxxxxxxx> > > > > CommitDate: Thu Sep 12 16:35:06 2024 -0700 > > > > > > > > KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits > > > > > > > > Do a remote TLB flush if installing a leaf SPTE overwrites an existing > > > > leaf SPTE (with the same target pfn) and clears the Writable bit or the > > > > Dirty bit. KVM isn't _supposed_ to clear Writable or Dirty bits in such > > > > a scenario, but make_spte() has a flaw where it will fail to set the Dirty > > > > if the existing SPTE is writable. > > > > > > > > E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1 > > > > SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the > > > > second vCPU. If the first vCPU (or another vCPU) accesses memory using > > > > the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is > > > > the only SPTE that is dirty at the time of the next relevant clearing of > > > > the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs > > > > because it sees the D=0 SPTE, and thus will complete the clearing of the > > > > dirty logs without performing a TLB flush. > > > But it looks that kvm_flush_remote_tlbs_memslot() will always be invoked no > > > matter clear_dirty_gfn_range() finds a D bit or not. > > > > Oh, right, I forgot about that. I'll tweak the changelog to call that out before > > posting. Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm: > > x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if > > anything should be backported it's that commit. > > Actually, a better idea. I think it makes sense to fully commit to not flushing > when overwriting SPTEs, and instead rely on the dirty logging logic to do a remote > TLB flush. > > E.g. on top of this change in the mega-series is a cleanup to unify the TDP MMU > and shadow MMU logic for clearing Writable and Dirty bits, with this comment > (which is a massaged version of an existing comment for mmu_spte_update()): > > /* > * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote > * TLBs must be flushed. Otherwise write-protecting the gfn may find a read- > * only SPTE, even though the writable SPTE might be cached in a CPU's TLB. > * > * Remote TLBs also need to be flushed if the Dirty bit is cleared, as false > * negatives are not acceptable, e.g. if KVM is using D-bit based PML on VMX. > * > * Don't flush if the Accessed bit is cleared, as access tracking tolerates > * false negatives, and the one path that does care about TLB flushes, > * kvm_mmu_notifier_clear_flush_young(), uses mmu_spte_update_no_track(). I have a question about why access tracking tolerates false negatives on the path kvm_mmu_notifier_clear_flush_young(). kvm_mmu_notifier_clear_flush_young() invokes kvm_flush_remote_tlbs() only when kvm_age_gfn() returns true. But age_gfn_range()/kvm_age_rmap() will return false if the old spte is !is_accessed_spte(). So, if the Access bit is cleared in make_spte(), is a TLB flush required to avoid that it's not done in kvm_mmu_notifier_clear_flush_young()? > * > * Note, this logic only applies to leaf SPTEs. The caller is responsible for > * determining whether or not a TLB flush is required when modifying a shadow- > * present non-leaf SPTE. > */ > > But that comment is was made stale by commit b64d740ea7dd. And looking through > the dirty logging logic, KVM (luckily? thankfully?) flushes based on whether or > not dirty bitmap/ring entries are reaped, not based on whether or not SPTEs were > modified. >