On Mon, Sep 09, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > +/*
> > + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
> > + * operates with mmu_lock held for write), but rmaps can be walked without
> > + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
> > + * being zapped/dropped _while the rmap is locked_.
> > + *
> > + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
> > + * done while holding mmu_lock for write. This allows a task walking rmaps
> > + * without holding mmu_lock to concurrently walk the same entries as a task
> > + * that is holding mmu_lock but _not_ the rmap lock. Neither task will modify
> > + * the rmaps, thus the walks are stable.
> > + *
> > + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
> > + * only the rmap chains themselves are protected. E.g. holding an rmap's lock
> > + * ensures all "struct pte_list_desc" fields are stable.
>
> This last sentence makes me think we need to be careful about memory ordering.
>
> > + */
> > +#define KVM_RMAP_LOCKED	BIT(1)
> > +
> > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > +{
> > +	unsigned long old_val, new_val;
> > +
> > +	old_val = READ_ONCE(rmap_head->val);
> > +	if (!old_val)
> > +		return 0;
> > +
> > +	do {
> > +		/*
> > +		 * If the rmap is locked, wait for it to be unlocked before
> > +		 * trying to acquire the lock, e.g. to avoid bouncing the
> > +		 * cache line.
> > +		 */
> > +		while (old_val & KVM_RMAP_LOCKED) {
> > +			old_val = READ_ONCE(rmap_head->val);
> > +			cpu_relax();
> > +		}
> > +
> > +		/*
> > +		 * Recheck for an empty rmap, it may have been purged by the
> > +		 * task that held the lock.
> > +		 */
> > +		if (!old_val)
> > +			return 0;
> > +
> > +		new_val = old_val | KVM_RMAP_LOCKED;
> > +	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
>
> I think we (technically) need an smp_rmb() here. I think cmpxchg
> implicitly has that on x86 (and this code is x86-only), but should we
> nonetheless document that we need smp_rmb() (if it is indeed required)?
> Perhaps we could/should condition the smp_rmb() on `if (old_val)`.

Hmm, no, not smp_rmb(). If anything, the appropriate barrier here would be
smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and
arguably even wrong.

For the !old_val case, there is an address/data dependency that can't be
broken by the CPU without violating the x86 memory model (all future actions
with relevant memory loads depend on rmap_head->val being non-zero). And
AIUI, in the Linux kernel memory model, READ_ONCE() is responsible for
ensuring that the address dependency can't be morphed into a control
dependency by the compiler and subsequently reordered by the CPU.

I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment,
I don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition
of __READ_ONCE() promotes the operation to an acquire.

Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary
spinlock, hence my smp_mb__after_spinlock() suggestion. Though _because_
it's a spinlock, the rmaps are fully protected by the critical section.

And for the SPTEs, there is no required ordering. The reader (aging thread)
can observe a !PRESENT or a PRESENT SPTE, and must be prepared for either.
I.e. there is no requirement that the reader observe a PRESENT SPTE if there
is a valid rmap.
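FWIW, if there _were_ a real ordering requirement, the barrier would sit
immediately after the CMPXCHG succeeds, i.e. at the point of "acquisition",
a la the real spinlock implementations. Sketch only, to show the placement
I'm arguing against:

	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));

	/*
	 * Hypothetical, NOT needed (IMO): this is where the critical
	 * section would be ordered after acquisition of the KVM_RMAP_LOCKED
	 * "spinlock".  A nop on x86, where the LOCK CMPXCHG above is
	 * already a full barrier.
	 */
	smp_mb__after_spinlock();

	/* Return the old value, i.e. _without_ the LOCKED bit set. */
	return old_val;
}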
So, unless I'm missing something, I would prefer to not add an
smp_mb__after_spinlock(), even though it's a nop on x86 (unless
KCSAN_WEAK_MEMORY=y), because it suggests an ordering requirement that
doesn't exist.

> kvm_rmap_lock_readonly() should have an smp_rmb(), but it seems like
> adding it here will do the right thing for the read-only lock side.
>
> > +
> > +	/* Return the old value, i.e. _without_ the LOCKED bit set. */
> > +	return old_val;
> > +}
> > +
> > +static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
> > +			    unsigned long new_val)
> > +{
> > +	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
>
> Same goes with having an smp_wmb() here. Is it necessary? If so,
> should it at least be documented?
>
> And this is *not* necessary for kvm_rmap_unlock_readonly(), IIUC.
>
> > +	WRITE_ONCE(rmap_head->val, new_val);
> > +}
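To tie this together, the usage I expect from a lockless walker is below.
Note, the function and its body are purely illustrative (no such helper
exists in the patch); the contract is simply lock, walk, then feed the
possibly-updated value back to the unlock:

static bool example_rmap_walk(struct kvm_rmap_head *rmap_head)
{
	unsigned long rmap_val = kvm_rmap_lock(rmap_head);

	/* An empty rmap was (and still is) unlocked, bail immediately. */
	if (!rmap_val)
		return false;

	/*
	 * The pte_list_desc chain is stable here, but individual SPTEs can
	 * still be zapped by the mmu_lock holder, i.e. the walker must
	 * tolerate both PRESENT and !PRESENT SPTEs.  Update rmap_val if
	 * the chain itself is modified.
	 */

	/* Writing the value back, sans KVM_RMAP_LOCKED, drops the lock. */
	kvm_rmap_unlock(rmap_head, rmap_val);
	return true;
}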