On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Sep 09, 2024, James Houghton wrote:
> > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > + */
> > > +#define KVM_RMAP_LOCKED	BIT(1)
> > > +
> > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > > +{
> > > +	unsigned long old_val, new_val;
> > > +
> > > +	old_val = READ_ONCE(rmap_head->val);
> > > +	if (!old_val)
> > > +		return 0;
> > > +
> > > +	do {
> > > +		/*
> > > +		 * If the rmap is locked, wait for it to be unlocked before
> > > +		 * trying to acquire the lock, e.g. to bounce the cache line.
> > > +		 */
> > > +		while (old_val & KVM_RMAP_LOCKED) {
> > > +			old_val = READ_ONCE(rmap_head->val);
> > > +			cpu_relax();
> > > +		}
> > > +
> > > +		/*
> > > +		 * Recheck for an empty rmap, it may have been purged by the
> > > +		 * task that held the lock.
> > > +		 */
> > > +		if (!old_val)
> > > +			return 0;
> > > +
> > > +		new_val = old_val | KVM_RMAP_LOCKED;
> > > +	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> >
> > I think we (technically) need an smp_rmb() here. I think cmpxchg
> > implicitly has that on x86 (and this code is x86-only), but should we
> > nonetheless document that we need smp_rmb() (if it is indeed required)?
> > Perhaps we could/should condition the smp_rmb() on `if (old_val)`.
>
> Hmm, no, not smp_rmb().  If anything, the appropriate barrier here would be
> smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and
> arguably even wrong.

I don't think smp_mb__after_spinlock() is right either. It seems to be
used following the acquisition of a spinlock to promote the memory
ordering from an acquire barrier (which is implicit in the lock
acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a
stronger-than-usual barrier. But I guess I'm not really sure.

In this case, I'm complaining that we don't have the usual memory
ordering restrictions that come with a spinlock.

> For the !old_val case, there is an address/data dependency that can't be broken
> by the CPU without violating the x86 memory model (all future actions with
> relevant memory loads depend on rmap_head->val being non-zero).  And AIUI, in
> the Linux kernel memory model, READ_ONCE() is responsible for ensuring that the
> address dependency can't be morphed into a control dependency by the compiler
> and subsequently reordered by the CPU.
>
> I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I
> don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of
> __READ_ONCE() promotes the operation to an acquire.
>
> Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary
> spinlock, hence my smp_mb__after_spinlock() suggestion.  Though _because_ it's
> a spinlock, the rmaps are fully protected by the critical section.

I feel like a spinlock must include the appropriate barriers for it to
correctly function as a spinlock, so I'm not sure I fully understand
what you mean here.

> And for the SPTEs, there is no required ordering.  The reader (aging thread)
> can observe a !PRESENT or a PRESENT SPTE, and must be prepared for either.
> I.e. there is no requirement that the reader observe a PRESENT SPTE if there
> is a valid rmap.

This makes sense.

> So, unless I'm missing something, I would prefer to not add a
> smp_mb__after_spinlock(), even though it's a nop on x86 (unless
> KCSAN_WEAK_MEMORY=y), because it suggests an ordering requirement that doesn't
> exist.
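
Before I go on, to make sure we're describing the same thing: here is
where I was imagining the barrier would go, using the kvm_rmap_lock()
quoted above. This is purely illustrative (the tail of the function is
elided in the quote, so the return at the end is my guess), and on x86
the added barrier would be a documentation-only nop:

static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
{
	unsigned long old_val, new_val;

	old_val = READ_ONCE(rmap_head->val);
	if (!old_val)
		return 0;

	do {
		/*
		 * If the rmap is locked, wait for it to be unlocked before
		 * trying to acquire the lock, e.g. to bounce the cache line.
		 */
		while (old_val & KVM_RMAP_LOCKED) {
			old_val = READ_ONCE(rmap_head->val);
			cpu_relax();
		}

		/*
		 * Recheck for an empty rmap, it may have been purged by the
		 * task that held the lock.
		 */
		if (!old_val)
			return 0;

		new_val = old_val | KVM_RMAP_LOCKED;
	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));

	/*
	 * The barrier I was asking about: old_val is non-zero here, so pair
	 * with the smp_wmb()/release of whoever published the rmap contents,
	 * before any loads in the critical section. A nop on x86, since the
	 * LOCK CMPXCHG above is already a full barrier.
	 */
	smp_rmb();

	/* The rest of the function is elided in the quote above. */
	return old_val;
}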
So we have: the general kvm_rmap_lock() and the read-only
kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use
those names (sorry if it's confusing).

For kvm_rmap_lock(), we are always holding mmu_lock for writing. So any
changes we make to the rmap will be properly published to other threads
that subsequently grab kvm_rmap_lock(), because we had to properly
release and then re-acquire mmu_lock, which comes with the barriers I'm
saying we need.

For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no
smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that it
is possible for this reader to observe external changes to the
pte_list_desc while it is in the critical section (on a sufficiently
weakly ordered architecture). The changes made by the kvm_rmap_lock()
(mmu_lock) side were half-published with an smp_wmb() (really the
release in [3]), but the read side didn't use a load-acquire or
smp_rmb(), so it hasn't held up its end of the deal.

I don't think READ_ONCE() has the guarantees we need to be a sufficient
replacement for smp_rmb(), or for the load-acquire that a real lock
would use, although I agree with you that, on arm64, it apparently *is*
a sufficient replacement.

Now this isn't a problem if the kvm_rmap_lock_readonly() side can
tolerate changes to the pte_list_desc while in the critical section. I
don't think it can (given for_each_rmap_spte_lockless()), therefore an
smp_rmb() or equivalent is (technically) needed. Am I confused?

(Though all of this works just fine as written on x86.)

To be concrete about the pairing I'm describing, there is a sketch
below the links.

[1]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L62
[2]: https://lore.kernel.org/kvm/20240809194335.1726916-21-seanjc@xxxxxxxxxx/
[3]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L190
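
The pairing, reduced to its minimal form (a sketch under my
assumptions, not the code from [2]; KVM_RMAP_MANY is my placeholder
name for the "val points at a pte_list_desc" bit, and in the real code
the writer's ordering comes from releasing mmu_lock / the rmap lock
rather than a bare smp_wmb()):

/* Writer, holding mmu_lock for write, publishing a new descriptor. */
static void publish_desc(struct kvm_rmap_head *rmap_head,
			 struct pte_list_desc *desc, u64 *spte)
{
	desc->sptes[0] = spte;
	/* Make the descriptor contents visible before publishing it. */
	smp_wmb();
	WRITE_ONCE(rmap_head->val, (unsigned long)desc | KVM_RMAP_MANY);
}

/* Lockless reader, i.e. the kvm_rmap_lock_readonly() path. */
static void walk_desc_lockless(struct kvm_rmap_head *rmap_head)
{
	unsigned long val = READ_ONCE(rmap_head->val);
	struct pte_list_desc *desc;

	if (!(val & KVM_RMAP_MANY))
		return;

	/*
	 * Pair with the writer's smp_wmb() so the walk below can't see a
	 * half-initialized descriptor on a weakly ordered architecture.
	 * This is the load-acquire/smp_rmb() I'm claiming the readonly
	 * path needs; on x86 it falls out of the LOCK CMPXCHG.
	 */
	smp_rmb();
	desc = (struct pte_list_desc *)(val & ~KVM_RMAP_MANY);
	/* ... for_each_rmap_spte_lockless() walk of desc->sptes ... */
}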