On Thu, Jun 02, 2016 at 06:57:00PM +0100, Will Deacon wrote:
> > +++ b/include/asm-generic/qspinlock.h
> > @@ -28,30 +28,13 @@
> >   */
> >  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
> >  {
> > +	/*
> > +	 * See queued_spin_unlock_wait().
> >  	 *
> > +	 * Any !0 state indicates it is locked, even if _Q_LOCKED_VAL
> > +	 * isn't immediately observable.
> >  	 */
> > -	smp_mb();
> > +	return !!atomic_read(&lock->val);
> >  }
>
> I'm failing to keep up here :(
>
> The fast-path code in queued_spin_lock is just an atomic_cmpxchg_acquire.
> If that's built out of LL/SC instructions, then why don't we need a barrier
> here in queued_spin_is_locked?
>
> Or is the decision now that only spin_unlock_wait is required to enforce
> this ordering?

(warning: long, somewhat rambling, email)

You're talking about the smp_mb() that went missing?

So that wasn't the reason that smp_mb() existed..... but that makes the
atomic_foo_acquire() things somewhat hard to use, because I don't think
we want to unconditionally put the smp_mb() in there just in case.

Now, the normal atomic_foo_acquire() stuff uses smp_mb() as per
smp_mb__after_atomic(); it's just ARM64 and PPC that go all 'funny' and
need this extra barrier. Blergh. So let's shelve this issue for a bit.

Let me write some text to hopefully explain where it did come from and
why I now removed it.

So the spin_is_locked() correctness issue comes from something like:

	CPU0					CPU1

	global_lock();				local_lock(i)
	  spin_lock(&G)				  spin_lock(&L[i])
	  for (i)				  if (!spin_is_locked(&G)) {
	    spin_unlock_wait(&L[i]);		    smp_acquire__after_ctrl_dep();
						    return;
						  }
						  /* deal with fail */

Where it is important CPU1 sees G locked or CPU0 sees L[i] locked, such
that there is exclusion between the two critical sections.

The load from spin_is_locked(&G) is constrained by the ACQUIRE from
spin_lock(&L[i]), and similarly the load(s) from spin_unlock_wait(&L[i])
are constrained by the ACQUIRE from spin_lock(&G).

Similarly, later stuff is constrained by the ACQUIRE from CTRL+RMB.

Given a simple (SC) test-and-set spinlock the above is fairly
straightforward and 'correct', right?

Now, the 'problem' with qspinlock is that one possible acquire path goes
like (there are more, but this is the easiest):

	smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));
	clear_pending_set_locked(lock);

And one possible implementation of clear_pending_set_locked() is:

	WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);

IOW, we load-acquire the locked byte until it's cleared, at which point
we know the pending byte to be 1. Then we consider the lock owned by us
and issue a regular, unordered store to flip the pending and locked
bytes.

Normal mutual exclusion is fine with this; no new pending can happen
until this store becomes visible, at which time the locked byte is
visibly taken.

This unordered store, however, can be delayed (store buffer) such that
the loads from spin_unlock_wait/spin_is_locked can complete before it
becomes visible (even on TSO arches).

_IF_ spin_unlock_wait/spin_is_locked only look at the locked byte, this
is a problem because at that point the crossed store-load pattern
becomes uncrossed and we lose our guarantee. That is, what used to be:

	[S] G.locked = 1		[S] L[i].locked = 1
	[MB]				[MB]
	[L] L[i].locked			[L] G.locked

becomes:

	[S] G.locked = 1		[S] L[i].locked = 1
	[L] L[i].locked			[L] G.locked

Which we can reorder at will and bad things follow.

The previous fix for this was to include an smp_mb() in both
spin_is_locked() and spin_unlock_wait() to restore that ordering.
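To make that crossed pattern concrete, here is a store-buffering style
sketch in plain C of what the two barriers buy us (G_locked and L_locked
stand in for the two lock words; this is purely illustrative, not actual
kernel code):

	int G_locked = 0, L_locked = 0;
	int r0, r1;

	/* CPU0 */			/* CPU1 */
	WRITE_ONCE(G_locked, 1);	WRITE_ONCE(L_locked, 1);
	smp_mb();			smp_mb();
	r0 = READ_ONCE(L_locked);	r1 = READ_ONCE(G_locked);

With both smp_mb()s present the outcome (r0 == 0 && r1 == 0) is
forbidden, so at least one CPU observes the other's lock. Take either
barrier away (which is effectively what the delayed, unordered .locked
store does) and that outcome becomes allowed, and with it the loss of
exclusion described above.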
So at this point spin_is_locked() looks like:

	smp_mb();
	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
		cpu_relax();

But for spin_unlock_wait() there is a second correctness issue, namely:

	CPU0					CPU1

	flag = set;
	smp_mb();				spin_lock(&l)
	spin_unlock_wait(&l);			if (flag)
						  /* bail */

						/* add to lockless list */
						spin_unlock(&l);

	/* iterate lockless list */

Which ensures that CPU1 will stop adding bits to the list and CPU0 will
observe the last entry on the list (if spin_unlock_wait() had ACQUIRE
semantics etc..)

This, however, is still broken: nothing ensures CPU0 sees l.locked
before CPU1 tests flag.

So while we fixed the first correctness case (global/local locking as
employed by ipc/sem and nf_conntrack), this is still very much broken.

My patch today rewrites spin_unlock_wait() and spin_is_locked() to rely
on more information to (hopefully -- I really need sleep) fix both.

The primary observation is that even though the l.locked store is
delayed, there has been a prior atomic operation on the lock word to
register the contending lock (in the above scenario, setting the pending
byte; in the other paths, queueing onto the tail word).

This means that any load passing the .locked byte store must at least
observe that state.

Therefore, spin_is_locked() looks for any !0 state and is done. Even if
the locked byte is cleared, if any of the other fields are !0 it is in
fact locked.

spin_unlock_wait() is slightly more complex; it becomes:

 1) if the entire word is 0 -- we're unlocked, done

 2) if the locked byte is 0 (there must be other state, otherwise the
    previous case), wait until we see the locked byte

 3) wait for the locked byte to be cleared

Which relies on the same observation: even though we might not observe
the locked store, we must observe a store by the contending lock.

This means the scenario above now becomes:

	[S] G.pending = 1			[S] L[i].pending = 1
	[MB]					[MB]
	[S] G.locked_pending = 1		[S] L[i].locked_pending = 1
	[L] L[i]				[L] G

And things are good again; we simply do not care if the unordered store
is observed or not.

Make sense?
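To make steps 1-3 above concrete, here is a rough sketch of how they
could be spelled out in code. This is an illustration of the algorithm
as described, not necessarily the exact code in the patch:

	void queued_spin_unlock_wait(struct qspinlock *lock)
	{
		u32 val;

		for (;;) {
			val = atomic_read(&lock->val);

			/* 1) the entire word is 0: not locked, done */
			if (!val)
				goto done;

			/* locked byte set: go wait for it to clear in 3) */
			if (val & _Q_LOCKED_MASK)
				break;

			/* 2) other state set but locked byte still 0: wait to observe it */
			cpu_relax();
		}

		/* 3) wait for the locked byte to be cleared */
		while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
			cpu_relax();

	done:
		/* CTRL + RMB gives the ACQUIRE, as with smp_cond_acquire() */
		smp_acquire__after_ctrl_dep();
	}

As above, the loads only need to observe some prior store by the
contending lock (pending byte or tail word), not the delayed locked-byte
store itself.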