On Mon, Nov 25, 2013 at 06:35:40PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 22, 2013 at 07:51:07PM +0100, Peter Zijlstra wrote:
> > On Fri, Nov 22, 2013 at 10:26:32AM -0800, Paul E. McKenney wrote:
> > > The real source of my cognitive pain is that here we have a sequence of
> > > code that has neither atomic instructions nor memory-barrier instructions,
> > > but it looks like it still manages to act as a full memory barrier.
> > > Still not quite sure I should trust it...
> >
> > Yes, this is something that puzzles me too.
> >
> > That said, the two rules that:
> >
> >  1) stores aren't re-ordered against other stores
> >  2) reads aren't re-ordered against other reads
> >
> > Do make that:
> >
> >   STORE x
> >   LOAD x
> >
> > form a fence that neither stores nor loads can pass through from
> > either side; note however that they themselves rely on the data
> > dependency to not reorder against themselves.
> >
> > If you put them the other way around:
> >
> >   LOAD x
> >   STORE y
> >
> > we seem to get a stronger variant because stores are not re-ordered
> > against older reads.
> >
> > There is however the exception clause for rule 1) above, which includes
> > clflush, non-temporal stores and string ops; the actual mfence
> > instruction doesn't seem to have this exception and would thus be
> > slightly stronger still.
> >
> > Still a confusing situation all round.
>
> I think this means x86 needs help too.

I still do not believe that it does.  Again, strangely enough.  We need
to ask someone in Intel who understands this all the way down to the
silicon.  The guy I used to rely on for this no longer works at Intel.
Do you know someone who fits this description, or should I start
sending cold-call emails to various Intel contacts?

> Consider:
>
>   x = y = 0
>
>   w[x] = 1 | w[y] = 1
>   mfence   | mfence
>   r[y] = 0 | r[x] = 0
>
> This is generally an impossible case, right?  (Since if we observe y=0
> this means that w[y]=1 has not yet happened, and therefore x=1, and
> vice-versa.)
>
> Now replace one of the mfences with smp_store_release(l1);
> smp_load_acquire(l2); such that we have a RELEASE+ACQUIRE pair that
> _should_ form a full barrier:
>
>   w[x] = 1  | w[y] = 1
>   w[l1] = 1 | mfence
>   r[l2] = 0 | r[x] = 0
>   r[y] = 0  |
>
> At which point we can observe the impossible, because as per the rule:
>
>   'reads may be reordered with older writes to different locations'
>
> our r[y] can slip before the w[x]=1.
>
> Thus x86's smp_store_release() would need to be:
>
> +#define smp_store_release(p, v)					\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
>
> Or: (void)xchg((p), (v));
>
> Idem for s390 and sparc I suppose.
>
> The only reason your example worked is because the unlock and lock were
> for the same lock.

Exactly!!!  And if the two locks are different, then the guarantee
applies only when the unlock and lock are on the same CPU, in which
case, as Linus noted, the xchg() on entry to the slow path does the
job for us.

> This of course leaves us without joy for circular buffers, which can do
> without this LOCK'ed op and without sync on PPC.  Now I'm not at all sure
> we've got enough of those to justify primitives just for them.

I am beginning to think that we do, but that is a separate discussion.

							Thanx, Paul
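
[Editor's illustration, not part of the thread.]  The first litmus test
above is easy to experiment with from userspace.  The sketch below is my
own approximation (file name, variable names, and the use of GCC/Clang
__atomic builtins are assumptions, not kernel code): each thread stores
to its variable, issues a full fence, and then reads the other variable;
with the fences in place, the outcome where both threads read zero should
never be observed on x86.

/*
 * litmus.c -- illustrative userspace sketch of the mfence/mfence test.
 * Build with:  cc -O2 -pthread litmus.c
 */
#include <pthread.h>
#include <stdio.h>

static int x, y;	/* shared variables, reset each iteration */
static int r_x, r_y;	/* per-iteration results, read after join */

static void *t0(void *arg)
{
	__atomic_store_n(&x, 1, __ATOMIC_RELAXED);	/* w[x] = 1 */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);	/* mfence   */
	r_y = __atomic_load_n(&y, __ATOMIC_RELAXED);	/* r[y]     */
	return NULL;
}

static void *t1(void *arg)
{
	__atomic_store_n(&y, 1, __ATOMIC_RELAXED);	/* w[y] = 1 */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);	/* mfence   */
	r_x = __atomic_load_n(&x, __ATOMIC_RELAXED);	/* r[x]     */
	return NULL;
}

int main(void)
{
	int i, forbidden = 0;

	for (i = 0; i < 100000; i++) {
		pthread_t a, b;

		x = y = 0;
		pthread_create(&a, NULL, t0, NULL);
		pthread_create(&b, NULL, t1, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		if (r_y == 0 && r_x == 0)	/* the "impossible" case */
			forbidden++;
	}
	printf("forbidden outcomes: %d\n", forbidden);
	return 0;
}

Swapping t0's SEQ_CST fence for a plain compiler barrier (which is all a
TSO-only smp_store_release() amounts to) lets the store-buffering
reordering show up occasionally: t0's read of y can complete before its
write to x becomes visible, which is exactly the "reads may be reordered
with older writes to different locations" case Peter describes above.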