Re: [PATCH v6 4/5] MCS Lock: Barrier corrections

"Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx> · Mon, 25 Nov 2013 09:53:15 -0800

On Sat, Nov 23, 2013 at 12:21:13PM -0800, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 5:36 PM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > But this does -not- guarantee that some other non-lock-holding CPU 2 will
> > see CS0 and CS1 in order.  To see this, let's fill in the two critical
> > sections, where variables X and Y are both initially zero:
> >
> >         CPU 0 (releasing)       CPU 1 (acquiring)
> >         -----                   -----
> >         ACCESS_ONCE(X) = 1;     while (ACCESS_ONCE(lock) == 1)
> >         lwsync                          continue;
> >         ACCESS_ONCE(lock) = 0;  lwsync
> >                                 r1 = ACCESS_ONCE(Y);
> >
> > Then let's add in the observer CPU 2:
> >
> >         CPU 2
> >         -----
> >         ACCESS_ONCE(Y) = 1;
> >         sync
> >         r2 = ACCESS_ONCE(X);
> >
> > If unlock+lock act as a full memory barrier, it would be impossible to
> > end up with (r1 == 0 && r2 == 0).  After all, (r1 == 0) implies that
> > CPU 2's store to Y happened after CPU 1's load from Y, and (r2 == 0)
> > implies that CPU 0's load from X happened after CPU 2's store to X.
> > If CPU 0's unlock combined with CPU 1's lock really acted like a full
> > memory barrier, we end up with CPU 0's load happening before CPU 1's
> > store happening before CPU 2's store happening before CPU 2's load
> > happening before CPU 0's load.
> >
> > However, the outcome (r1 == 0 && r2 == 0) really does happen both
> > in theory and on real hardware.
> 
> Ok, so I have ruminated.
> 
> But even after having ruminated, the one thing I cannot find is
> support for your "(r1 == 0 && r2 == 0) really does happen on
> hardware".
> 
> Ignore theory, and assume just cache coherency,

Sorry, but that is not ignoring theory.  Ignoring theory would instead
mean confining oneself to running tests on real hardware.  And there is
real hardware that does allow the assertion to trigger.  You are instead
asking me to use your personal theory instead of a theory that has been
shown to match reality.

Let's see how that plays out.

>                                                 ie the notion that in
> order to write to a cacheline, you have to have that cacheline in some
> exclusive state. We have three cachelines, X, Y and lock (and our
> cachelines only have a single bit, starting out as 0,0,1
> respectively).
> 
> CPU0:
>    write X = 1;
>    lwsync
>    write lock = 0;
> 
> doesn't even really require any cache accesses at all per se, but it
> *does* require that the two stores be ordered in the store buffer on
> CPU0 in such a way that cacheline X gets updated (ie is in
> exclusive/dirty state in CPU0 with the value 1) before cacheline
> 'lock' gets released from its exclusive/dirty state after having
> itself been updated to 0.

So far, so good.

> So basically we know that *within* CPU0, by the time the lock actually
> makes it out of the CPU, the cacheline containing X will have been in
> dirty mode with the value "1".

You seem to be assuming that the only way for the cache line to make it
out of the CPU is via the caches.  This assumption is incorrect given
hardware multithreading, in which case the four hardware threads in a
core (in the case of Power 7) can share a store buffer, and can thus
communicate via that store buffer.

Another way of putting this is that you are assuming multi-copy atomic
behavior, which is in fact guaranteed on x86 by the the bullet in 8.2.2
of the "Intel 64 and IA-32 Architectures Software Developer's Manual"
which reads:

	"Any two stores are seen in a consistent order by processors
	other than those performing the stores."

The reality is that not all architectures guarantee multi-copy atomic
behavior.

>                                The core might actually have written
> 'lock' first, but it can't release that cacheline from exclusive state
> (so that it is visible anywhere else) until it has _also_ gotten 'X'
> into exclusive state (once both cachelines are exclusive within CPU0,
> it can do any ordering, because the ordering won't be externally
> visible).

Not always true when store buffers are shared among hardware threads!
In particular, consider the case where CPU 0 and CPU 1 share a store
buffer and CPU 2 is on some other core.  CPU 1 sees CPU 2's accesses
in order, but the lwsync instructions do not order prior stores against
later loads.  Therefore, it is legal for CPU 0's store to X be released
from the core -after- CPU 1's load from Y.  CPU 2's sync cannot help in
this case, so the assertion can trigger.

Please note that this does -not- violate cache coherence:  All three
CPUs agree on the order of accesses to each individual memory location.
(Or do you mean something else by "cache coherence"?)

> And this is all just looking at CPU0, nothing else. But if it is
> exclusive/dirty on CPU0, then it cannot be shared in any other CPU
> (although a previous stale value may obviously still be "in flight"
> somewhere else outside of the coherency domain).
> 
> So let's look at CPU2, which is similar, but now the second access is
> a read (of zero), not a write:
> 
> CPU2:
>    write Y = 1;
>    sync
>    read X as zero
> 
> So if 'sync' is a true memory barrier between the write and the read,
> then we know that the following is true: CPU2 must have gotten
> cacheline 'Y' into exclusive state and acquired (or held on to, which
> is equivalent) cacheline 'X' in shared state _after_ it got that Y
> into exclusive state. It can't rely on some "in flight" previously
> read value of 'X' until after it got Y into exclusive state. Otherwise
> it wouldn't be a ordering between the write and the read, agreed?

Again, this line of reasoning does not take into account the possibility
of store buffers being shared between hardware threads within a single
core.  The key point is that CPU 0 and CPU 1 can be sharing the new value
of X prior to its reaching the cache, in other words, before CPU 2 can
see it.

So, no, I do not agree that this holds for all real hardware.

> The pattern on CPU1, meanwhile, is somewhat different. But I'm going
> to ignore the "while" part, and just concentrate on the last iteration
> of the loop, and it turns into:
> 
> CPU1:
>    read lock as zero
>    lwsync
>    read Y as zero
> 
> It only does reads, so it is happy with a shared cacheline, but in
> order for the lwsync to actually act as an acquire, it does mean that
> the cacheline 'Y' needs to be in some shared state within the cache
> coherency domain after (or, again, across) cacheline 'lock' having
> been in a shared state with value == 0 on CPU1. No "in flight" values.

Or, if CPU 0 and CPU 1 are hardware threads in the same core, it is
happy with a shared store-buffer entry that might not yet be visible
to hardware threads in other cores.

> Agreed?

Sorry, but no.

> So the above isn't really about lwsync/sync/memory ordering any more,
> the above is basically rewriting things purely about cacheline states
> as seen by individual processors. And I know the POWER cache coherency
> is really complex (iirc cachelines can have 11+ states - we're not
> talking about MOESI any more), but even there you have to have a
> notion of "exclusive access" to be able to write in the end. So my
> states are certainly simplified, but I don't see how that basic rule
> can be violated (and still be called cache coherent).
> 
> And let's just look at the individual "events" on these CPU's:
> 
>  - A = CPU0 acquires exclusive access to cacheline X (in order to
> write 1 into it)
>  - B = CPU0 releases its exclusive access to cacheline lock (after
> having written 0 into it)
>  - C = CPU1 reads shared cacheline lock as being zero
>  - D = CPU1 reads shared cacheline Y as being zero
>  - E = CPU2 acquires exclusive access to cacheline Y (in order to
> write 1 into it)
>  - F = CPU2 reads shared cacheline X as being zero
> 
> and you cannot order these events arbitrarily, but there *ARE* certain
> orderings you can rely on:
> 
>  - within a CPU due to the barriers. This gives you
> 
>     A < B, C < D and E < F

Again, consider the case of shared store buffers.  In that case,
A < B and C < D does -not- imply A < D because the two lwsyncs will
-not- order a prior store (CPU 0's store to X) with a later load
(CPU 1's load from Y).  To get that guarantee, at least one of the
lwsync instructions needs to instead be a sync.

>  - with regards to a particular cacheline because of that cacheline
> coming exclusive (or, in the case of 'B', moving out of exclusive
> state) and the value it has at that point:
> 
>     B < C, F < A and D < E
> 
> And as far as I can tell, the above gives you: A < B < C < D < E < F <
> A. Which doesn't look possible.
> 
> So which step in my rumination here is actually wrong? Because I
> really don't see how you can get that "(r1 == 0 && r2 == 0)" on real
> hardware using cache coherency.
> 
> *SOME* assumption of mine must be wrong. But I don't see which one.

Wrong assumption 1.  Each hardware thread has its own store buffer.

Wrong assumption 2.  All architectures guarantee multi-copy atomic behavior.

							Thanx, Paul

>                    Linus
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>