On Fri, Nov 22, 2013 at 5:36 PM, Paul E. McKenney
<paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>
> But this does -not- guarantee that some other non-lock-holding CPU 2
> will see CS0 and CS1 in order. To see this, let's fill in the two
> critical sections, where variables X and Y are both initially zero:
>
>     CPU 0 (releasing)          CPU 1 (acquiring)
>     -----                      -----
>     ACCESS_ONCE(X) = 1;        while (ACCESS_ONCE(lock) == 1)
>     lwsync                             continue;
>     ACCESS_ONCE(lock) = 0;     lwsync
>                                r1 = ACCESS_ONCE(Y);
>
> Then let's add in the observer CPU 2:
>
>     CPU 2
>     -----
>     ACCESS_ONCE(Y) = 1;
>     sync
>     r2 = ACCESS_ONCE(X);
>
> If unlock+lock act as a full memory barrier, it would be impossible to
> end up with (r1 == 0 && r2 == 0). After all, (r1 == 0) implies that
> CPU 2's store to Y happened after CPU 1's load from Y, and (r2 == 0)
> implies that CPU 0's store to X happened after CPU 2's load from X.
> If CPU 0's unlock combined with CPU 1's lock really acted like a full
> memory barrier, we would end up with CPU 0's store happening before
> CPU 1's load, happening before CPU 2's store, happening before CPU 2's
> load, happening before CPU 0's store.
>
> However, the outcome (r1 == 0 && r2 == 0) really does happen both
> in theory and on real hardware.

Ok, so I have ruminated. But even after having ruminated, the one thing
I cannot find is support for your "(r1 == 0 && r2 == 0) really does
happen on hardware".

Ignore theory, and assume just cache coherency, ie the notion that in
order to write to a cacheline, you have to have that cacheline in some
exclusive state. We have three cachelines, X, Y and lock (and our
cachelines only have a single bit, starting out as 0, 0 and 1
respectively).

CPU0:

    write X = 1;
    lwsync
    write lock = 0;

doesn't even really require any cache accesses at all per se, but it
*does* require that the two stores be ordered in the store buffer on
CPU0 in such a way that cacheline X gets updated (ie is in
exclusive/dirty state in CPU0 with the value 1) before cacheline 'lock'
gets released from its exclusive/dirty state after having itself been
updated to 0.

So basically we know that *within* CPU0, by the time the lock actually
makes it out of the CPU, the cacheline containing X will have been in
dirty mode with the value "1". The core might actually have written
'lock' first, but it can't release that cacheline from exclusive state
(so that it is visible anywhere else) until it has _also_ gotten 'X'
into exclusive state (once both cachelines are exclusive within CPU0,
it can do any ordering, because the ordering won't be externally
visible).

And this is all just looking at CPU0, nothing else. But if it is
exclusive/dirty on CPU0, then it cannot be shared in any other CPU
(although a previous stale value may obviously still be "in flight"
somewhere else outside of the coherency domain).

So let's look at CPU2, which is similar, but now the second access is a
read (of zero), not a write:

CPU2:

    write Y = 1;
    sync
    read X as zero

So if 'sync' is a true memory barrier between the write and the read,
then we know that the following is true: CPU2 must have gotten
cacheline 'Y' into exclusive state, and acquired (or held on to, which
is equivalent) cacheline 'X' in shared state _after_ it got that Y into
exclusive state. It can't rely on some "in flight" previously read
value of 'X' from before it got Y into exclusive state. Otherwise it
wouldn't be an ordering between the write and the read, agreed?
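(As an aside, the whole three-CPU test can be written as a user-space
program. This is just a sketch, using the usual - and somewhat lossy -
mapping of the kernel primitives onto C11 atomics: a release store for
the lwsync+store unlock, an acquire load for the load+lwsync lock side,
and a seq_cst fence for sync. On strongly ordered hardware like x86 the
interesting outcome will never show up; the claim under discussion is
that on POWER it eventually can:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int X, Y, lock;
    static int r1, r2;

    static void *cpu0(void *arg)
    {
        atomic_store_explicit(&X, 1, memory_order_relaxed);
        /* lwsync + "write lock = 0": modelled as a release store */
        atomic_store_explicit(&lock, 0, memory_order_release);
        return NULL;
    }

    static void *cpu1(void *arg)
    {
        /* spin on lock; "load + lwsync" modelled as an acquire load */
        while (atomic_load_explicit(&lock, memory_order_acquire) == 1)
            continue;
        r1 = atomic_load_explicit(&Y, memory_order_relaxed);
        return NULL;
    }

    static void *cpu2(void *arg)
    {
        atomic_store_explicit(&Y, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* sync */
        r2 = atomic_load_explicit(&X, memory_order_relaxed);
        return NULL;
    }

    int main(void)
    {
        for (long i = 0; i < 1000000; i++) {
            pthread_t t0, t1, t2;

            /* reset: X = 0, Y = 0, lock = 1 (held) */
            atomic_store(&X, 0);
            atomic_store(&Y, 0);
            atomic_store(&lock, 1);

            pthread_create(&t0, NULL, cpu0, NULL);
            pthread_create(&t1, NULL, cpu1, NULL);
            pthread_create(&t2, NULL, cpu2, NULL);
            pthread_join(t0, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);

            if (r1 == 0 && r2 == 0) {
                printf("hit (r1 == 0 && r2 == 0) on iteration %ld\n", i);
                return 0;
            }
        }
        printf("never observed (r1 == 0 && r2 == 0)\n");
        return 0;
    }

Build with something like "cc -pthread"; creating fresh threads per
iteration makes this a very blunt instrument compared to the real
litmus-test tooling, but the mapping of the barriers is the point.)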
The pattern on CPU1, meanwhile, is somewhat different. But I'm going to
ignore the "while" part and just concentrate on the last iteration of
the loop, so it turns into:

CPU1:

    read lock as zero
    lwsync
    read Y as zero

It only does reads, so it is happy with a shared cacheline. But in
order for the lwsync to actually act as an acquire, it does mean that
cacheline 'Y' needs to be in some shared state within the cache
coherency domain after (or, again, across) cacheline 'lock' having been
in a shared state with value == 0 on CPU1. No "in flight" values.
Agreed?

So the above isn't really about lwsync/sync/memory ordering any more;
the above is basically rewriting things purely in terms of cacheline
states as seen by individual processors. And I know the POWER cache
coherency is really complex (iirc cachelines can have 11+ states -
we're not talking about MOESI any more), but even there you have to
have a notion of "exclusive access" to be able to write in the end. So
my states are certainly simplified, but I don't see how that basic
rule can be violated (and still be called cache coherent).

And let's just look at the individual "events" on these CPUs:

 - A = CPU0 acquires exclusive access to cacheline X (in order to
   write 1 into it)
 - B = CPU0 releases its exclusive access to cacheline lock (after
   having written 0 into it)
 - C = CPU1 reads shared cacheline lock as being zero
 - D = CPU1 reads shared cacheline Y as being zero
 - E = CPU2 acquires exclusive access to cacheline Y (in order to
   write 1 into it)
 - F = CPU2 reads shared cacheline X as being zero

and you cannot order these events arbitrarily, but there *ARE* certain
orderings you can rely on:

 - within a CPU, due to the barriers. This gives you A < B, C < D and
   E < F

 - with regard to a particular cacheline, because of that cacheline
   becoming exclusive (or, in the case of 'B', moving out of exclusive
   state) and the value it has at that point: B < C, F < A and D < E

And as far as I can tell, the above gives you:

    A < B < C < D < E < F < A

which doesn't look possible.

So which step in my rumination here is actually wrong? Because I really
don't see how you can get that "(r1 == 0 && r2 == 0)" on real hardware
using cache coherency.

*SOME* assumption of mine must be wrong. But I don't see which one.

               Linus
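PS: Just as a sanity check that the six orderings above really do form
a cycle (and not some mis-transcription on my part), here's a trivial
little program - again only a sketch - that encodes A < B, C < D,
E < F, B < C, F < A and D < E as edges in a graph and asks whether any
event would have to happen before itself:

    #include <stdbool.h>
    #include <stdio.h>

    enum { A, B, C, D, E, F, NEVENTS };

    /* edge[i][j] == true means "event i must happen before event j" */
    static bool edge[NEVENTS][NEVENTS];

    /* depth-first search: can we reach 'to' from 'from' along edges? */
    static bool reaches(int from, int to, bool seen[NEVENTS])
    {
        if (edge[from][to])
            return true;
        seen[from] = true;
        for (int mid = 0; mid < NEVENTS; mid++)
            if (edge[from][mid] && !seen[mid] && reaches(mid, to, seen))
                return true;
        return false;
    }

    int main(void)
    {
        /* the per-CPU barrier orderings: A < B, C < D, E < F */
        edge[A][B] = edge[C][D] = edge[E][F] = true;
        /* the per-cacheline coherency orderings: B < C, F < A, D < E */
        edge[B][C] = edge[F][A] = edge[D][E] = true;

        for (int ev = A; ev < NEVENTS; ev++) {
            bool seen[NEVENTS] = { false };
            if (reaches(ev, ev, seen))
                printf("event %c must happen before itself\n", 'A' + ev);
        }
        return 0;
    }

It duly reports that every one of the six events would have to precede
itself, ie the constraints really are circular - which is exactly why I
don't see how the outcome is possible under plain cache coherency.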