Re: Memory model release/acquire mode interactions of relaxed atomic operations

Toebs Douglass <toby@xxxxxxxxxxxxxx> · Fri, 5 May 2017 18:27:25 +0200

On 05/05/17 17:03, Andrew Haley wrote:
> On 05/05/17 10:09, Toebs Douglass wrote:

>> So this is I think a point where we think different things.
>>
>> I would qualify my view to say this : I can't say in what may or may not
>> happen on any particular architecture as I simply don't know details
>> about most implementations, but, although I may be wrong, I do think
>> that in general memory barriers do not cause stores to complete.  I mean
>> to say this in the sense of writing cross-platform code which assumes
>> the minimum guarantees provided across platforms.
> 
> You're using the word "complete" without explaining what you really
> mean by it.  I think you mean that a store becomes visible to other
> processors, so that's how I'll treat it.

Yes.  I mean that it reaches the MESI protocol (e.g. the first level
cache), such that if another processor issues a load barrier, it will
then see the value which was stored.  Prior to this point, although the
storing processor has written the value and sees the world as if it were
written, no other processor can know of the store, even if it issues a
load barrier.

>>> What exactly do you mean by "a store barrier"?
>>
>> A barrier issued to the processor such that all stores prior to that
>> barrier must complete before any store after the barrier completes.
> 
> Right, that's a StoreStore, i.e. it only orders stores and other stores.

Yes.  I only use LoadLoad and StoreStore - what I would call a load
barrier and a store barrier.  I've never used LoadStore or StoreLoad.
I may be wrong, but I believe more processors lack them than provide
them - Intel doesn't have them, for example (but Intel is an
exceptional case anyway, since they bundle barriers together with
atomic operations).

I don't think I've ever needed to use a full memory barrier.  I might be
using one with hazard pointers, but I can't remember offhand.

>> Where I think store barriers do not cause stores to complete, LL/SC
>> / LOCK is important, because they force a store to complete (the
>> atomic operation) and this in turn forces previously issued store
>> barriers to be honoured (i.e. that the stores before those barriers
>> must now complete, before the LL/SC / LOCK store completes).
> 
> That's because LL/SC or LOCK store often have implicit StoreLoad
> barriers associated with them.  That's true on x86, for example.  But
> it's a dangerous way to think, because they don't always: on ARMv8,
> for example, it's optional.  But this is the crux of the matter: it's
> the StoreLoad barrier that causes all of the previous stores to become
> visible, not the LL/SC or LOCK store.

Right - this is exactly the point where we differ.  I think the barrier
does nothing for completion.  It is only the LL/SC or LOCK which forces
a store, and so forces the previous store barriers to be honoured, which
forces the earlier stores to complete.  This means that if the LL/SC or
LOCK is not used, things behave very differently from what you'd expect
if you think the barrier itself forces completion.
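
For reference, the StoreLoad case under discussion is usually
illustrated with the "store buffering" litmus test.  Below is a minimal
sketch in C++11 atomics.  The function and variable names are
hypothetical, and mapping a seq_cst fence onto a "full barrier" is an
assumption about the compiler and target.  It speaks only to what the
language model guarantees, not to when a store "completes" on any
given processor.

```cpp
#include <atomic>
#include <thread>

// Store-buffering litmus test (hypothetical names, illustration only).
// Each thread stores to one variable, issues a full (seq_cst) fence,
// then loads the other variable.
bool both_saw_zero() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // includes StoreLoad
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // includes StoreLoad
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();

    // With seq_cst fences the C++ model forbids r1 == 0 && r2 == 0.
    return r1 == 0 && r2 == 0;
}
```

With the two fences weakened to release or acquire, the language model
would permit both loads to return 0, which is why this ordering is
normally attributed to StoreLoad.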

The SPARC docs indicate this is so for SPARC - these barriers are not
forcing completion - and although I may be wrong, for I am hardly well
informed about processor platforms, I think it is in fact true in
general.

I know of *no* platform where the store barrier itself will cause
earlier stores to complete.  (Although we do see, on SPARC for example,
additional barrier types, separate from load/store, which *do* control
completion - the generic, cross-platform way of doing this being to
perform an atomic operation.)

To put it another way: if in thread A I store to variable A and then
issue a store barrier, and then *later* in thread B I issue a load
barrier and load from A, *I can still fail to read the value written by
thread A*.  This is because the store barrier does not cause completion
to occur.

However, if in thread A I stored to A, issued a store barrier, and then
performed an atomic operation (LL/SC or LOCK), the atomic operation
would, by forcing a store to complete, cause all stores prior to the
store barrier to complete first.  This would finally mean that when
thread B issued its load barrier, it *would* see the value written to A
by thread A.
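
The pattern above can be sketched in portable C++11 terms.  This is
only an illustration under stated assumptions: the names (run_handoff,
a, flag) are hypothetical, a release fence stands in for the store
barrier and an acquire fence for the load barrier (both are somewhat
stronger than pure StoreStore / LoadLoad), and fetch_add stands in for
the LOCK or LL/SC operation.  Thread B here waits until it has
observed the atomic operation, which is the only point at which the
language model promises it will see the value.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical sketch: store, store barrier, atomic operation in thread A;
// observe the atomic operation, load barrier, load in thread B.
int run_handoff() {
    std::atomic<int> a{0};     // "variable A" from the text
    std::atomic<int> flag{0};  // target of the atomic read-modify-write
    int seen = -1;

    std::thread thread_a([&] {
        a.store(42, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release); // store barrier (approx.)
        flag.fetch_add(1, std::memory_order_relaxed);        // atomic RMW operation
    });
    std::thread thread_b([&] {
        // Wait until the atomic operation has been observed.
        while (flag.load(std::memory_order_relaxed) == 0) {
            std::this_thread::yield();
        }
        std::atomic_thread_fence(std::memory_order_acquire); // load barrier (approx.)
        seen = a.load(std::memory_order_relaxed);
    });
    thread_a.join();
    thread_b.join();

    assert(flag.load() == 1);
    return seen; // 42: the two fences synchronize through flag
}
```

Note that the sketch says nothing about how soon thread B observes
flag; it only shows that once it has, the earlier store to a is
visible as well.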

I may be wrong, but I think a *lot* of what is written on-line gets this
wrong.

>> So for me this is how I guarantee other cores will be able to see
>> stores (after those other cores issue a load barrier, of course).
>> Where I think a store barrier does not cause stores to complete, I
>> need a way to cause stores to complete.
> 
> That's a StoreLoad.  I don't know what your processor calls it.  It's
> usually a full barrier.

Yes.  I think of load, store and full barriers.  That paradigm works
over all processors; you need processors with more fine-grained control,
such as SPARC, to get more subtlety.
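
As a rough sketch of that three-barrier paradigm in C++11 (the wrapper
names are hypothetical, and the mapping is an approximation: an acquire
fence gives LoadLoad plus LoadStore, a release fence gives LoadStore
plus StoreStore, and only the seq_cst fence adds StoreLoad):

```cpp
#include <atomic>

// Hypothetical wrappers; the C++ fences are slightly stronger than
// pure LoadLoad / StoreStore barriers.
inline void load_barrier()  { std::atomic_thread_fence(std::memory_order_acquire); }
inline void store_barrier() { std::atomic_thread_fence(std::memory_order_release); }
inline void full_barrier()  { std::atomic_thread_fence(std::memory_order_seq_cst); }

// Single-threaded smoke test: the barriers order accesses but do not
// change the value a thread reads back from its own store.
int barrier_smoke_test() {
    std::atomic<int> v{0};
    v.store(7, std::memory_order_relaxed);
    store_barrier();
    full_barrier();
    load_barrier();
    return v.load(std::memory_order_relaxed); // 7
}
```

What instructions each fence becomes is up to the compiler and target;
on some platforms the load and store barriers may emit no instruction
at all.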