On 04/05/17 20:04, Andrew Haley wrote:
> On 04/05/17 16:52, Toebs Douglass wrote:
>> On 04/05/17 16:21, Andrew Haley wrote:
>>> Either works.  The mappings from C++ atomics to processors are here:
>>>
>>> https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
>>
>> Ah, that is interesting, and makes a lot of sense.
>>
>> For SC, it's atomic.  For everything else, not - which means for
>> everything else, although ordering is of course guaranteed, visibility
>> is not, and we rely on the processor doing something "in a reasonable
>> time" (which might for example be long enough that things break).
>
> Umm, what?  All access modes are atomic.

We have to be careful here, because we may have different ideas about
what atomic means.  I may be completely wrong, but I think I understand
memory barriers and atomic operations; however, I do not always use the
formal terms exactly as they are used in the field.  So, here, I am not
sure what you mean by atomic.

I do think, though, that everything other than SC is not atomic (as I
use the word).  This means, for example, that the store may *never* be
seen by any other core.  In practice I'm sure this doesn't happen, but
there are no guarantees, and I think for code to be always correct, the
assumption has to be made that it is so.

>>>> I've just had a bit of an online search, as best I could, through the
>>>> GCC source code.  It looks like expand_atomic_store() does use an
>>>> atomic exchange or atomic CAS.
>>>
>>> That depends on your machine.  On mine (ARMv8) a seq.cst store uses
>>> stlr.
>>
>> I'm surprised.  I would expect that to be able to fail (because of the
>> "reasonable time").  I don't know much about ARM though (or about
>> Intel, for that matter :-)
>
> Eventually the processor will be pre-empted for some reason or the
> cache line which contains the store will be flushed because of another
> access, but it could be a long wait.  I've seen delays of thousands
> of instructions, but it could be longer.

This - cache line flushing - is not the issue I have in mind, for once
the data in question has reached a cache, it is participating in the
MESI protocol and so is visible to other processors.

The problem I have in mind is store buffers, and the fact that a store
barrier does not cause stores to complete.  The processor performing the
store will think it has issued the store, and will see the world
accordingly, *prior even to the store reaching the first level cache*,
and there is no guarantee about how long this state of affairs persists.

So if we perform a store and then a store barrier, we have nothing -
there is no guarantee that any other core has seen this store, or ever
will.  We only have a guarantee that IF a store issued *after* the store
barrier completes, then all stores prior to the barrier will be forced
to complete first (and thus honour the store barrier).

In other words, all stores which do not use LL/SC or LOCK (or an
equivalent thereof) can in effect never occur.  They're just regular
stores, with ordering constraints provided by memory barriers - I don't
think of them as atomic.  Atomic to me means the store will be forced
to complete.
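
To make concrete what I mean, here is a minimal sketch using the C11
<stdatomic.h> interface.  The instruction names in the comments are my
own reading of the mapping page and of this thread, not verified
compiler output, so take them as assumptions:

  #include <stdatomic.h>

  atomic_int flag;

  /* Relaxed store: a plain store at the instruction level (str on
     ARMv8, mov on x86, as I read the mapping page).  Nothing forces it
     out of the store buffer at any particular time.  */
  void
  store_relaxed (void)
  {
    atomic_store_explicit (&flag, 1, memory_order_relaxed);
  }

  /* Release store: on ARMv8 this should be stlr.  It orders earlier
     stores before itself, but says nothing about *when* any of them
     become visible to another core.  */
  void
  store_release (void)
  {
    atomic_store_explicit (&flag, 1, memory_order_release);
  }

  /* Seq-cst store: per the mapping page, on x86 this is mov+mfence or
     a (locked) xchg; on ARMv8, as you say, it is just stlr.  The
     locked/xchg case is what I had been calling "atomic", i.e. a store
     which is forced to complete - though, as you point out, that
     description does not hold on every target.  */
  void
  store_seq_cst (void)
  {
    atomic_store_explicit (&flag, 1, memory_order_seq_cst);
  }

My concern is with the first two cases: the barrier gives ordering
between them, but, as above, ordering alone does not give a guarantee
of visibility.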