Re: Is Alpha non-other-multicopy atomic?

Ignacio Encinas Rubio <ignacio@xxxxxxxxxxxx> · Tue, 18 Mar 2025 19:16:38 +0100

Not trimming email content in case it makes reading the reply less
confusing. I hope I'm not being too annoying and wasting too much of
your time on this.

On 17/3/25 10:58, Akira Yokosawa wrote:
> Ignacio Encinas Rubio wrote:
>> On 16/3/25 12:37, Akira Yokosawa wrote:
>>> As is well known, Alpha is infamous for its lack of address-dependency
>>> guarantees.  I don't see much point in discussing whether it is multicopy
>>> atomic or not.
>>
>> I agree this is not particularly important, I just found contradicting
>> information and wanted to clarify/fix Table 15.5.
>>
>>> If you have access to an Alpha machine with 3 or more CPUs,
>>> you should be able to run this test with the help of klitmus7.
>>
>> Thanks for the test. Sadly, I don't have access to an Alpha CPU (I was
>> born after DEC was bought by Compaq)
> 
> That is what I had anticipated ...
> 
>>
>>> Finally, my mental model of other-multi-copy might be different from
>>> those defined in papers you cited below. 
>>
>> Might be, I have just realized mine is wrong. I have just run Listing 
>> 15.16 (C-WRC...) in herd using the "linux" model to realize the exists
>> clause can trigger. 
>>
>> Note that it can do so while "RFE" imposes order for the "linux" model 
>> (it is part of happens-before), so my original comment stating 
>>
>>   non-mca == rfe does not impose order 
>>
>> is wrong if Read[X = 0] (fr) -> Write[X = 1] does not mean that the read
>> is ordered before the write. However, this doesn't seem to be the case
>> for Alpha [3]
>>
>>   Section 5.6.1.2: The ordering relation Before (<=) is acyclic
>>   Section 5.6.1.4: If u and v are overlapping read/write accesses, at
>>     least one of which is a write, then u and v must be comparable in 
>>     the BEFORE (<=) ordering, that is, either u <= v or v <= u.
>>
> 
> The wording "If u and v are overlapping read/write accesses" sounds to
> me like it has nothing to do with what perfbook calls "multicopy atomicity".
> It is more related to the concept of perfbook's "single-copy atomicity" or
> "coherency", isn't it?
> 
> Quote from Section 15.3.6:
> 
>   On cache-coherent platforms, all CPUs agree on the order of loads and
>   stores to a given variable. Fortunately, when READ_ONCE() and WRITE_ONCE()
>   are used, almost all platforms are cache-coherent, as indicated by the "SV"
>   column of the cheat sheet shown in Table 15.3. Unfortunately, this property
>   is so popular that it has been named multiple times, with "single-variable
>   SC", "single-copy atomic" [SF95], and just plain "coherence" [AMP+11]
>   having seen use.
> 
> Misorderings related to address-dependent loads on Alpha happen only
> when two variables belong to two different cache banks as shown in
> Figure 15.25.
> They don't happen when u and v overlap (and share a cache line).
> 
> I'm suspecting that the meaning of "multicopy atomicity" in perfbook
> (or LKMM), and/or my interpretation of it, might not be exactly the
> same as what is widely accepted by computer industry people.
> 
> Are we on the same page now?

Yes, I was only (incorrectly) trying to elaborate on my assumption of
what makes a memory model non-multicopy atomic. I was under the false
impression that it could only come from relaxing "Read From External
(rfe)", similar to how "Read from Internal (rfi)" (store forwarding, if
you prefer) is relaxed. In hardware terms, this could be sharing
the Store Buffer in a multithreaded CPU (presumably what PowerPC
implementations do).

I simply stared at the "WRC" litmus test diagram generated by herd [1]
for the case where the "exists" clause triggers and became more and more
confused about how "a: W[once][x]" could fail to "happen before (hb)" "e:
R[once][x]" (see attached svg file). Somehow I reached the conclusion
that the LKMM must allow this by making "e" and "a" non-comparable,
which is why I invoked Section 5.6.1.4 from Alpha's manual. However, the
comparison is flawed, most notably because on Alpha a write (W) that
consumes a read's (R) result isn't ordered with respect to (R).

[1] https://diy.inria.fr/www/#

>> It's still my opinion that Alpha is other-multi-copy atomic but I
>> understand it is a bit pointless to discuss about this... Sorry, I
>> couldn't resist.

As to my opinion, I'll try to elaborate in "hardware" terms:

[2] states the following: 

| Broadly, non-MCA permits two hardware optimisations: (1) a shared,
| pre-cache store-buffer that allows early forwarding of data between a
| subset of the [...] (2) [...]
| invalidations to other caches participating in a cache-coherence
| protocol without waiting for their acknowledgement. Whilst these
| optimisations may be worthwhile (or even necessary) in some contexts,
| for ARM there was a clear internal conclusion that these optimisations
| offer little benefit in ARM’s context, partly because the ARM bus
| architecture (AMBA) has always been MCA

Alpha (as far as I know) didn't have (1) and its reference manual
clearly forbids it (at least in a naive manner where consistency isn't
taken care of through speculation recovery mechanisms). 

If we look at (2), it seems that Alpha did indeed perform writes after
just having received the invalidation acknowledgement (but without
having the other cores actually invalidate the cache line). This is
depicted in the perfbook's Section C.4.2 and some of its references such
as [Gha95, Section 5.4.1]. If we consider the definition of "Multi-copy
atomicity" or ARM's "Other-multi-copy atomicity", it means that a write
becomes visible to every _other_ agent at the same time. 

Then, the question basically becomes whether having received an
early invalidation acknowledgement from "Processor i (Pi)" means that
the write is visible to Pi. This should settle whether Alpha is
non-multicopy atomic or not.
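
To make "visible to every _other_ agent at the same time" a bit more
concrete, here is a small Python sketch. It is only an illustration
under strong assumptions: a single shared memory (so every write is
visible to all other threads at once) and strictly in-order execution,
which is stronger than multicopy atomicity alone. The three threads
follow my reading of the WRC shape, without any barrier, and all names
(interleavings, run, P0, P1, P2) are made up for this example.

```python
def interleavings(threads):
    """Yield every merge of the per-thread op lists that keeps program order."""
    if all(not t for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail


def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, var, arg in schedule:
        if kind == "store_const":
            mem[var] = arg            # store a constant
        elif kind == "store_reg":
            mem[var] = regs[arg]      # store a previously loaded value
        else:                         # "load"
            regs[arg] = mem[var]
    return regs["r1"], regs["r2"], regs["r3"]


# WRC-shaped test (assumed shape): P0 writes x, P1 forwards x into y,
# P2 reads y and then x.
P0 = [("store_const", "x", 1)]
P1 = [("load", "x", "r1"), ("store_reg", "y", "r1")]
P2 = [("load", "y", "r2"), ("load", "x", "r3")]

outcomes = {run(s) for s in interleavings([P0, P1, P2])}
print(sorted(outcomes))
```

Under these assumptions the outcome (r1, r2, r3) = (1, 1, 0) never shows
up in outcomes, because once P2 has seen y == 1 the write to x is already
in the only copy of memory there is. The interesting question for Alpha
is whether early acknowledgements break this picture.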

[Gha95, Page 181] says the following:

| Maintaining Multiple-Copy Atomicity: As we discussed earlier in this 
| section, enforcing the third category of multiprocessor dependence 
| chains requires providing the illusion of multiple-copy atomicity in 
| addition to maintaining the program order among operations. This 
| section describes the implementation of the two techniques for 
| enforcing multiple-copy atomicity (that were introduced earlier) in 
| more detail

[...] 

| The functionality is not needed at all for the PC, RCpc, and PowerPC 
| models since these models do not enforce any category three 
| multiprocessor dependence chains. For the SC, TSO, IBM-370, PSO, WO, 
| RCsc, Alpha, and RMO models, all writes must be treated conservatively

Note that Alpha is described as enforcing "category three multiprocessor
dependence chains", which, from the context, we can infer means being
multi-copy atomic(?). Later on, [Gha95, Section 5.4.1] says the following
when discussing early acknowledgement:

| However, naive applications of this optimization can lead to incorrect 
| implementations since an acknowledgement reply no longer signifies the 
| completion of the write with respect to the target processor. 
| This section describes two distinct implementation techniques that 
| enable the safe use of early acknowledgements

[...]

| The second solution does not impose any ordering constraints
| among incoming messages. Instead, it requires previously committed 
| invalidation and update requests to be serviced anytime program order 
| is enforced for satisfying a multiprocessor dependence chain. In the 
| example, this latter solution would force the invalidation request to 
| be serviced (e.g., by flushing the incoming queue) as part of 
| enforcing the program order from the read of B to the read of A

In other words, I would say that if we put together

1.- "There are no implied barriers in Alpha. If an implied barrier is
     needed for functionally correct access of shared data, it must be
     written as an explicit instruction"

and 

2.- Read memory barriers drain the pending invalidations' queue

it *effectively* means that if a read "R1" has directly or indirectly
observed a write "W", and a memory barrier (the only way to enforce
ordering) separates "R1" from a later read "R2", then "R2" must also
see "W".

Using "WRC" (Listing 15.16) and its associated Table 15.4 as an example,
it would mean that "r3 = READ_ONCE(*x)" CAN'T read a 0 on Alpha because
the read memory barrier drains the pending invalidation queue.
Therefore, r3 would miss in cache and be served with P0's write.
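
A toy Python sketch of this argument follows. It is not a model of real
Alpha hardware: it assumes one shared memory, per-CPU caches that can
hold stale lines, invalidations that are acknowledged early (merely
queued at the receiver), and a read barrier that drains the queue. The
names ToyCPU, ToySystem and run_wrc are hypothetical, and the WRC shape
is my reading of the litmus test.

```python
from collections import deque


class ToyCPU:
    def __init__(self, memory):
        self.memory = memory
        self.cache = {}                   # var -> possibly stale value
        self.invalidate_queue = deque()   # acknowledged but not yet applied

    def read(self, var):
        # A cached line is used even if an invalidation for it is queued.
        if var not in self.cache:
            self.cache[var] = self.memory[var]
        return self.cache[var]

    def rmb(self):
        # Assumption: a read barrier applies every queued invalidation.
        while self.invalidate_queue:
            self.cache.pop(self.invalidate_queue.popleft(), None)


class ToySystem:
    def __init__(self, ncpus, initial):
        self.memory = dict(initial)
        self.cpus = [ToyCPU(self.memory) for _ in range(ncpus)]

    def write(self, cpu_id, var, value):
        # Early acknowledgement: the write completes as soon as the
        # invalidations are queued at the other CPUs.
        self.memory[var] = value
        self.cpus[cpu_id].cache[var] = value
        for i, cpu in enumerate(self.cpus):
            if i != cpu_id:
                cpu.invalidate_queue.append(var)


def run_wrc(use_rmb):
    system = ToySystem(3, {"x": 0, "y": 0})
    p0, p1, p2 = system.cpus
    p2.read("x")                  # P2 starts with x == 0 in its cache
    system.write(0, "x", 1)       # P0: x = 1
    r1 = p1.read("x")             # P1: r1 = x
    system.write(1, "y", r1)      # P1: y = r1
    r2 = p2.read("y")             # P2: r2 = y
    if use_rmb:
        p2.rmb()                  # P2: read barrier drains the queue
    r3 = p2.read("x")             # P2: r3 = x
    return r1, r2, r3


print(run_wrc(use_rmb=True))
print(run_wrc(use_rmb=False))
```

In this toy, r3 can only be 0 when the barrier is left out, since P2
then keeps using its stale cached copy of x. With the barrier, the
queued invalidation is applied first and the read of x misses in cache.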

Thank you *very much* for the discussion :)

[2] https://www.cl.cam.ac.uk/~pes20/armv8-mca/armv8-mca-draft.pdf
[Gha95] perfbook's link is down, it is also available here: 
https://github.com/kaitoukito/Computer-Science-Textbooks/blob/master/Memory-Consistency-Models-for-Shared-Memory-Multiprocessors.pdf

> 
> No need to apologize.
> Any feedback from a fresh minded reader is much appreciated!
> 
>         Thanks, Akira
> 
>>
>> Thanks
>>
>> [3] https://download.majix.org/dec/alpha_arch_ref.pdf
>>
>>
> 
Attachment: wrc-exists.svg (image/svg)