On Wed, 29 May 2024 at 11:50, Maciej W. Rozycki <macro@xxxxxxxxxxx> wrote:
>
> The only difference here is that with
> hardware read-modify-write operations atomicity for sub-word accesses is
> guaranteed by the ISA, however for software read-modify-write it has to be
> explicitly coded using the usual load-locked/store-conditional sequence in
> a loop.

I have some bad news for you: the old alpha CPUs not only screwed up the byte/word design, they _also_ screwed up the load-locked/store-conditional.

You'd think that LL/SC would be done at a cacheline level, like any sane person would do. But no.

The 21064 actually did atomicity with an external pin on the bus, the same way people used to do before caches even existed. Yes, it has an internal L1 D$, but it is a write-through cache, and clearly things like cache coherency weren't designed for.

In fact, LL/SC is even documented to not work in the external L2 cache ("Bcache" - don't ask me why the odd naming).

So LL/SC on the 21064 literally works on external memory.

Quoting the reference manual:

  "A.6 Load Locked and Store Conditional

  The 21064 provides the ability to perform locked memory accesses through the LDxL (Load_Locked) and STxC (Store_Conditional) cycle command pair. The LDxL command forces the 21064 to bypass the Bcache and request data directly from the external memory interface. The memory interface logic must set a special interlock flag as it returns the data, and may optionally keep the locked address"

End result: an LL/SC pair is very very slow. It was incredibly slow even for the time. I had benchmarks, I can't recall them, but I'd like to say "hundreds of cycles". Maybe thousands.

So actual reliable byte operations are not realistically possible on the early alpha CPUs. You can do them with LL/SC, sure, but performance would be so horrendously bad that it would be just sad.

The 21064A had some "fast lock" mode which allows the data from the LDQ_L to come from the Bcache. So it still isn't exactly fast, and it still didn't work at CPU core speeds, but at least it worked with the external cache.

Compilers will generate the sequence that DEC specified, which isn't thread-safe.

In fact, it's worse than "not thread safe". It's not even safe on UP with interrupts, or even signals in user space. It's one of those "technically valid POSIX", since there's "sig_atomic_t" and if you do any concurrent signal stuff you're supposed to only use that type. But it's another of those "Yeah, you'd better make sure your structure members are either 'int' or bigger, or never accessed from signals or interrupts, or they might clobber nearby values".

               Linus
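
For illustration only, here is a minimal sketch of why a software read-modify-write byte store can clobber nearby values, and how a retry loop avoids it. It is not the sequence DEC specified and not Alpha code: it is portable C11, with a compare-and-swap loop from <stdatomic.h> standing in for LL/SC, and a callback standing in for an interrupt or signal that arrives between the load and the store. All function names are hypothetical, and byte 0 is simply taken to be the low-order byte of the word.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for an interrupt/signal handler. */
typedef void (*intr_fn)(_Atomic uint32_t *);

/* Return `word` with byte `idx` (0 = low-order byte) replaced by `val`. */
static uint32_t insert_byte(uint32_t word, unsigned idx, uint8_t val)
{
    assert(idx < 4);
    unsigned shift = idx * 8;
    return (word & ~((uint32_t)0xff << shift)) | ((uint32_t)val << shift);
}

/*
 * Plain software byte store: load the containing word, merge the byte,
 * store the whole word back. Each step is atomic, the sequence is not.
 */
static void store_byte_plain(_Atomic uint32_t *word, unsigned idx,
                             uint8_t val, intr_fn intr)
{
    uint32_t old = atomic_load(word);
    if (intr)
        intr(word);                 /* "interrupt" lands mid-sequence */
    atomic_store(word, insert_byte(old, idx, val)); /* stale neighbours */
}

/* Same store, but retried until no one changed the word in between. */
static void store_byte_cas(_Atomic uint32_t *word, unsigned idx,
                           uint8_t val, intr_fn intr)
{
    uint32_t old = atomic_load(word);
    if (intr)
        intr(word);                 /* same "interrupt", same place */
    /* On failure `old` is refreshed with the current word; retry. */
    while (!atomic_compare_exchange_weak(word, &old,
                                         insert_byte(old, idx, val)))
        ;
}

/* The simulated handler writes the neighbouring byte 1. */
static void intr_sets_byte1(_Atomic uint32_t *word)
{
    store_byte_plain(word, 1, 0xBB, 0);
}

/* Store 0xAA into byte 0 while the handler stores 0xBB into byte 1. */
uint32_t demo_plain(void)
{
    _Atomic uint32_t w = 0;
    store_byte_plain(&w, 0, 0xAA, intr_sets_byte1);
    return atomic_load(&w);
}

uint32_t demo_cas(void)
{
    _Atomic uint32_t w = 0;
    store_byte_cas(&w, 0, 0xAA, intr_sets_byte1);
    return atomic_load(&w);
}
```

In this sketch demo_plain() returns 0x000000AA, so the handler's write to byte 1 is lost, while demo_cas() returns 0x0000BBAA. The retry loop is the part that would be "hundreds of cycles" per iteration on hardware where LL/SC goes out to external memory.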