On Wed, 29 May 2024, Linus Torvalds wrote:

> > The only difference here is that with hardware read-modify-write
> > operations atomicity for sub-word accesses is guaranteed by the ISA,
> > however for software read-modify-write it has to be explicitly coded
> > using the usual load-locked/store-conditional sequence in a loop.
>
> I have some bad news for you: the old alpha CPU's not only screwed up
> the byte/word design, they _also_ screwed up the
> load-locked/store-conditional.
>
> You'd think that LL/SC would be done at a cacheline level, like any
> sane person would do.
>
> But no.
>
> The 21064 actually did atomicity with an external pin on the bus, the
> same way people used to do before caches even existed.

 Umm, 8086's LOCK#, anyone?

> Yes, it has an internal L1 D$, but it is a write-through cache, and
> clearly things like cache coherency weren't designed for. In fact,
> LL/SC is even documented to not work in the external L2 cache
> ("Bcache" - don't ask me why the odd naming).

 Board cache, I suppose.

> So LL/SC on the 21064 literally works on external memory.
>
> Quoting the reference manual:
>
>   "A.6 Load Locked and Store Conditional
>
>   The 21064 provides the ability to perform locked memory accesses
>   through the LDxL (Load_Locked) and STxC (Store_Conditional) cycle
>   command pair. The LDxL command forces the 21064 to bypass the Bcache
>   and request data directly from the external memory interface. The
>   memory interface logic must set a special interlock flag as it
>   returns the data, and may optionally keep the locked address"
>
> End result: a LL/SC pair is very very slow. It was incredibly slow
> even for the time. I had benchmarks, I can't recall them, but I'd like
> to say "hundreds of cycles". Maybe thousands.

 Interesting and disappointing, given how many years the Alpha designers
had to learn from the MIPS R4000.
Which they borrowed from already after all, and which they had first-hand
experience with, present onboard the R4000 DECstation systems built at
their WSE facility. Hmm, I wonder if there was patent avoidance involved.

> So actual reliable byte operations are not realistically possible on
> the early alpha CPU's. You can do them with LL/SC, sure, but
> performance would be so horrendously bad that it would be just sad.

 Hmm, performance with a 30-year-old system? Who cares! It mattered 30
years ago, maybe 25. And the performance of a system that runs slowly is
still infinitely better than that of a system that doesn't boot anymore,
isn't it?

> The 21064A had some "fast lock" mode which allows the data from the
> LDQ_L to come from the Bcache. So it still isn't exactly fast, and it
> still didn't work at CPU core speeds, but at least it worked with the
> external cache.
>
> Compilers will generate the sequence that DEC specified, which isn't
> thread-safe.
>
> In fact, it's worse than "not thread safe". It's not even safe on UP
> with interrupts, or even signals in user space.

 Ouch, I find it a surprising oversight. Come to think of it, indeed the
plain unlocked read-modify-write sequences are unsafe. I don't suppose
any old DECies are still around, but any idea how this was sorted in
DEC's own commercial operating systems (DU and OVMS)?

 So this seems like something that needs to be sorted in the compiler, by
always using a locked sequence for 8-bit and 16-bit writes with non-BWX
targets. I can surely do it myself, not a big deal, and I reckon such a
change to GCC should be pretty compact and self-contained, as all the
bits are already within `alpha_expand_mov_nobwx' anyway. I'm not sure if
Richard will be happy to accept it, but it seems to me the right thing to
do at this point, and with that in place there should be no safety
concern for RCU or anything with the old Alphas, with no effort at all on
the Linux side as all the burden will be on the compiler.
 We may want to probe for the associated compiler option though and bail
out if unsupported. Will it be enough to keep Linux support at least
until the next obstacle?

  Maciej