On Fri, 31 May 2024 at 08:48, Arnd Bergmann <arnd@xxxxxxxx> wrote:
>
>  /* Is this type a native word size -- useful for atomic operations */
>  #define __native_word(t) \
> -       (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
> -        sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
> +       (sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
>
>  #ifdef __OPTIMIZE__
>  # define __compiletime_assert(condition, msg, prefix, suffix) \
>
> The WRITE_ONCE() calls tend to be there in order to avoid
> expensive atomics or locking when something can be expressed
> with a store that is known to be visible atomically (on all
> other architectures).

No. If you go down this road, then you would want to do the same thing
we do for READ_ONCE() - but for a different reason - hook into it for
alpha, and add a memory barrier to get rid of the crazy alpha memory
ordering:

  /*
   * Alpha is apparently daft enough to reorder address-dependent loads
   * on some CPU implementations. Knock some common sense into it with
   * a memory barrier in READ_ONCE().
   *
   * For the curious, more information about this unusual reordering is
   * available in chapter 15 of the "perfbook":
   *
   *  https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
   */
  #define __READ_ONCE(x)						\
  ({									\
	__unqual_scalar_typeof(x) __x =					\
		(*(volatile typeof(__x) *)(&(x)));			\
	mb();								\
	(typeof(x))__x;							\
  })

and the solution would be to make a __WRITE_ONCE() that then uses
"sizeof()" to decide at compile-time whether it can just do it as a
regular write, or whether it needs to do it as an LL/SC loop (see the
sketch further down). Because we're definitely not changing hundreds -
probably thousands - of random generic data structures.

That said, the above fixes WRITE_ONCE() without changing the definition
of what a native word size is, but it doesn't actually *fix* the
problem. Let's take a really simple example:

  struct net_device {
	...
	u8 reg_state;
	bool dismantle;
	enum {
		RTNL_LINK_INITIALIZED,
		RTNL_LINK_INITIALIZING,
	} rtnl_link_state:16;
	...

These are all in the same 32-bit word, and we intentionally have code
without locking like this:

	WRITE_ONCE(dev->reg_state, NETREG_RELEASED);
	...
	return READ_ONCE(dev->reg_state) <= NETREG_REGISTERED;

because the code knows the state machine ordering requirements (i.e.
once it has moved past NETREG_REGISTERED, it won't move back).

So now - assuming we fix WRITE_ONCE() to use LL/SC - these READ_ONCE()
and WRITE_ONCE() games work fine on alpha.

BUT. Code that then does something like this:

	dev->dismantle = true;

which is all nice and good (accesses are done under the RTNL lock) now
will possibly race with the unlocked reg_state accesses.

So it's still fundamentally buggy.

And before you say "that's why I wanted to fix the __native_word()
definition", please realize that the above happens EVEN WITH the
READ_ONCE/WRITE_ONCE being done on an "int".

Yes, really.

The READ_ONCE and WRITE_ONCE will be individual instructions. But
lookie here, if we have

	u32 reg_state;
	bool dismantle;

and they happen to share the same 8-byte word, and somebody passes
'&dismantle' off to something that does byte writes to it, guess what
the canonical byte write sequence is?

That's right, it looks something like this (excuse any bugs, this is
from memory and looking up the ops in the architecture manual):

	LDQ_U	tmp,(addr)
	INSBL	byte,addr,tmp2
	MSKBL	tmp,addr,tmp
	OR	tmp,tmp2,tmp
	STQ_U	tmp,(addr)

and notice how in the process it read and then wrote that supposedly
atomic 'reg_state' that was otherwise accessed purely with 32-bit
atomic instructions?
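To spell out what that sequence does, here's a rough C equivalent (a
sketch only - the helper below is illustrative, not actual compiler
output or kernel code): the one-byte store turns into a non-atomic
read-modify-write of the entire containing quadword.

  /*
   * Illustrative C version of the LDQ_U/INSBL/MSKBL/OR/STQ_U sequence
   * above, for a little-endian alpha without BWX. Hypothetical helper
   * name, for exposition only.
   */
  static inline void byte_store(unsigned char *addr, unsigned char val)
  {
	unsigned long *qp = (unsigned long *)((unsigned long)addr & ~7UL);
	unsigned int shift = ((unsigned long)addr & 7) * 8;
	unsigned long tmp;

	tmp = *qp;				/* LDQ_U: load whole quadword */
	tmp &= ~(0xffUL << shift);		/* MSKBL: clear the target byte */
	tmp |= (unsigned long)val << shift;	/* INSBL + OR: insert new byte */
	*qp = tmp;				/* STQ_U: store whole quadword */
  }

Any 32-bit value that happens to live in the same quadword gets read
and rewritten as part of that sequence - which is exactly the race.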
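And to make the earlier __WRITE_ONCE() suggestion concrete, the
compile-time sizeof() dispatch would look something along these lines
(again purely a sketch - __ll_sc_store() is a made-up name for what
would really be an LDx_L/STx_C loop on the containing aligned word):

  /*
   * Sketch only: plain volatile store for int/long sized objects,
   * and a hypothetical LL/SC-based helper for anything smaller.
   */
  #define __WRITE_ONCE(x, val)					\
  do {								\
	if (sizeof(x) == sizeof(int) ||				\
	    sizeof(x) == sizeof(long))				\
		*(volatile typeof(x) *)&(x) = (val);		\
	else							\
		__ll_sc_store(&(x), (val), sizeof(x));		\
  } while (0)

Note that this only helps if every writer to the containing word goes
through it - a plain "dev->dismantle = true" somewhere else still does
the non-atomic quadword rewrite, which is the fundamental bug above.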
There are no LDL_U/STL_U instructions. The unaligned memory ops are
always 8 bytes wide (you can obviously always do the address masking
manually and "emulate" an LDL_U/STL_U model, but then you make already
bad code generation even *worse*).

So no. Even 32-bit values aren't "atomic" on alpha, because of the
complete sh*t-show that is the lack of byte ops.

NOTE NOTE NOTE! Note how I said "pass off the address of
'dev->dismantle' to something that does byte ops"? If you *know* the
alignment of the byte in a structure, so you don't just get a random
pointer to a byte, you can - and should - generate better code on
alpha, which may in fact involve just doing a 32-bit load, masking out
the target byte, and doing the 32-bit store.

So that LDQ_U/STQ_U sequence is for the generic case, with various
simpler sub-cases that don't necessarily require it.

The fact is, the original alpha is the worst architecture ever made.
The lack of byte instructions and the absolutely horrendous memory
ordering are fatal flaws. And while the memory ordering arguably had
excuses for it ("they didn't know better"), the lack of byte ops was
wilful misdesign that the designers were proud of, and made a central
tenet of their mess.

And I say that as somebody who *loved* it originally. Yes, the lack of
byte operations was always a pain, because it really made the IO
subsystem a nightmare, but I was young, I was stupid, it was
interesting, and I had bought into the kool-aid.

But alpha without BWX really is shit. People think x86 is bad. Those
people have NO CLUE.

            Linus