On Fri, 31 May 2024 at 08:48, Arnd Bergmann <arnd@xxxxxxxx> wrote:
>
>  /* Is this type a native word size -- useful for atomic operations */
>  #define __native_word(t) \
> -       (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
> -        sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
> +       (sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
>
>  #ifdef __OPTIMIZE__
>  # define __compiletime_assert(condition, msg, prefix, suffix) \
>
> The WRITE_ONCE() calls tend to be there in order to avoid
> expensive atomics or locking when something can be expressed
> with a store that is known to be visible atomically (on all
> other architectures).

No. If you go down this road, then you would want to do the same thing
we do for READ_ONCE() - but for a different reason - hook into it for
alpha, and add a memory barrier to get rid of the crazy alpha memory
ordering:

  /*
   * Alpha is apparently daft enough to reorder address-dependent loads
   * on some CPU implementations. Knock some common sense into it with
   * a memory barrier in READ_ONCE().
   *
   * For the curious, more information about this unusual reordering is
   * available in chapter 15 of the "perfbook":
   *
   *  https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
   */
  #define __READ_ONCE(x)						\
  ({									\
	__unqual_scalar_typeof(x) __x =					\
		(*(volatile typeof(__x) *)(&(x)));			\
	mb();								\
	(typeof(x))__x;							\
  })

and the solution would be to make a __WRITE_ONCE() that then uses
"sizeof()" to decide at compile-time whether it can just do it as a
regular write, or whether it needs to do it as an LL/SC loop (see the
sketch further down). Because we're definitely not changing hundreds -
probably thousands - of random generic data structures.

That said, the above fixes WRITE_ONCE() without changing the definition
of what a native word size is, but it doesn't actually *fix* the
problem. Let's take a really simple example:

  struct net_device {
	...
	u8 reg_state;
	bool dismantle;
	enum {
		RTNL_LINK_INITIALIZED,
		RTNL_LINK_INITIALIZING,
	} rtnl_link_state:16;
	...

These are all in the same 32-bit word, and we intentionally have code
without locking like this:

	WRITE_ONCE(dev->reg_state, NETREG_RELEASED);
	...
	return READ_ONCE(dev->reg_state) <= NETREG_REGISTERED;

because the code knows the state machine ordering requirements (i.e.
once it has moved past NETREG_REGISTERED, it won't move back).

So now - assuming we fix WRITE_ONCE() to use LL/SC - these READ_ONCE()
and WRITE_ONCE() games work fine on alpha.

BUT. Code that then does something like this:

	dev->dismantle = true;

which is all nice and good (accesses are done under the RTNL lock) now
will possibly race with the unlocked reg_state accesses.

So it's still fundamentally buggy.

And before you say "that's why I wanted to fix the __native_word()
definition", please realize that the above happens EVEN WITH the
READ_ONCE/WRITE_ONCE being done on an "int".

Yes, really.

The READ_ONCE and WRITE_ONCE will be individual instructions. But
lookie here, if we have

	u32 reg_state;
	bool dismantle;

and they happen to share the same 8-byte word, and somebody passes
'&dismantle' off to something that does byte writes to it, guess what
the canonical byte write sequence is?

That's right, it looks something like this (excuse any bugs, this is
from memory and looking up the ops in the architecture manual):

	LDQ_U	tmp,(addr)
	INSBL	byte,addr,tmp2
	MSKBL	tmp,addr,tmp
	OR	tmp,tmp2,tmp
	STQ_U	tmp,(addr)

and notice how in the process it read and then wrote that supposedly
atomic 'reg_state' that was otherwise accessed purely with 32-bit
atomic instructions?
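To spell out what that sequence does, here's a rough C equivalent (a
sketch only - the helper below is illustrative, not actual compiler
output or kernel code): the one-byte store turns into a non-atomic
read-modify-write of the entire containing quadword.

  /*
   * Illustrative C version of the LDQ_U/INSBL/MSKBL/OR/STQ_U sequence
   * above, for a little-endian alpha without BWX. Hypothetical helper
   * name, for exposition only.
   */
  static inline void byte_store(unsigned char *addr, unsigned char val)
  {
	unsigned long *qp = (unsigned long *)((unsigned long)addr & ~7UL);
	unsigned int shift = ((unsigned long)addr & 7) * 8;
	unsigned long tmp;

	tmp = *qp;				/* LDQ_U: load whole quadword */
	tmp &= ~(0xffUL << shift);		/* MSKBL: clear the target byte */
	tmp |= (unsigned long)val << shift;	/* INSBL + OR: insert new byte */
	*qp = tmp;				/* STQ_U: store whole quadword */
  }

Any 32-bit value that happens to live in the same quadword gets read
and rewritten as part of that sequence - which is exactly the race.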
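And to make the earlier __WRITE_ONCE() suggestion concrete, the
compile-time sizeof() dispatch would look something along these lines
(again purely a sketch - __ll_sc_store() is a made-up name for what
would really be an LDx_L/STx_C loop on the containing aligned word):

  /*
   * Sketch only: plain volatile store for int/long sized objects,
   * and a hypothetical LL/SC-based helper for anything smaller.
   */
  #define __WRITE_ONCE(x, val)					\
  do {								\
	if (sizeof(x) == sizeof(int) ||				\
	    sizeof(x) == sizeof(long))				\
		*(volatile typeof(x) *)&(x) = (val);		\
	else							\
		__ll_sc_store(&(x), (val), sizeof(x));		\
  } while (0)

Note that this only helps if every writer to the containing word goes
through it - a plain "dev->dismantle = true" somewhere else still does
the non-atomic quadword rewrite, which is the fundamental bug above.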
There are no LDL_U/STL_U instructions. The unaligned memory ops are
always 8 bytes wide (you can obviously always do the address masking
manually and "emulate" an LDL_U/STL_U model, but then you make already
bad code generation even *worse*).

So no. Even 32-bit values aren't "atomic" on alpha, because of the
complete sh*t-show that is the lack of byte ops.

NOTE NOTE NOTE! Note how I said "pass off the address of
'dev->dismantle' to something that does byte ops"? If you *know* the
alignment of the byte in a structure, so you don't just get a random
pointer to a byte, you can - and should - generate better code on
alpha, which may in fact involve just doing a 32-bit load, masking out
the target byte, and doing the 32-bit store.

So that LDQ_U/STQ_U sequence is for the generic case, with various
simpler sub-cases that don't necessarily require it.

The fact is, the original alpha is the worst architecture ever made.
The lack of byte instructions and the absolutely horrendous memory
ordering are fatal flaws. And while the memory ordering arguably had
excuses for it ("they didn't know better"), the lack of byte ops was
wilful misdesign that the designers were proud of, and made a central
tenet of their mess.

And I say that as somebody who *loved* it originally. Yes, the lack of
byte operations was always a pain, because it really made the IO
subsystem a nightmare, but I was young, I was stupid, it was
interesting, and I had bought into the kool-aid.

But alpha without BWX really is shit. People think x86 is bad. Those
people have NO CLUE.

            Linus