On Wed, 18 Sept 2024 at 13:15, Christoph Lameter (Ampere) <cl@xxxxxxxxxx> wrote:
>
> Other arches do not have acquire / release and will create additional
> barriers in the fallback implementation of smp_load_acquire. So it needs
> to be an arch config option.

Actually, I looked at a few cases, and it doesn't really seem to be true.

For example, powerpc doesn't have a "native" acquire model, but both
smp_load_acquire() and smp_rmb() end up being LWSYNC after the load
(which in the good case is a "lwsync" instruction, in bad case it's a
heavier "sync" instruction on older cores, but the point is that it's
the same thing for smp_rmb() and for smp_load_acquire()).

So on powerpc, smp_load_acquire() isn't any better than
"READ_ONCE()+smp_rmb()", but it also isn't any worse.

And at least alpha is the same - it doesn't have smp_load_acquire(),
and it falls back on a full memory barrier for that case - but that's
what smp_rmb() is too. However, because READ_ONCE() on alpha already
contains a smp_mb(), it turns out that on alpha having "READ_ONCE +
smp_rmb()" actually results in *two* barriers, while a
"smp_load_acquire()" is just one.

And obviously technically x86 doesn't have explicit acquire, but with
every load being an acquire, it's a no-op either way.

So on at least three very different architectures, smp_load_acquire()
is at least no worse than READ_ONCE() followed by a smp_rmb(). And on
alpha and arm64, it's better.

So it does look like making it conditional doesn't actually buy us
anything. We might as well just unconditionally use the
smp_load_acquire() over "READ_ONCE+smp_rmb".

Other random architectures from a quick look:

RISC-V technically turns smp_rmb() into a "fence r,r", while a
smp_load_acquire() ends up being a "fence r,rw", so technically the
fences are different. But honestly, any microarchitecture that makes
those two be different is just crazy garbage (there's never any valid
reason to move later writes up before earlier reads).

Loongarch has acquire and is better off with it.

parisc has acquire and is better off with it.

s390 and sparc64 are like x86, in that it's just a build barrier either way.

End result: let's just simplify the patch and make it entirely unconditional.

              Linus
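
For readers who want to see the two read patterns side by side, here is a
minimal userspace sketch. It is an analogy only: it uses C11 atomics and
pthreads, not the kernel's macros. A memory_order_acquire load stands in
for smp_load_acquire(), and a relaxed load followed by
atomic_thread_fence(memory_order_acquire) stands in for
"READ_ONCE()+smp_rmb()". The mapping is loose, since the C11 acquire fence
also orders later stores while smp_rmb() only orders loads. All names here
(payload, ready, writer, read_with_acquire, read_with_fence, run_once) are
made up for illustration and do not come from the patch.

```c
/* Userspace C11 analogy only; these are NOT the kernel primitives. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;		/* plain data published by the writer */
static atomic_int ready;	/* flag that guards the payload */

/* Writer: fill in the data, then publish it with a release store. */
static void *writer(void *arg)
{
	(void)arg;
	payload = 42;
	atomic_store_explicit(&ready, 1, memory_order_release);
	return NULL;
}

/* Pattern A: one acquire load, in the spirit of smp_load_acquire(). */
static int read_with_acquire(void)
{
	while (!atomic_load_explicit(&ready, memory_order_acquire))
		;	/* spin until the flag is set */
	return payload;
}

/*
 * Pattern B: relaxed load plus a separate fence, in the spirit of
 * READ_ONCE() followed by smp_rmb(). Assumption: the C11 acquire fence
 * is used as a stand-in, and it is stronger than a pure read barrier.
 */
static int read_with_fence(void)
{
	while (!atomic_load_explicit(&ready, memory_order_relaxed))
		;	/* spin until the flag is set */
	atomic_thread_fence(memory_order_acquire);
	return payload;
}

/* Hypothetical driver: run one writer and read back with either pattern. */
int run_once(int use_fence)
{
	pthread_t t;
	int rc, seen;

	/* Reset shared state before the writer thread exists. */
	payload = 0;
	atomic_store_explicit(&ready, 0, memory_order_relaxed);

	rc = pthread_create(&t, NULL, writer, NULL);
	assert(rc == 0);
	(void)rc;

	seen = use_fence ? read_with_fence() : read_with_acquire();
	pthread_join(t, NULL);
	return seen;
}
```

Both run_once(0) and run_once(1) should observe the published payload.
Which instructions each pattern compiles to depends on the compiler and
the target, so this shows the ordering guarantee, not the per-arch cost
discussed above.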