On Wed, Apr 7, 2021 at 1:36 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, Apr 07, 2021 at 10:42:50AM +0200, Arnd Bergmann wrote:
> > Since there are really only a handful of instances in the kernel
> > that use the cmpxchg() or xchg() on u8/u16 variables, it would seem
> > best to just disallow those completely
>
> Not going to happen. xchg16 is optimal for qspinlock and if we replace
> that with a cmpxchg loop on x86 we're regressing.

I did not mean to remove the possibility of doing a 16-bit
compare-exchange operation altogether, just to remove it from the
regular macros. We already do this for the 64-bit xchg64()/cmpxchg64(),
which only some of the 32-bit architectures provide, so I think having
an explicit xchg8()/xchg16()/cmpxchg8()/cmpxchg16() interface while
tightening the type checking would be more consistent here.

On any 32-bit architecture, the cmpxchg()/xchg() macros then only have
to deal with word-sized operations.
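To illustrate what I mean, a fixed-size macro could look roughly like
the sketch below; the __arch_xchg16() helper is a made-up placeholder,
not an existing interface, and the BUILD_BUG_ON() is just one way to
reject mismatched operand sizes at build time:

    /* hypothetical per-architecture helper, for illustration only */
    extern u16 __arch_xchg16(volatile u16 *ptr, u16 val);

    /* sketch only: reject anything that is not a 16-bit object */
    #define xchg16(ptr, v)                                           \
    ({                                                               \
            BUILD_BUG_ON(sizeof(*(ptr)) != 2);                       \
            (__typeof__(*(ptr)))__arch_xchg16((volatile u16 *)(ptr), \
                                              (u16)(v));             \
    })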
> > Interestingly, the s390 version using __sync_val_compare_and_swap()
> > seems to produce nice output on all architectures that have atomic
> > instructions, with any supported compiler, to the point where I think
> > we could just use that to replace most of the inline-asm versions
> > except for arm64:
> >
> >    #define cmpxchg(ptr, o, n)                                              \
> >    ({                                                                      \
> >            __typeof__(*(ptr)) __o = (o);                                   \
> >            __typeof__(*(ptr)) __n = (n);                                   \
> >            (__typeof__(*(ptr))) __sync_val_compare_and_swap((ptr),__o,__n);\
> >    })
>
> It generates the LL/SC loop, but doesn't do sensible optimizations when
> it's again used in a loop itself. That is, it generates a loop of a
> loop, just like what you'd expect, which is sub-optimal for LL/SC.

One thing it does do, though, is generate an LL/SC loop for xchg16(),
while mips, openrisc, xtensa and sparc64 currently open-code the nested
loop in their respective xchg16() wrappers. I have not seen any case in
which it produces object code that is worse than what the
architecture-specific code produces today, except for the architectures
that rely on runtime patching (i486+smp, arm64+lse).
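For illustration, the open-coded pattern in question looks roughly like
this; a sketch only, assuming little-endian byte order and the kernel's
u16/u32 types, not any particular architecture's actual code:

    /* sketch: 16-bit exchange built from a 32-bit compare-and-swap */
    static inline u16 xchg16_sketch(volatile u16 *ptr, u16 newval)
    {
            /* align down to the containing 32-bit word */
            volatile u32 *word = (volatile u32 *)((unsigned long)ptr & ~3UL);
            /* bit offset of the halfword within that word (little-endian) */
            int shift = ((unsigned long)ptr & 2) * 8;
            u32 mask = 0xffffU << shift;
            u32 old, tmp;

            do {
                    old = *word;
                    tmp = (old & ~mask) | ((u32)newval << shift);
            } while (__sync_val_compare_and_swap(word, old, tmp) != old);

            return (old & mask) >> shift;
    }

When the compare-and-swap in there is itself an LL/SC loop, this is
exactly the loop-of-a-loop shape you describe, which is what letting
the compiler generate xchg16() directly avoids.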
> > Not sure how gcc's acquire/release behavior of
> > __sync_val_compare_and_swap() relates to what the kernel wants here.
> >
> > The gcc documentation also recommends using the standard
> > __atomic_compare_exchange_n() builtin instead, which would allow
> > constructing release/acquire/relaxed versions as well, but I could
> > not get it to produce equally good output. (Possibly I was using it
> > wrong.)
>
> I'm scared to death of the C11 crap, the compiler will 'optimize' them
> when it feels like it and use the C11 memory model rules for it, which
> are not compatible with the kernel rules.
>
> But the same thing applies, it won't do the right thing for composites.

Makes sense. As I said, I could not even get it to produce optimal code
for the simple case.
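For reference, the kind of wrapper I mean is sketched below; the name
and the exact memory-order arguments are illustrative, not how any
architecture actually implements its acquire variant:

    /* sketch: cmpxchg with acquire ordering via the C11-style builtin */
    #define cmpxchg_acquire_sketch(ptr, o, n)                        \
    ({                                                               \
            __typeof__(*(ptr)) __o = (o);                            \
            /* on failure, __o is updated to the observed value */   \
            __atomic_compare_exchange_n((ptr), &__o, (n), 0,         \
                                        __ATOMIC_ACQUIRE,            \
                                        __ATOMIC_ACQUIRE);           \
            __o;                                                     \
    })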
       Arnd