On Mon, Mar 29, 2021 at 9:52 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Sat, Mar 27, 2021 at 06:06:38PM +0000, guoren@xxxxxxxxxx wrote:
> > From: Guo Ren <guoren@xxxxxxxxxxxxxxxxx>
> >
> > Some architectures don't have sub-word swap atomic instructions,
> > they only have the full-word one.
> >
> > The sub-word swap only improves performance when NR_CPUS < 16K:
> >  * 0- 7: locked byte
> >  *    8: pending
> >  * 9-15: not used
> >  * 16-17: tail index
> >  * 18-31: tail cpu (+1)
> >
> > Bits 9-15 are wasted in order to use xchg16 in xchg_tail.
> >
> > Please let the architecture select xchg16/xchg32 to implement
> > xchg_tail.
>
> So I really don't like this, this pushes complexity into the generic
> code for something that's really not needed.
>
> Lots of RISC already implement sub-word atomics using word ll/sc.
> Obviously they're not sharing code like they should be :/ See for
> example arch/mips/kernel/cmpxchg.c.

That is what the previous version of the patch set did, right? I think
this v4 is nicer because the code is already there in qspinlock.c and
just gets moved around, and the implementation is likely more efficient
this way. The mips version could be made more generic, but it is also
less efficient than a simple xchg, since it requires an indirect
function call plus nesting a pair of loops in place of the single
ll/sc loop in the 32-bit xchg.

I think the weakly typed xchg()/cmpxchg() interface causes more
problems than it solves, and we'd be better off using a strongly typed
version in general, with the 8-bit and 16-bit exchanges using separate
helpers in the same way that the fixed-length cmpxchg64() is already
separate; there are only a couple of instances of each of these in the
kernel. Unfortunately, there is roughly a 50:50 split between fixed
32-bit and long/pointer-sized xchg/cmpxchg users in the kernel, so
making the interface completely fixed-type would add a ton of churn.
I created an experimental patch for this, but it's probably not worth it.

> Also, I really do think doing ticket locks first is a far more sensible
> step.

I think this is the important point to look at first. From what I can
tell, most users of ticket spinlocks have moved to qspinlock over time,
but I'm not sure about the exact tradeoff, in particular on large
machines without a 16-bit xchg operation.

FWIW, the implementations in the other SMP-capable architectures are:

  alpha     simple
  arc       simple+custom
  arm       ticket
  arm64     qspinlock (formerly ticket)
  csky      ticket/qrwlock (formerly simple+ticket/qrwlock)
  hexagon   simple
  ia64      ticket
  mips      qspinlock (formerly ticket)
  openrisc  qspinlock
  parisc    custom
  powerpc   simple+qspinlock (formerly simple)
  riscv     simple
  s390      custom-queue
  sh        simple
  sparc     qspinlock
  x86       qspinlock (formerly ticket+qspinlock)
  xtensa    qspinlock

32-bit Arm is the only relevant user of ticket locks these days, but
its hardware has a practical limit of 16 CPUs and four nodes, with most
implementations having a single CPU cluster of at most four cores. We
had the same discussion about xchg16() when Sebastian submitted the
arm32 qspinlock implementation:

https://lore.kernel.org/linux-arm-kernel/20191007214439.27891-1-sebastian@xxxxxxxxxxxxx/

       Arnd