On Fri, Jun 17, 2022 at 4:57 PM Huacai Chen <chenhuacai@xxxxxxxxxxx> wrote:
>
> On NUMA systems, the performance of qspinlock is better than generic
> spinlock. Below are the UnixBench test results on an 8-node (4 cores
> per node, 32 cores in total) machine.
>

The performance increase is nice, but that is only half the story we need
here. I think the more important part is how you can guarantee that the
xchg16() implementation is correct and always allows forward progress.

> @@ -123,6 +123,10 @@ static inline unsigned long __percpu_xchg(void *ptr, unsigned long val,
>  		int size)
>  {
>  	switch (size) {
> +	case 1:
> +	case 2:
> +		return __xchg_small((volatile void *)ptr, val, size);
> +

Do you actually need size 1 as well?

Generally speaking, I would like to rework the xchg()/cmpxchg() logic to
only cover the 32-bit and word-sized (possibly 64-bit) cases, while
having separate optional 8-bit and 16-bit functions. I had a patch for
this in the past and can try to dig it out; this may be the time to
finally do that.

I see that the qspinlock code actually calls a 'relaxed' version of
xchg16(), but you only implement the one with the full barrier. Is it
possible to directly provide a relaxed version that has something less
than the __WEAK_LLSC_MB?

        Arnd
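For context on why the xchg16() correctness question matters: on architectures without a native sub-word exchange, a 16-bit xchg is typically emulated with a compare-and-swap loop on the containing aligned 32-bit word, which is exactly where forward-progress guarantees need scrutiny (the CAS can in principle fail forever under contention). The following is a hedged user-space sketch of that emulation pattern using GCC's `__atomic` builtins; it is not the kernel's `__xchg_small()`, the function name is made up, and it assumes little-endian byte order. A relaxed variant would pass `__ATOMIC_RELAXED` instead of `__ATOMIC_SEQ_CST` on success.

```c
#include <stdint.h>
#include <assert.h>

/*
 * Hypothetical sketch (NOT the kernel implementation): emulate a
 * 16-bit atomic exchange with a 32-bit compare-and-swap on the
 * containing naturally aligned word. Assumes little-endian layout
 * when computing the shift. Note the retry loop: on LL/SC machines
 * this is where forward progress is not automatically guaranteed.
 */
static uint16_t xchg16_emulated(volatile uint16_t *ptr, uint16_t val)
{
	uintptr_t addr = (uintptr_t)ptr;
	volatile uint32_t *word = (volatile uint32_t *)(addr & ~(uintptr_t)3);
	unsigned int shift = (unsigned int)(addr & 2) * 8;	/* 0 or 16 (LE) */
	uint32_t mask = (uint32_t)0xffff << shift;
	uint32_t old, newv;

	old = *word;
	do {
		/* Splice the new halfword into the untouched half. */
		newv = (old & ~mask) | ((uint32_t)val << shift);
		/* On failure, 'old' is reloaded and we retry. */
	} while (!__atomic_compare_exchange_n(word, &old, newv, 1,
					      __ATOMIC_SEQ_CST,
					      __ATOMIC_RELAXED));
	return (uint16_t)((old & mask) >> shift);
}
```

The point of showing the loop explicitly is that a native 16-bit exchange is a single unconditional operation, while this emulation can spin; whether the architecture bounds that spinning is the forward-progress question raised above.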