On Sat, Jun 18, 2022 at 12:11 AM Arnd Bergmann <arnd@xxxxxxxx> wrote:
>
> On Fri, Jun 17, 2022 at 4:57 PM Huacai Chen <chenhuacai@xxxxxxxxxxx> wrote:
> >
> > On NUMA systems, the performance of qspinlock is better than that of
> > the generic spinlock. Below are the UnixBench test results on an
> > 8-node machine (4 cores per node, 32 cores in total).
>
> The performance increase is nice, but this is only half the story we
> need here. I think the more important bit is how you can guarantee that
> the xchg16() implementation is correct and always allows forward
> progress.
>
> > @@ -123,6 +123,10 @@ static inline unsigned long __percpu_xchg(void *ptr, unsigned long val,
> >                                            int size)
> >  {
> >         switch (size) {
> > +       case 1:
> > +       case 2:
> > +               return __xchg_small((volatile void *)ptr, val, size);
> > +
>
> Do you actually need size 1 as well?
>
> Generally speaking, I would like to rework the xchg()/cmpxchg() logic
> to only cover the 32-bit and word-sized (possibly 64-bit) cases, while
> having separate optional 8-bit and 16-bit functions. I had a patch for
> this in the past and can try to dig it out; this may be the time to
> finally do that.
>
> I see that the qspinlock code actually calls a 'relaxed' version of
> xchg16(), but you only implement the one with the full barrier. Is it
> possible to directly provide a relaxed version that has something less
> than the __WEAK_LLSC_MB?

There is no __WEAK_LLSC_MB in __xchg_small; it is full fence +
xchg16_relaxed() + full fence (as hev explained). The __cmpxchg_small
isn't related to qspinlock, so we could drop it from this patch.

>
> Arnd

--
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/
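
P.S. For context on the technique under discussion: a sub-word xchg is
usually emulated by a compare-and-swap loop on the aligned 32-bit word
that contains the halfword. The sketch below is illustrative only; the
function name xchg16_emulated, the little-endian shift computation, and
the use of the compiler's __atomic builtins are assumptions made for
the example, not the actual LoongArch __xchg_small code.

#include <stdint.h>

/*
 * Hedged sketch: exchange a 16-bit value by looping a CAS on the
 * aligned 32-bit word containing it. The seq_cst ordering stands in
 * for the "full fence + relaxed op + full fence" variant discussed
 * above; a relaxed xchg16 would use __ATOMIC_RELAXED throughout.
 */
static uint16_t xchg16_emulated(volatile uint16_t *ptr, uint16_t newval)
{
	/* The aligned 32-bit word that contains *ptr. */
	volatile uint32_t *word =
		(volatile uint32_t *)((uintptr_t)ptr & ~(uintptr_t)3);
	/* Bit offset of the halfword within that word (little-endian). */
	unsigned int shift = ((uintptr_t)ptr & 2) * 8;
	uint32_t mask = (uint32_t)0xffff << shift;
	uint32_t old, repl;

	old = __atomic_load_n(word, __ATOMIC_RELAXED);
	do {
		/* Splice the new halfword in, leaving neighbor bits alone. */
		repl = (old & ~mask) | ((uint32_t)newval << shift);
	} while (!__atomic_compare_exchange_n(word, &old, repl,
					      1 /* weak */,
					      __ATOMIC_SEQ_CST,
					      __ATOMIC_RELAXED));

	return (uint16_t)((old & mask) >> shift);
}

Note that a loop like this only guarantees forward progress to the
extent the underlying 32-bit atomic (LL/SC on LoongArch) does, which is
the substance of the forward-progress question raised above.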