Hi, Arnd,

On Fri, Jul 23, 2021 at 7:43 PM Arnd Bergmann <arnd@xxxxxxxx> wrote:
>
> On Fri, Jul 23, 2021 at 12:41 PM Huacai Chen <chenhuacai@xxxxxxxxx> wrote:
> > On Tue, Jul 6, 2021 at 6:22 PM Arnd Bergmann <arnd@xxxxxxxx> wrote:
> > > On Tue, Jul 6, 2021 at 6:18 AM Huacai Chen <chenhuacai@xxxxxxxxxxx> wrote:
> > > > +static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
> > > > +                                      unsigned long new, unsigned int size)
> > > > +{
> > > > +       switch (size) {
> > > > +       case 1:
> > > > +       case 2:
> > > > +               return __cmpxchg_small(ptr, old, new, size);
> > >
> > > Same here.
> >
> > 16-bit cmpxchg is used by qspinlock. Yes, you suggest we should not use
> > qspinlock, but our test results show that the ticket spinlock is even
> > worse... So, we want to keep cmpxchg_small() and qspinlock.
>
> Can you explain in detail how that cmpxchg loop provides the necessary
> forward progress guarantees to make qspinlock reliable?
>
> As Peter keeps pointing out, usually this is not actually the case, so
> faster-but-broken is not a replacement for a simple working version.
>
> If you already have the ticket spinlock implementation, maybe start out
> by using that for the initial submission, and then provide an extra
> patch to convert it to qspinlock as a follow-up that can be debated
> in more detail regarding correctness and performance.

After reading the existing qspinlock topics on the mailing list, I had
an offline discussion with Jiaxun Yang and Rui Wang. We now think that
if we use the _Q_PENDING_BITS=1 definition, those archs without
sub-word xchg/cmpxchg can use qspinlock as well. Please see my RFC
patches:
https://lore.kernel.org/linux-arch/20210724123617.3525377-1-chenhuacai@xxxxxxxxxxx/T/#t

Huacai
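(For illustration, a minimal user-space sketch of the _Q_PENDING_BITS=1
idea: the tail exchange that would otherwise want a 16-bit xchg is built
from a full-word cmpxchg loop instead, so no sub-word atomic primitive
is needed. The bit layout and the xchg_tail32() helper below are
simplified stand-ins for the kernel's qspinlock internals, not the code
from the RFC patches.)

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define TAIL_SHIFT 16                        /* tail lives in the upper half-word */
#define TAIL_MASK  (~((1u << TAIL_SHIFT) - 1))

/* Replace the tail bits while keeping the locked/pending bits, using
 * only a 32-bit compare-and-swap; returns the previous lock word. */
static uint32_t xchg_tail32(_Atomic uint32_t *lock, uint32_t tail)
{
        uint32_t old = atomic_load_explicit(lock, memory_order_relaxed);
        uint32_t new;

        do {
                new = (old & ~TAIL_MASK) | (tail << TAIL_SHIFT);
        } while (!atomic_compare_exchange_weak_explicit(lock, &old, new,
                        memory_order_relaxed, memory_order_relaxed));

        return old;
}

int main(void)
{
        _Atomic uint32_t lock = 0x00000101;      /* locked + pending bits set */
        uint32_t prev = xchg_tail32(&lock, 4);   /* enqueue a waiter tail */

        printf("prev=%08x now=%08x\n", (unsigned)prev,
               (unsigned)atomic_load(&lock));
        return 0;
}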
> > > > +static inline void __iomem *ioremap_prot(phys_addr_t offset, unsigned long size,
> > > > +                                         unsigned long prot_val)
> > > > +{
> > > > +       /* This only works for !HIGHMEM currently */
> > >
> > > Do you support highmem? I would expect new architectures to no longer
> > > implement that. Just use a 64-bit kernel on systems with lots of RAM.
> >
> > Emmm, a 64-bit kernel doesn't need highmem.
>
> Yes, that was my point: 32-bit user space is fine if you care a lot about
> size constraints of DDR2 or older memory. For any system that has
> enough memory to require highmem, it is better to just pick a 64-bit
> kernel to start with, if the CPU allows it.
>
> > > > +#define ioremap(offset, size)                          \
> > > > +       ioremap_prot((offset), (size), _CACHE_SUC)
> > > > +#define ioremap_uc ioremap
> > >
> > > Remove ioremap_uc(), it should never be called here.
> >
> > It is used by lib/devres.c.
>
> There is a default implementation in include/asm-generic/io.h
> that just returns NULL here.
>
> > > > +#define __BUILD_MEMORY_SINGLE(pfx, bwlq, type)                 \
> > > > +                                                               \
> > > > +static inline void pfx##write##bwlq(type val,                  \
> > > > +                                   volatile void __iomem *mem) \
> > > > +{                                                              \
> > >
> > > Please don't add another copy of these macros. Use the version from
> > > include/asm-generic, or modify it as needed if it doesn't quite work.
> >
> > On the Loongson platform, we need to put a wmb() before each MMIO write.
> > The generic writew()/writel()/outw()/outl() have wmb(), but the __raw
> > versions don't. I want to know the design goal of the __raw versions:
> > are they meant for scenarios where ordering does not matter?
>
> The __raw versions are mainly meant for accessing memory from
> platform-specific driver code. They don't provide any particular
> endianness or ordering guarantees and generally cannot be used
> in portable drivers.
>
> Note that a full wmb() should not be needed: you only have to serialize
> between prior memory accesses and a DMA triggered by the device,
> but not between multiple CPUs here.
>
>        Arnd
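(Sketch of the direction Arnd suggests: rather than duplicating
__BUILD_MEMORY_SINGLE, an architecture can override the barrier hooks
that include/asm-generic/io.h wraps around the __raw accessors, so the
generated writeb/w/l() get the ordering in front of the store. The
dbar(0) below is only a placeholder for whatever lighter-than-wmb()
instruction orders prior memory accesses against an MMIO write on this
platform; this is not LoongArch's actual io.h.)

/* arch/xxx/include/asm/io.h -- illustrative only */

/* Runs before the MMIO store in the generic writeb/w/l/q(); it only
 * has to order prior memory accesses against the device (for DMA),
 * not CPU against CPU. */
#define __io_bw()       dbar(0)

/* asm-generic/io.h only defines __io_bw() if the arch didn't, so this
 * include picks up the override and builds the accessors around it. */
#include <asm-generic/io.h>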