Hi, Ren,

> -----Original Messages-----
> From: "Guo Ren" <guoren@xxxxxxxxxx>
> Sent Time: 2021-07-27 09:07:44 (Tuesday)
> To: "Boqun Feng" <boqun.feng@xxxxxxxxx>
> Cc: "Huacai Chen" <chenhuacai@xxxxxxxxx>, "Geert Uytterhoeven" <geert@xxxxxxxxxxxxxx>, "Huacai Chen" <chenhuacai@xxxxxxxxxxx>, "Peter Zijlstra" <peterz@xxxxxxxxxxxxx>, "Ingo Molnar" <mingo@xxxxxxxxxx>, "Will Deacon" <will@xxxxxxxxxx>, "Arnd Bergmann" <arnd@xxxxxxxx>, "Waiman Long" <longman@xxxxxxxxxx>, Linux-Arch <linux-arch@xxxxxxxxxxxxxxx>, "Rui Wang" <wangrui@xxxxxxxxxxx>, "Xuefeng Li" <lixuefeng@xxxxxxxxxxx>, "Jiaxun Yang" <jiaxun.yang@xxxxxxxxxxx>
> Subject: Re: [PATCH RFC 1/2] arch: Introduce ARCH_HAS_HW_XCHG_SMALL
>
> On Tue, Jul 27, 2021 at 1:03 AM Boqun Feng <boqun.feng@xxxxxxxxx> wrote:
> >
> > On Tue, Jul 27, 2021 at 12:41:34AM +0800, Guo Ren wrote:
> > > On Mon, Jul 26, 2021 at 6:39 PM Boqun Feng <boqun.feng@xxxxxxxxx> wrote:
> > > >
> > > > On Mon, Jul 26, 2021 at 04:56:49PM +0800, Huacai Chen wrote:
> > > > > Hi, Geert,
> > > > >
> > > > > On Mon, Jul 26, 2021 at 4:36 PM Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > Hi Huacai,
> > > > > >
> > > > > > On Sat, Jul 24, 2021 at 2:36 PM Huacai Chen <chenhuacai@xxxxxxxxxxx> wrote:
> > > > > > > Introduce a new Kconfig option ARCH_HAS_HW_XCHG_SMALL, which means arch
> > > > > > > has hardware sub-word xchg/cmpxchg support. This option will be used as
> > > > > > > an indicator to select the bit-field definition in the qspinlock data
> > > > > > > structure.
> > > > > > >
> > > > > > > Signed-off-by: Huacai Chen <chenhuacai@xxxxxxxxxxx>
> > > > > >
> > > > > > Thanks for your patch!
> > > > > >
> > > > > > > --- a/arch/Kconfig
> > > > > > > +++ b/arch/Kconfig
> > > > > > > @@ -228,6 +228,10 @@ config ARCH_HAS_FORTIFY_SOURCE
> > > > > > >           An architecture should select this when it can successfully
> > > > > > >           build and run with CONFIG_FORTIFY_SOURCE.
> > > > > > >
> > > > > > > +# Select if arch has hardware sub-word xchg/cmpxchg support
> > > > > > > +config ARCH_HAS_HW_XCHG_SMALL
> > > > > >
> > > > > > What do you mean by "hardware"?
> > > > > > Does a software fallback count?
> > > > > This new option is supposed as an indicator to select bit-field
> > > > > definition of qspinlock, software fallback is not helpful in this
> > > > > case.
> > > > >
> > > >
> > > > I don't think this is true. IIUC, the rationale of the config is that
> > > > for some architectures, since the architectural cmpxchg doesn't provide
> > > > forward-progress guarantee then using cmpxchg of machine-word to
> > > > implement xchg{8,16}() may cause livelock, therefore these architectures
> > > > don't want to provide xchg{8,16}(), as a result they cannot work with
> > > > qspinlock when _Q_PENDING_BITS is 8.
> > > >
> > > > So as long as an architecture can provide and has already provided an
> > > > implementation of xchg{8,16}() which guarantee forward-progress (even
> > > > though the implementation is using a machine-word size cmpxchg), the
> > > > architecture doesn't need to select ARCH_HAS_HW_XCHG_SMALL.
> > > Seems only atomic could provide forward progress, isn't it? And lr/sc
> > > couldn't implement xchg/cmpxchg primitive properly.
> > >
> >
> > I'm missing you point here, a) ll/sc can provide forward progress and b)
> > ll/sc instructions are used to implement xchg/cmpxchg (see ARM64 and
> > PPC).
> I don't think arm64 could provide fwd guarantee with ll/sc, otherwise,
> they wouldn't add ARM64_HAS_LSE_ATOMICS for large systems.
>
> >
> > > How to make CPU "load + cmpxchg" forward-progress? Fusion
> > > these instructions and lock the snoop channel?
> > > Maybe hardware guys would think that it's easier to implement cas +
> > > dcas + amo(short & byte).
> > >
> > >
> > Please note that if _Q_PENDING_BITS == 1, then the xchg_tail() is
> > implemented as a "load + cmpxchg", so if "load + cmpxchg" implementation
> > of xchg16() doesn't provide forward-progress in an architecture, neither
> > does xchg_tail().
> That's the problem of "_Q_PENDING_BITS == 1", no hardware could
> provide "load + ALU + cas" fwd guarantee!
>
> A simple example, atomic a++:
> c = READ_ONCE(g_value);
> new = c + 1;
> while ((old = cmpxchg(&g_value, c, new)) != c) {
>         c = old;
>         new = c + 1;
> }
>
> Q: When it runs on CPU0(500Mhz) & CPU1(2Ghz) in one SMP, how do we
> prevent CPU1 from starving CPU0?
> A: I think the answer is using AMO-add instruction:
> atomic_add(1, &g_value);
> (If the arch hasn't atomic instructions and using cmpxchg or lr/sc
> implement atomic, it's the same problem.)
>

I think forward progress is guaranteed as long as all operations are
atomic (ll/sc or amo). If ll/sc runs on a fast CPU, there will be random
delays; is that okay? Otherwise, on such hardware we couldn't even
implement a generic spinlock with ll/sc. I also think the hardware
supports a normal store for unlocking (e.g. arch_spin_unlock).

In qspinlock, _Q_PENDING_BITS == 1 works on all hardware, because
clear_pending/clear_pending_set_locked are all atomic operations.
Isn't it?

Q: Why does live lock happen when _Q_PENDING_BITS == 8?
A: One case I found:

 * CPU A updates a sub-word of the qspinlock at high frequency with a
   normal store.
 * CPU B does xchg_tail with load + cmpxchg, and the value returned by
   the load never equals the value returned by the ll (in cmpxchg).

qspinlock: 0: locked 1: pending 2: tail

      CPU A                       CPU B
1:                          1:                       <--------------------+
  sh  $newval, &locked        lw  $v1, &qspinlock                         |
  add $newval, 1              and $t1, $v1, ~mask                         |
  b   1b                      or  $t1, $t1, newval                        | (live lock path)
                              ll  $v2, &qspinlock                         |
                              bne $v1, $v2, 1b       ---------------------+
                              sc  $t1, &qspinlock
                              beq $t1, 0, 1b

If xchg_tail is implemented like this, at least there is no live lock on
Loongson:

xchg_tail:
1:
  ll  $v1, &qspinlock
  and $t1, $v1, ~mask
  or  $t1, $t1, newval
  sc  $t1, &qspinlock
  beq $t1, 0, 1b

For hardware whose ll/sc is based on cache coherency, I think sc succeeds
easily. The ll makes the cache line exclusive to CPU B, and CPU A's store
has to acquire exclusive ownership again, so the sc may complete before
that happens.

> >
> > Regards,
> > Boqun
> >
> > > >
> > > > Regards,
> > > > Boqun
> > > >
> > > > > >
> > > > > > > --- a/arch/m68k/Kconfig
> > > > > > > +++ b/arch/m68k/Kconfig
> > > > > > > @@ -5,6 +5,7 @@ config M68K
> > > > > > >         select ARCH_32BIT_OFF_T
> > > > > > >         select ARCH_HAS_BINFMT_FLAT
> > > > > > >         select ARCH_HAS_DMA_PREP_COHERENT if HAS_DMA && MMU && !COLDFIRE
> > > > > > > +       select ARCH_HAS_HW_XCHG_SMALL
> > > > > >
> > > > > > M68k CPUs which support the CAS (Compare And Set) instruction do
> > > > > > support this on 8-bit, 16-bit, and 32-bit quantities.
> > > > > > M68k CPUs which lack CAS use a software implementation, which
> > > > > > supports the same quantities.
> > > > > >
> > > > > > As CAS is used only if CONFIG_RMW_INSNS=y, perhaps this needs
> > > > > > a dependency?
> > > > > OK, I think this dependency is needed.
> > > > >
> > > > > Huacai
> > > > >
> > > > >
> > > > > >     select ARCH_HAS_HW_XCHG_SMALL if RMW_INSNS
> > > > > >
> > > > > > Gr{oetje,eeting}s,
> > > > > >
> > > > > >                         Geert
> > > > > >
> > > > > > --
> > > > > > Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@xxxxxxxxxxxxxx
> > > > > >
> > > > > > In personal conversations with technical people, I call myself a hacker.
> > > > > > But when I'm talking to journalists I just say "programmer" or something like that.
> > > > > >                                 -- Linus Torvalds
> > > > > >
> > >
> > > --
> > > Best Regards
> > >  Guo Ren
> > >
> > > ML: https://lore.kernel.org/linux-csky/
> >
>
> --
> Best Regards
>  Guo Ren
>
> ML: https://lore.kernel.org/linux-csky/
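
For reference, here is a minimal user-space sketch of the "load + cmpxchg"
style of xchg_tail discussed in this thread, written with C11 atomics. The
16-bit tail position, the TAIL_SHIFT/TAIL_MASK macros and the name
xchg_tail_cas are assumptions made for illustration only; this is not the
kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout for illustration: the tail is the upper 16 bits. */
#define TAIL_SHIFT 16
#define TAIL_MASK  (0xffffu << TAIL_SHIFT)

/*
 * Exchange only the tail field of a 32-bit lock word with a word-sized
 * compare-and-swap loop ("load + cmpxchg"). Returns the previous tail.
 */
static uint32_t xchg_tail_cas(_Atomic uint32_t *lock, uint32_t new_tail)
{
    uint32_t old, val;

    assert(new_tail <= 0xffffu);   /* must fit in the 16-bit tail field */

    old = atomic_load_explicit(lock, memory_order_relaxed);
    do {
        /* Keep the low bits (locked/pending), replace only the tail. */
        val = (old & ~TAIL_MASK) | (new_tail << TAIL_SHIFT);
        /* On failure, 'old' is reloaded with the current lock word. */
    } while (!atomic_compare_exchange_weak_explicit(lock, &old, val,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));

    return (old & TAIL_MASK) >> TAIL_SHIFT;
}
```

Every failed compare-exchange restarts from a fresh value of the whole
word, so a CPU that keeps storing to another sub-word of the same lock can
force this loop to retry. That is the retry path the diagram above is
about.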