Re: Re: [PATCH RFC 1/2] arch: Introduce ARCH_HAS_HW_XCHG_SMALL

"Wang Rui" <wangrui@xxxxxxxxxxx> · Tue, 27 Jul 2021 09:52:26 +0800 (GMT+08:00)

Hi, Ren,

> -----Original Messages-----
> From: "Guo Ren" <guoren@xxxxxxxxxx>
> Sent Time: 2021-07-27 09:07:44 (Tuesday)
> To: "Boqun Feng" <boqun.feng@xxxxxxxxx>
> Cc: "Huacai Chen" <chenhuacai@xxxxxxxxx>, "Geert Uytterhoeven" <geert@xxxxxxxxxxxxxx>, "Huacai Chen" <chenhuacai@xxxxxxxxxxx>, "Peter Zijlstra" <peterz@xxxxxxxxxxxxx>, "Ingo Molnar" <mingo@xxxxxxxxxx>, "Will Deacon" <will@xxxxxxxxxx>, "Arnd Bergmann" <arnd@xxxxxxxx>, "Waiman Long" <longman@xxxxxxxxxx>, Linux-Arch <linux-arch@xxxxxxxxxxxxxxx>, "Rui Wang" <wangrui@xxxxxxxxxxx>, "Xuefeng Li" <lixuefeng@xxxxxxxxxxx>, "Jiaxun Yang" <jiaxun.yang@xxxxxxxxxxx>
> Subject: Re: [PATCH RFC 1/2] arch: Introduce ARCH_HAS_HW_XCHG_SMALL
> 
> On Tue, Jul 27, 2021 at 1:03 AM Boqun Feng <boqun.feng@xxxxxxxxx> wrote:
> >
> > On Tue, Jul 27, 2021 at 12:41:34AM +0800, Guo Ren wrote:
> > > On Mon, Jul 26, 2021 at 6:39 PM Boqun Feng <boqun.feng@xxxxxxxxx> wrote:
> > > >
> > > > On Mon, Jul 26, 2021 at 04:56:49PM +0800, Huacai Chen wrote:
> > > > > Hi, Geert,
> > > > >
> > > > > On Mon, Jul 26, 2021 at 4:36 PM Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > Hi Huacai,
> > > > > >
> > > > > > On Sat, Jul 24, 2021 at 2:36 PM Huacai Chen <chenhuacai@xxxxxxxxxxx> wrote:
> > > > > > > Introduce a new Kconfig option ARCH_HAS_HW_XCHG_SMALL, which means arch
> > > > > > > has hardware sub-word xchg/cmpxchg support. This option will be used as
> > > > > > > an indicator to select the bit-field definition in the qspinlock data
> > > > > > > structure.
> > > > > > >
> > > > > > > Signed-off-by: Huacai Chen <chenhuacai@xxxxxxxxxxx>
> > > > > >
> > > > > > Thanks for your patch!
> > > > > >
> > > > > > > --- a/arch/Kconfig
> > > > > > > +++ b/arch/Kconfig
> > > > > > > @@ -228,6 +228,10 @@ config ARCH_HAS_FORTIFY_SOURCE
> > > > > > >           An architecture should select this when it can successfully
> > > > > > >           build and run with CONFIG_FORTIFY_SOURCE.
> > > > > > >
> > > > > > > +# Select if arch has hardware sub-word xchg/cmpxchg support
> > > > > > > +config ARCH_HAS_HW_XCHG_SMALL
> > > > > >
> > > > > > What do you mean by "hardware"?
> > > > > > Does a software fallback count?
> > > > > This new option is meant as an indicator for selecting the bit-field
> > > > > definition of qspinlock; a software fallback is not helpful in this
> > > > > case.
> > > > >
> > > >
> > > > I don't think this is true. IIUC, the rationale of the config is that
> > > > for some architectures, since the architectural cmpxchg doesn't provide
> > > > a forward-progress guarantee, using a machine-word cmpxchg to
> > > > implement xchg{8,16}() may cause livelock; therefore these architectures
> > > > don't want to provide xchg{8,16}(), and as a result they cannot work with
> > > > qspinlock when _Q_PENDING_BITS is 8.
> > > >
> > > > So as long as an architecture can provide and has already provided an
> > > > implementation of xchg{8,16}() which guarantees forward progress (even
> > > > though the implementation is using a machine-word size cmpxchg), the
> > > > architecture doesn't need to select ARCH_HAS_HW_XCHG_SMALL.
> > > It seems only atomic instructions could provide forward progress, doesn't
> > > it? And lr/sc couldn't implement the xchg/cmpxchg primitives properly.
> > >
> >
> > I'm missing your point here: a) ll/sc can provide forward progress and b)
> > ll/sc instructions are used to implement xchg/cmpxchg (see ARM64 and
> > PPC).
> I don't think arm64 could provide a forward-progress guarantee with ll/sc;
> otherwise, they wouldn't have added ARM64_HAS_LSE_ATOMICS for large systems.
> 
> >
> > > How can a CPU guarantee forward progress for "load + cmpxchg"? Fuse
> > > these instructions and lock the snoop channel?
> > > Maybe hardware guys would think that it's easier to implement cas +
> > > dcas + amo(short & byte).
> > >
> >
> > Please note that if _Q_PENDING_BITS == 1, then the xchg_tail() is
> > implemented as a "load + cmpxchg", so if "load + cmpxchg" implementation
> > of xchg16() doesn't provide forward-progress in an architecture, neither
> > does xchg_tail().
> That's the problem with "_Q_PENDING_BITS == 1": no hardware could
> provide a forward-progress guarantee for "load + ALU + cas"!
> 
> A simple example, atomic a++:
> c = READ_ONCE(g_value);
> new = c + 1;
> while ((old = cmpxchg(&g_value, c, new)) != c) {
>     c = old;
>     new = c + 1;
> }
> 
> Q: When it runs on CPU0 (500 MHz) and CPU1 (2 GHz) in one SMP system, how
> do we prevent CPU1 from starving CPU0?
> A: I think the answer is using an AMO-add instruction:
> atomic_add(1, &g_value);
> (If the arch has no atomic instructions and uses cmpxchg or lr/sc to
> implement atomics, it has the same problem.)
> 

I think forward progress is guaranteed as long as all operations are atomic (ll/sc or AMO). If ll/sc runs on a fast CPU, there will be random delays; is that okay? Otherwise, on such hardware, we could not even implement a generic spinlock with ll/sc.

I also assume that the hardware supports a normal store for unlocking (e.g. arch_spin_unlock).

In qspinlock, the _Q_PENDING_BITS == 1 variant works on all hardware, because clear_pending and clear_pending_set_locked are both atomic operations. Isn't that right?
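
As a rough illustration of that point, here is a minimal user-space sketch using C11 atomics rather than the kernel's atomic_t API. It assumes the _Q_PENDING_BITS == 1 layout with the locked byte in bits 0-7 and the pending bit at bit 8. The function names and constants are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout for _Q_PENDING_BITS == 1: bits 0-7 locked, bit 8 pending. */
#define Q_LOCKED_VAL   (1u << 0)
#define Q_PENDING_VAL  (1u << 8)

/* Clear the pending bit with a single atomic read-modify-write (and-not). */
static void clear_pending_sketch(_Atomic uint32_t *val)
{
    atomic_fetch_and_explicit(val, ~Q_PENDING_VAL, memory_order_relaxed);
}

/*
 * Move from "pending set, locked clear" to "pending clear, locked set"
 * with a single atomic add of (LOCKED - PENDING); unsigned wrap-around
 * makes this equivalent to subtracting PENDING and adding LOCKED.
 */
static void clear_pending_set_locked_sketch(_Atomic uint32_t *val)
{
    atomic_fetch_add_explicit(val, Q_LOCKED_VAL - Q_PENDING_VAL,
                              memory_order_relaxed);
}
```

Because each helper is one whole-word atomic operation, no other CPU can modify part of the word with a plain sub-word store in between.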

Q: Why does livelock happen when _Q_PENDING_BITS == 8?
A: One case I found is:

* CPU A updates a sub-word of the qspinlock at high frequency with a normal store.
* CPU B does xchg_tail with load + cmpxchg, and the value it loaded never equals the value read by the ll inside cmpxchg.

qspinlock:
  0: locked
  1: pending
  2: tail

CPU A                    CPU B
1:                       1: <--------------------+
  sh $newval, &locked      lw  $v1, &qspinlock   |
  add $newval, 1           and $t1, $v1, ~mask   |
  b 1b                     or  $t1, $t1, newval  | (live lock path)
                           ll  $v2, &qspinlock   |
                           bne $v1, $v2, 1b -----+
                           sc  $t1, &qspinlock
                           beq $t1, 0, 1b
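
The CPU B side of the diagram above can be sketched in C. This is a minimal sketch, assuming C11 atomics stand in for the kernel's cmpxchg and that the tail occupies the upper 16 bits of the lock word (the _Q_PENDING_BITS == 8 layout). The name xchg_tail_sketch and the TAIL_MASK constant are hypothetical, not kernel code.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: tail in bits 16-31, locked/pending bytes in bits 0-15. */
#define TAIL_MASK 0xffff0000u

/*
 * Emulate a 16-bit exchange of the tail with a full-word load + cmpxchg.
 * `tail` is expected to be already shifted into bits 16-31.
 * Returns the whole lock word observed just before the update.
 */
static uint32_t xchg_tail_sketch(_Atomic uint32_t *lock, uint32_t tail)
{
    uint32_t old = atomic_load_explicit(lock, memory_order_relaxed);
    uint32_t newval;

    do {
        /* Keep the low half-word as loaded, replace only the tail. */
        newval = (old & ~TAIL_MASK) | tail;
        /* On failure, `old` is refreshed with the current word and we retry. */
    } while (!atomic_compare_exchange_weak_explicit(lock, &old, newval,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
    return old;
}
```

Any store CPU A makes to the low half-word between the load and the compare-exchange makes the comparison fail, so the number of retries in this loop has no upper bound.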

If xchg_tail is instead implemented as a single ll/sc loop, as follows, at least there is no livelock on Loongson:

xchg_tail:

1:
  ll  $v1, &qspinlock
  and $t1, $v1, ~mask
  or  $t1, $t1, newval
  sc  $t1, &qspinlock
  beq $t1, 0, 1b

For hardware where ll/sc is based on cache coherency, I think the sc succeeds easily: the ll makes the cache line exclusive to CPU B, and CPU A's store then has to acquire exclusive ownership again, so the sc may complete before that happens.

> >
> > Regards,
> > Boqun
> >
> > > >
> > > > Regards,
> > > > Boqun
> > > >
> > > > > >
> > > > > > > --- a/arch/m68k/Kconfig
> > > > > > > +++ b/arch/m68k/Kconfig
> > > > > > > @@ -5,6 +5,7 @@ config M68K
> > > > > > >         select ARCH_32BIT_OFF_T
> > > > > > >         select ARCH_HAS_BINFMT_FLAT
> > > > > > >         select ARCH_HAS_DMA_PREP_COHERENT if HAS_DMA && MMU && !COLDFIRE
> > > > > > > +       select ARCH_HAS_HW_XCHG_SMALL
> > > > > >
> > > > > > M68k CPUs which support the CAS (Compare And Set) instruction do
> > > > > > support this on 8-bit, 16-bit, and 32-bit quantities.
> > > > > > M68k CPUs which lack CAS use a software implementation, which
> > > > > > supports the same quantities.
> > > > > >
> > > > > > As CAS is used only if CONFIG_RMW_INSNS=y, perhaps this needs
> > > > > > a dependency?
> > > > > OK, I think this dependency is needed.
> > > > >
> > > > > Huacai
> > > > >
> > > > > >
> > > > > >    select ARCH_HAS_HW_XCHG_SMALL if RMW_INSNS
> > > > > >
> > > > > > Gr{oetje,eeting}s,
> > > > > >
> > > > > >                         Geert
> > > > > >
> > > > > > --
> > > > > > Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@xxxxxxxxxxxxxx
> > > > > >
> > > > > > In personal conversations with technical people, I call myself a hacker. But
> > > > > > when I'm talking to journalists I just say "programmer" or something like that.
> > > > > >                                 -- Linus Torvalds
> > >
> > >
> > >
> > > --
> > > Best Regards
> > >  Guo Ren
> > >
> > > ML: https://lore.kernel.org/linux-csky/