Re: [PATCH RFC 1/2] arch: Introduce ARCH_HAS_HW_XCHG_SMALL

Waiman Long <llong@xxxxxxxxxx> · Mon, 26 Jul 2021 22:46:45 -0400

On 7/26/21 10:29 PM, Boqun Feng wrote:
On Tue, Jul 27, 2021 at 09:27:51AM +0800, Guo Ren wrote:
On Tue, Jul 27, 2021 at 5:21 AM Waiman Long <llong@xxxxxxxxxx> wrote:
On 7/26/21 1:03 PM, Boqun Feng wrote:
On Tue, Jul 27, 2021 at 12:41:34AM +0800, Guo Ren wrote:
On Mon, Jul 26, 2021 at 6:39 PM Boqun Feng <boqun.feng@xxxxxxxxx> wrote:
On Mon, Jul 26, 2021 at 04:56:49PM +0800, Huacai Chen wrote:
Hi, Geert,

On Mon, Jul 26, 2021 at 4:36 PM Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> wrote:
Hi Huacai,

On Sat, Jul 24, 2021 at 2:36 PM Huacai Chen <chenhuacai@xxxxxxxxxxx> wrote:
Introduce a new Kconfig option ARCH_HAS_HW_XCHG_SMALL, which means arch
has hardware sub-word xchg/cmpxchg support. This option will be used as
an indicator to select the bit-field definition in the qspinlock data
structure.

Signed-off-by: Huacai Chen <chenhuacai@xxxxxxxxxxx>
Thanks for your patch!

--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -228,6 +228,10 @@ config ARCH_HAS_FORTIFY_SOURCE
            An architecture should select this when it can successfully
            build and run with CONFIG_FORTIFY_SOURCE.

+# Select if arch has hardware sub-word xchg/cmpxchg support
+config ARCH_HAS_HW_XCHG_SMALL
What do you mean by "hardware"?
Does a software fallback count?
This new option is supposed as an indicator to select bit-field
definition of qspinlock, software fallback is not helpful in this
case.

I don't think this is true. IIUC, the rationale of the config is that
for some architectures, since the architectural cmpxchg doesn't provide
forward-progress guarantee then using cmpxchg of machine-word to
implement xchg{8,16}() may cause livelock, therefore these architectures
don't want to provide xchg{8,16}(), as a result they cannot work with
qspinlock when _Q_PENDING_BITS is 8.

So as long as an architecture can provide and has already provided an
implementation of xchg{8,16}() which guarantee forward-progress (even
though the implementation is using a machine-word size cmpxchg), the
architecture doesn't need to select ARCH_HAS_HW_XCHG_SMALL.
Seems only atomic could provide forward progress, isn't it? And lr/sc
couldn't implement xchg/cmpxchg primitive properly.

I'm missing you point here, a) ll/sc can provide forward progress and b)
ll/sc instructions are used to implement xchg/cmpxchg (see ARM64 and
PPC).

How to make CPU guarantee  "load + cmpxchg" forward-progress? Fusion
these instructions and lock the snoop channel?
Maybe hardware guys would think that it's easier to implement cas +
dcas + amo(short & byte).

Please note that if _Q_PENDING_BITS == 1, then the xchg_tail() is
implemented as a "load + cmpxchg", so if "load + cmpxchg" implementation
of xchg16() doesn't provide forward-progress in an architecture, neither
does xchg_tail().
Agreed. The xchg_tail() for the "_Q_PENDING_BITS == 1" case is a
software emulation of xchg16(). Pure software emulation like that does
not provide forward progress guarantee. This is usually not a big
problem for non-RT kernel for which occasional long latency is
acceptable, but it is not good for RT kernel.
"How to implement xchg_tail" shouldn't force with _Q_PENDING_BITS, but
the arch could choose.
I actually agree with this part, but this patchset failed to provide
enough evidences on why we should choose xchg_tail() implementation
based on whether hardware has xchg16, more precisely, for an archtecture
which doesn't have a hardware xchg16, why cmpxchg emulated xchg16() is
worse than a "load+cmpxchg) implemeneted xchg_tail()? If it's a
performance reason, please show some numbers.

In fact, why don't you introduce a ARCH_QSPINLOCK_USE_GENERIC_XCHG_TAIL,
and only select it for csky and risc-v, and let other archs choose to
select or it themselves? FWIW, qspinlock code looks like something
below with this config:

  #if (CONFIG_NR_CPUS < (1U << 14)) && !defined(CONFIG_ARCH_QSPINLOCK_USE_GENERIC_XCHG_TAIL)
  #define _Q_PENDING_BITS                8
  #else
  #define _Q_PENDING_BITS                1

Just my two cents.

The patch sent out by Guo Ren was actually similar to this approach [1]. 
I actually like it better than this patch because the pending bit 
handling is more optimized with "_Q_PENDING_BITS == 8". So we shouldn't 
force _Q_PENDING_BITS to 1 if we just want to avoid using xchg16().

[1] 
https://lore.kernel.org/lkml/1616868399-82848-4-git-send-email-guoren@xxxxxxxxxx/

Cheers,
Longman