On Sun, 26 Nov 2023 at 08:39, Guo Ren <guoren@xxxxxxxxxx> wrote:
>
> Here, what I want to improve is to prevent stack frame setup in the fast
> path, and that's the most benefit my patch could give out.

Side note: what patch do you have that avoids the stack frame setup?
Because I still saw the stack frame even with the
arch_spin_value_unlocked() fix and the improved code generation.

The compiler still does

        addi    sp,sp,-32
        sd      s0,16(sp)
        sd      s1,8(sp)
        sd      ra,24(sp)
        addi    s0,sp,32

at the top of the function for me - not because of the (now fixed)
lock value spilling, but just because it wants to save registers.

The reason seems to be that gcc isn't smart enough to delay the frame
setup to the slow path where it then has to do the actual spinlock,
so it has to generate a stack frame just for the return address, and
then it does the whole frame setup thing.

I was using just the risc-v defconfig (with the cmpxchg lockrefs
enabled, and spinlock debugging disabled so that lockrefs actually do
something), so there might be some other config thing like "force
frame pointers" that then causes problems.

But while the current tree avoids the silly lock value spill and
reload, and my patch improved the integer instruction selection, I
really couldn't get rid of the stack frame entirely.

The x86 code also ends up looking quite nice, although part of that
is that the qspinlock test is a simple compare against zero:

  lockref_get:
        pushq   %rbx
        movq    %rdi, %rbx
        movq    (%rdi), %rax
        movl    $-100, %ecx
        movabsq $4294967296, %rdx
  .LBB0_1:
        testl   %eax, %eax
        jne     .LBB0_4
        leaq    (%rax,%rdx), %rsi
        lock cmpxchgq   %rsi, (%rbx)
        je      .LBB0_5
        incl    %ecx
        jne     .LBB0_1
  .LBB0_4:
        movq    %rbx, %rdi
        callq   _raw_spin_lock
        incl    4(%rbx)
        movb    $0, (%rbx)
  .LBB0_5:
        popq    %rbx
        retq

(That 'movabsq' thing is what generates the big constant that adds
'1' in the upper word - that add is then done as a 'leaq').

In this case, the 'retry' count is actually a noticeable part of the
code generation, and is probably also why it has to save/restore
'%rbx'. Oh well. We limited the cmpxchg loop because of horrible
issues with starvation on bad arm64 cores.

It turns out that SMP cacheline bouncing is hard, and if you haven't
been doing it for a couple of decades, you'll do it wrong.

You'll find out the hard way that the same is probably true on any
early RISC-V SMP setups. You wanting to use prefetchw is a pretty
clear indication of the same kind of thing.

            Linus
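
[ For reference, the fast path being talked about above is the
  CMPXCHG_LOOP pattern in lib/lockref.c. Below is a simplified,
  hand-expanded sketch of roughly what lockref_get() boils down to;
  it is illustrative only and may not match the current tree exactly: ]

        /*
         * Illustrative sketch only: roughly what the CMPXCHG_LOOP
         * pattern in lib/lockref.c expands to for lockref_get().
         */
        void lockref_get(struct lockref *lockref)
        {
                int retry = 100;        /* the bounded retry count discussed above */
                struct lockref old;

                old.lock_count = READ_ONCE(lockref->lock_count);

                /* Fast path: treat lock+count as one 64-bit word while
                 * the embedded spinlock still looks unlocked */
                while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {
                        struct lockref new = old;

                        new.count++;    /* the add in the upper word ('leaq' above) */

                        /* on failure, 'old' is refreshed for the retry */
                        if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,
                                                         &old.lock_count,
                                                         new.lock_count)))
                                return;

                        if (!--retry)
                                break;
                }

                /* Slow path: take the spinlock for real and bump the
                 * count under it */
                spin_lock(&lockref->lock);
                lockref->count++;
                spin_unlock(&lockref->lock);
        }

[ The '$-100' constant and the 'incl %ecx'/'jne' pair in the x86
  listing above are that bounded retry counter, and the
  'lock cmpxchgq' is the 64-bit compare-and-exchange of the lock and
  count as a single word. ]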