On Sun, 26 Nov 2023 at 08:39, Guo Ren <guoren@xxxxxxxxxx> wrote:
>
> Here, what I want to improve is to prevent stack frame setup in the fast
> path, and that's the most benefit my patch could give out.

Side note: what patch do you have that avoids the stack frame setup?
Because I still saw the stack frame even with the
arch_spin_value_unlocked() fix and the improved code generation.

The compiler still does

        addi    sp,sp,-32
        sd      s0,16(sp)
        sd      s1,8(sp)
        sd      ra,24(sp)
        addi    s0,sp,32

at the top of the function for me - not because of the (now fixed)
lock value spilling, but just because it wants to save registers.

The reason seems to be that gcc isn't smart enough to delay the frame
setup to the slow path where it then has to do the actual spinlock,
so it has to generate a stack frame just for the return address, and
then it does the whole frame setup thing.

I was using just the risc-v defconfig (with the cmpxchg lockrefs
enabled, and spinlock debugging disabled so that lockrefs actually do
something), so there might be some other config thing like "force
frame pointers" that then causes problems.

But while the current tree avoids the silly lock value spill and
reload, and my patch improved the integer instruction selection, I
really couldn't get rid of the stack frame entirely.

The x86 code also ends up looking quite nice, although part of that
is that the qspinlock test is a simple compare against zero:

  lockref_get:
        pushq   %rbx
        movq    %rdi, %rbx
        movq    (%rdi), %rax
        movl    $-100, %ecx
        movabsq $4294967296, %rdx
  .LBB0_1:
        testl   %eax, %eax
        jne     .LBB0_4
        leaq    (%rax,%rdx), %rsi
        lock cmpxchgq   %rsi, (%rbx)
        je      .LBB0_5
        incl    %ecx
        jne     .LBB0_1
  .LBB0_4:
        movq    %rbx, %rdi
        callq   _raw_spin_lock
        incl    4(%rbx)
        movb    $0, (%rbx)
  .LBB0_5:
        popq    %rbx
        retq

(That 'movabsq' thing is what generates the big constant that adds
'1' in the upper word - that add is then done as a 'leaq').

In this case, the 'retry' count is actually a noticeable part of the
code generation, and is probably also why it has to save/restore
'%rbx'. Oh well. We limited the cmpxchg loop because of horrible
issues with starvation on bad arm64 cores.

It turns out that SMP cacheline bouncing is hard, and if you haven't
been doing it for a couple of decades, you'll do it wrong.

You'll find out the hard way that the same is probably true on any
early RISC-V SMP setups. You wanting to use prefetchw is a pretty
clear indication of the same kind of thing.

            Linus
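
[ For reference, the fast path being talked about above is the
  CMPXCHG_LOOP pattern in lib/lockref.c. Below is a simplified,
  hand-expanded sketch of roughly what lockref_get() boils down to;
  it is illustrative only and may not match the current tree exactly: ]

        /*
         * Illustrative sketch only: roughly what the CMPXCHG_LOOP
         * pattern in lib/lockref.c expands to for lockref_get().
         */
        void lockref_get(struct lockref *lockref)
        {
                int retry = 100;        /* the bounded retry count discussed above */
                struct lockref old;

                old.lock_count = READ_ONCE(lockref->lock_count);

                /* Fast path: treat lock+count as one 64-bit word while
                 * the embedded spinlock still looks unlocked */
                while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {
                        struct lockref new = old;

                        new.count++;    /* the add in the upper word ('leaq' above) */

                        /* on failure, 'old' is refreshed for the retry */
                        if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,
                                                         &old.lock_count,
                                                         new.lock_count)))
                                return;

                        if (!--retry)
                                break;
                }

                /* Slow path: take the spinlock for real and bump the
                 * count under it */
                spin_lock(&lockref->lock);
                lockref->count++;
                spin_unlock(&lockref->lock);
        }

[ The '$-100' constant and the 'incl %ecx'/'jne' pair in the x86
  listing above are that bounded retry counter, and the
  'lock cmpxchgq' is the 64-bit compare-and-exchange of the lock and
  count as a single word. ]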