On Sun, Nov 26, 2023 at 09:06:03AM -0800, Linus Torvalds wrote:
> On Sun, 26 Nov 2023 at 08:39, Guo Ren <guoren@xxxxxxxxxx> wrote:
> >
> > Here, what I want to improve is to prevent stack frame setup in the fast
> > path, and that's the most benefit my patch could give out.
>
> Side note: what patch do you have that avoids the stack frame setup?
> Because I still saw the stack frame even with the
> arch_spin_value_unlocked() fix and the improved code generation. The
> compiler still does
>
>         addi    sp,sp,-32
>         sd      s0,16(sp)
>         sd      s1,8(sp)
>         sd      ra,24(sp)
>         addi    s0,sp,32

I found that the following affects you:

 #define CMPXCHG_LOOP(CODE, SUCCESS) do {                                \
-        int retry = 100;                                                \
         struct lockref old;                                             \
         BUILD_BUG_ON(sizeof(old) != 8);                                 \
         old.lock_count = READ_ONCE(lockref->lock_count);                \
@@ -21,11 +20,21 @@
                                                 new.lock_count))) {     \
                         SUCCESS;                                        \
                 }                                                       \
-                if (!--retry)                                           \
-                        break;                                          \

Yes, the 'retry' count patch [1] hurts us.

[1]: https://lore.kernel.org/lkml/20190607072652.GA5522@hc/T/#m091df9dca68c27c28f8f69a72cae0e1361dba4fa

>
> at the top of the function for me - not because of the (now fixed)
> lock value spilling, but just because it wants to save registers.
>
> The reason seems to be that gcc isn't smart enough to delay the frame
> setup to the slow path where it then has to do the actual spinlock, so
> it has to generate a stack frame just for the return address and then
> it does the whole frame setup thing.
>
> I was using just the risc-v defconfig (with the cmpxchg lockrefs
> enabled, and spinlock debugging disabled so that lockrefs actually do
> something), so there might be some other config thing like "force
> frame pointers" that then causes problems.
>
> But while the current tree avoids the silly lock value spill and
> reload, and my patch improved the integer instruction selection, I
> really couldn't get rid of the stack frame entirely. The x86 code also
> ends up looking quite nice, although part of that is that the
> qspinlock test is a simple compare against zero:
>
>   lockref_get:
>         pushq   %rbx
>         movq    %rdi, %rbx
>         movq    (%rdi), %rax
>         movl    $-100, %ecx
>         movabsq $4294967296, %rdx
> .LBB0_1:
>         testl   %eax, %eax
>         jne     .LBB0_4
>         leaq    (%rax,%rdx), %rsi
>         lock
>         cmpxchgq        %rsi, (%rbx)
>         je      .LBB0_5
>         incl    %ecx
>         jne     .LBB0_1
> .LBB0_4:
>         movq    %rbx, %rdi
>         callq   _raw_spin_lock
>         incl    4(%rbx)
>         movb    $0, (%rbx)
> .LBB0_5:
>         popq    %rbx
>         retq
>
> (That 'movabsq' thing is what generates the big constant that adds '1'
> in the upper word - that add is then done as a 'leaq').
>
> In this case, the 'retry' count is actually a noticeable part of the
> code generation, and is probably also why it has to save/restore
> '%rbx'. Oh well. We limited the cmpxchg loop because of horrible
> issues with starvation on bad arm64 cores. It turns out that SMP
> cacheline bouncing is hard, and if you haven't been doing it for a
> couple of decades, you'll do it wrong.
>
> You'll find out the hard way that the same is probably true on any
> early RISC-V SMP setups. You wanting to use prefetchw is a pretty
> clear indication of the same kind of thing.
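
At the C level, the fast path shown in that x86 listing boils down to
roughly the sketch below. This is a simplified, self-contained
illustration with made-up names and compiler builtins, not the actual
lib/lockref.c code; the 100-iteration bound is the 'retry' count that
the diff above removes.

#include <stdint.h>

struct fake_lockref {
	uint64_t lock_count;	/* low 32 bits: lock word, high 32 bits: count */
};

/* Stand-in for the spin_lock()-based slow path in the real code. */
static void fake_lockref_get_slowpath(struct fake_lockref *ref)
{
	(void)ref;
}

void fake_lockref_get(struct fake_lockref *ref)
{
	uint64_t old = __atomic_load_n(&ref->lock_count, __ATOMIC_RELAXED);
	int retry = 100;	/* the bound the diff above removes */

	/* Fast path only while the lock half reads as 0 (unlocked). */
	while ((uint32_t)old == 0) {
		uint64_t new = old + (1ULL << 32);	/* count++ in the high word */

		if (__atomic_compare_exchange_n(&ref->lock_count, &old, new,
						false, __ATOMIC_RELAXED,
						__ATOMIC_RELAXED))
			return;		/* cmpxchg succeeded: done */

		/* on failure, 'old' was refreshed with the current value */
		if (!--retry)
			break;
	}

	/* Give up and fall back to taking the spinlock. */
	fake_lockref_get_slowpath(ref);
}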
The 'retry' count is a bad solution; it only hides the problem.
ThunderX2's problem is mainly unnecessary cpu_relax plus too little
cacheline "stickiness". The "Home behavior" section of the AMBA 5 CHI
spec says: [2]

  "When a Home(CIU/LLcache) determines that an Exclusive Store
   transaction has failed, the following rules must be followed:
   If the Requester has lost the cache line, then the Home is expected
   to send SnpPreferUniqueFwd or SnpPreferUnique to get a copy of the
   cache line."

SnpPreferUnique is not SnpUnique, which means it may return a shared
cache line under serious contention; there is no guarantee for the next
cmpxchg. But we want a unique cache line, right? You said: [1]

  "... And once one CPU gets ownership of the line, it doesn't lose it
   immediately, so the next cmpxchg will *succeed*. So at most, the
   *first* cmpxchg will fail (because that's the one that was fed not
   by a previous cmpxchg, but by a regular load (which we'd *like* to
   do as a "load-for-ownership" load, but we don't have the interfaces
   to do that). But the second cmpxchg should basically always
   succeed, ..."

(Sorry for quoting you like this.)

My argument is: why do we need to wait for a cmpxchg failure at all?
You do have a "load-for-ownership" interface: prefetchw! (A rough
sketch of the change is at the end of this mail.)

  lockref_get:
          pushq   %rbx
   +------prefetchw (%rdi) --------> doesn't lose it immediately,
 st|                                 so the next cmpxchg will *succeed*
 ic|                                                          - Linus
 ky|      movq    %rdi, %rbx
  t|      movq    (%rdi), %rax ------> local acquire, comfortable!
 im|      movl    $-100, %ecx
  e|      movabsq $4294967296, %rdx
   |.LBB0_1:
   |      testl   %eax, %eax
   |      jne     .LBB0_4
   |      leaq    (%rax,%rdx), %rsi
   |      lock
   |      cmpxchgq        %rsi, (%rbx) --> local cas, success!
   +----- je      .LBB0_5 ------> Farewell to the slowpath!

If x86 is not a crap machine, "movq+movq+movl+movabsq+testl+jne+leaq+
cmpxchg" should be fast enough to complete within the sticky time.

The prefetchw primitive has been defined in include/linux/prefetch.h
for many years, and it is already used in generic code:

  ➜ linux git:(master) ✗ grep prefetchw mm/ fs/ kernel/ -r
  mm/slub.c:	prefetchw(object + s->offset);
  mm/slab.c:	prefetchw(objp);
  mm/page_alloc.c:	prefetchw(p);
  mm/page_alloc.c:	prefetchw(p + 1);
  mm/vmscan.c:#define prefetchw_prev_lru_folio(_folio, _base, _field)	\
  mm/vmscan.c:		prefetchw(&prev->_field);			\
  mm/vmscan.c:#define prefetchw_prev_lru_folio(_folio, _base, _field) do { } while (0)
  mm/vmscan.c:	prefetchw_prev_lru_folio(folio, src, flags);
  fs/mpage.c:	prefetchw(&folio->flags);
  fs/f2fs/data.c:	prefetchw(&page->flags);
  fs/ext4/readpage.c:	prefetchw(&folio->flags);
  kernel/bpf/cpumap.c:	prefetchw(page);
  kernel/locking/qspinlock.c:	prefetchw(next);
  ➜ linux git:(master) ✗ grep prefetchw drivers/ -r | wc -l
  80

prefetchw is okay on all good hardware, unlike the 'retry' count.

[1]: https://lore.kernel.org/lkml/CAHk-=wiEahkwDXpoy=-SzJHNMRXKVSjPa870+eKKenufhO_Hgw@xxxxxxxxxxxxxx/raw
[2]: https://kolegite.com/EE_library/datasheets_and_manuals/FPGA/AMBA/IHI0050E_a_amba_5_chi_architecture_spec.pdf

>
>            Linus
>
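
To make the prefetchw suggestion concrete, an untested sketch against
lib/lockref.c could look like the following. lockref_get() is shown as
an example; the same one-line hint would apply to the other
CMPXCHG_LOOP users, or it could live inside the macro itself. This is
only an illustration, not a formal patch.

/*
 * Untested sketch: prefetch the lockref cacheline with write intent
 * before the fast-path load, so the first cmpxchg is fed by a line we
 * already own.
 */
#include <linux/prefetch.h>

void lockref_get(struct lockref *lockref)
{
	prefetchw(lockref);	/* "load-for-ownership" hint for lock + count */

	CMPXCHG_LOOP(
		new.count++;
	,
		return;
	);

	spin_lock(&lockref->lock);
	lockref->count++;
	spin_unlock(&lockref->lock);
}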