On Wed, Nov 22, 2023 at 09:20:53AM -0800, Linus Torvalds wrote:
> On Tue, 21 Nov 2023 at 23:19, Guo Ren <guoren@xxxxxxxxxx> wrote:
> >
> > We discussed x86 qspinlock code generation. It looked not too bad as I
> > thought because qspinlock_spin_value_unlocked is much cheaper than the
> > ticket-lock. But the riscv ticket-lock code generation is terrible
> > because of the shift left & right 16-bit.
> > https://lore.kernel.org/all/ZNG2tHFOABSXGCVi@xxxxxxxxx
>
> No, it's not the 16-bit shifts in the spin_value_unlocked() check,
> that just generates simple and straightforward code:
>
>   a0:   0107569b                srlw    a3,a4,0x10
>   a4:   00c77733                and     a4,a4,a2
>   a8:   04e69063                bne     a3,a4,e8 <.L12>
>
> (plus two stupid instructions for generating the immediate in a2 for
> 0xffff, but hey, that's the usual insane RISC-V encoding thing - you
> can load a 20-bit U-immediate only shifted up by 12, if it's in the
> lower bits you're kind of screwed and limited to 12-bit immediates).
>
> The *bad* code generation is from the much simpler
>
>         new.count++;
>
> which sadly neither gcc not clang is quite smart enough to understand
> that "hey, I can do that in 64 bits".
>
> It's incrementing the higher 32-bit word in a 64-bit union, and with a
> smarter compiler it *should* basically become
>
>         lock_count += 1 << 32;
>
> but the compiler isn't that clever, so it splits the 64-bit word into
> two 32-bit words, increments one of them, and then merges the two
> words back into 64 bits:
>
>   98:   4207d693                sra     a3,a5,0x20
>   9c:   02079713                sll     a4,a5,0x20
>   a0:   0016869b                addw    a3,a3,1
>   a4:   02069693                sll     a3,a3,0x20
>   a8:   02075713                srl     a4,a4,0x20
>   ac:   00d76733                or      a4,a4,a3
>
> which is pretty sad.
The instructions at 9c and a8 are just a word zero-extend; RISC-V will
have a zext.w instruction for that in the future. Your patch may
improve the sequence above to:

	li	a4,1
	slli	a4,a4,32
	add	a4,a5,a4

versus:

	sra	a3,a5,0x20
	zext.w	a4,a5
	addw	a3,a3,1
	or	a4,a4,a3

So you win one instruction, the "or a4,a4,a3", which is less than one
cycle. The zext.w instruction matters because it could replace a lot of
these sll+srl pairs, so I think this is a shortcoming of the current
ISA.

What I actually want to improve here is to avoid the stack frame setup
in the fast path; that is the biggest benefit my patch can give.
Unnecessary memory accesses are the most important performance killer
in SMP, and my patch removes the stack frame setup from the fast path.
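For illustration only (this is just the shape of the idea, not
necessarily how my patch does it, and the helper name is made up): one
way to keep the fast path frame-free is to push the locked fallback out
of line, so the cmpxchg loop stays in a leaf and the compiler only has
to build a frame on the slow path (assuming it emits a tail call here):

	/* Hypothetical sketch in the style of lib/lockref.c */
	static noinline void lockref_get_slow(struct lockref *lockref)
	{
		spin_lock(&lockref->lock);
		lockref->count++;
		spin_unlock(&lockref->lock);
	}

	void lockref_get(struct lockref *lockref)
	{
		CMPXCHG_LOOP(
			new.count++;
		,
			return;		/* fast path: no spills, no frame */
		);
		lockref_get_slow(lockref);	/* out-of-line slow path */
	}

In the listing below, note that sp, ra and s0 are only touched after
the branch to <.L12>, i.e. only the slow path pays for the frame setup.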
void lockref_get(struct lockref *lockref)
{
  78:   00053783                ld      a5,0(a0)

000000000000007c <.LBB212>:
  7c:   00010637                lui     a2,0x10

0000000000000080 <.LBE212>:
  80:   06400593                li      a1,100

0000000000000084 <.LBB216>:
  84:   fff60613                add     a2,a2,-1 # ffff <.LLST8+0xf4aa>

0000000000000088 <.L8>:
  88:   0007871b                sext.w  a4,a5

000000000000008c <.LBB217>:
  8c:   0107d69b                srlw    a3,a5,0x10
  90:   00c77733                and     a4,a4,a2
  94:   04e69063                bne     a3,a4,d4 <.L12> ------------------+
                                                                          |
0000000000000098 <.LBB218>:                                               |
  98:   4207d693                sra     a3,a5,0x20                        |
  9c:   02079713                sll     a4,a5,0x20                        |
  a0:   0016869b                addw    a3,a3,1                           |
  a4:   02069693                sll     a3,a3,0x20                        |
  a8:   02075713                srl     a4,a4,0x20                        |
  ac:   00d76733                or      a4,a4,a3                          |
                                                                          |
00000000000000b0 <.L0^B1>:                                                |
  b0:   100536af                lr.d    a3,(a0)                           |
  b4:   00f69863                bne     a3,a5,c4 <.L1^B1>                 |
  b8:   1ae5382f                sc.d.rl a6,a4,(a0)                        |
  bc:   fe081ae3                bnez    a6,b0 <.L0^B1>                    |
  c0:   0330000f                fence   rw,rw                             |
                                                                          |
00000000000000c4 <.L1^B1>:                                                |
  c4:   04d78a63                beq     a5,a3,118 <.L18>                  |
                                                                          |
00000000000000c8 <.LBE228>:                                               |
  c8:   fff5859b                addw    a1,a1,-1                          |
                                                                          |
00000000000000cc <.LBB229>:                                               |
  cc:   00068793                mv      a5,a3                             |
                                                                          |
00000000000000d0 <.LBE229>:                                               |
  d0:   fa059ce3                bnez    a1,88 <.L8>                       |
                                                                          |
00000000000000d4 <.L12>:  <-----------------------------------------------+
{ slow_path
  d4:   fe010113                add     sp,sp,-32
  d8:   00113c23                sd      ra,24(sp)
  dc:   00813823                sd      s0,16(sp)
  e0:   02010413                add     s0,sp,32

>
> If you want to do the optimization that the compiler misses by hand,
> it would be something like the attached patch.
>
> NOTE! Very untested. But that *should* cause the compiler to just
> generate a single "add" instruction (in addition to generating the
> constant 0x100000000, of course).
>
> Of course, on a LL/SC architecture like RISC-V, in an *optimal* world,
> the whole sequence would actually be done with one single LL/SC,
> rather than the "load,add,cmpxchg" thing.
>
> But then you'd have to do absolutely everything by hand in assembly.
No, it's not worth doing that:

 - Linux only has atomic primitives; there is no ll/sc primitive in the
   real world. The world belongs to AMO, and the only use of ll/sc is
   to implement AMO and CAS.
 - Using a single ll/sc sequence instead of cmpxchg is similar to your
   patch; you may win one cycle, or not.
 - The critical work here is reducing bus transactions, preventing the
   cache-line dance, and guaranteeing forward progress.

Here is my optimization advice:

#define CMPXCHG_LOOP(CODE, SUCCESS) do {					\
	int retry = 100;							\
	struct lockref old;							\
	BUILD_BUG_ON(sizeof(old) != 8);						\
+	prefetchw(lockref);							\
	old.lock_count = READ_ONCE(lockref->lock_count);			\
	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {	\
		struct lockref new = old;					\
		CODE								\
		if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,		\
						 &old.lock_count,		\
						 new.lock_count))) {		\
			SUCCESS;						\
		}								\

The micro-architecture could give prefetchw stronger guarantees:

 - prefetch.w must guarantee cache-line exclusiveness, even on a hit to
   a cache line in the shareable state.
 - Hold the exclusive cache line for several cycles, until the next
   store or a timeout.
 - Mask interrupts during the holding cycles (optional).

On such a micro-architecture the lockref slow path is effectively
eliminated, which means the spinlock is never executed.

I've written down more details in my slides:
https://docs.google.com/presentation/d/1UudBcj4cL_cjJexMpZNF9ppRzYxeYqsdBotIvU7sO2Q/edit?usp=sharing

This kind of prefetchw could also help larger atomic operations within
one cache line. Compared to a transactional memory model, prefetchw can
give a forward progress guarantee and lands in Linux more easily,
without requiring any new primitive.
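For completeness, here is a sketch of what prefetchw() could map to on
RISC-V. This is only an illustration assuming the Zicbop extension and
a toolchain that understands ".option arch"; it is not an existing
kernel definition:

	/* Hypothetical arch-side override; Zicbop support assumed. */
	#define ARCH_HAS_PREFETCHW
	static inline void prefetchw(const void *ptr)
	{
		__asm__ volatile(".option push\n"
				 ".option arch, +zicbop\n"
				 /* prefetch for store; offset is encoded in multiples of 32 */
				 "prefetch.w 0(%0)\n"
				 ".option pop\n"
				 : : "r" (ptr));
	}

Whatever the final encoding looks like, the stronger guarantees listed
above (exclusive fill, a short hold, optional interrupt masking) are
purely micro-architectural; they would not change this instruction
sequence, only its behaviour.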
>
>               Linus
> lib/lockref.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/lib/lockref.c b/lib/lockref.c
> index 2afe4c5d8919..481b102a6476 100644
> --- a/lib/lockref.c
> +++ b/lib/lockref.c
> @@ -26,6 +26,17 @@
>  	}								\
>  } while (0)
> 
> +/*
> + * The compiler isn't smart enough to the the count
> + * increment in the high 32 bits of the 64-bit value,
> + * so do this optimization by hand.
> + */
> +#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64
> + #define LOCKREF_INC(n) ((n).lock_count += 1ul<<32)
> +#else
> + #define LOCKREF_INC(n) ((n).count++)
> +#endif
> +
>  #else
> 
>  #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0)
> @@ -42,7 +53,7 @@
>  void lockref_get(struct lockref *lockref)
>  {
>  	CMPXCHG_LOOP(
> -		new.count++;
> +		LOCKREF_INC(new);
>  	,
>  		return;
>  	);
> @@ -63,7 +74,7 @@
>  int lockref_get_not_zero(struct lockref *lockref)
>  {
>  	int retval;
> 
>  	CMPXCHG_LOOP(
> -		new.count++;
> +		LOCKREF_INC(new);
>  		if (old.count <= 0)
>  			return 0;
>  	,
> @@ -174,7 +185,7 @@
>  int lockref_get_not_dead(struct lockref *lockref)
>  {
>  	int retval;
> 
>  	CMPXCHG_LOOP(
> -		new.count++;
> +		LOCKREF_INC(new);
>  		if (old.count < 0)
>  			return 0;
>  	,
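For reference, the reason the constant targets the high 32 bits (and
why the #if above is little-endian only) is the lockref layout, which
is roughly this (from include/linux/lockref.h):

	struct lockref {
		union {
			__aligned_u64 lock_count;
			struct {
				spinlock_t lock;	/* low 32 bits on little-endian */
				int count;		/* high 32 bits on little-endian */
			};
		};
	};

On a 64-bit little-endian machine, "lock_count += 1ul<<32" therefore
increments count without touching the lock half at all; on big-endian,
count sits in the low word instead, so the generic count++ is kept.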