On Sun, Nov 26, 2023 at 11:39:37AM -0500, Guo Ren wrote:
> On Wed, Nov 22, 2023 at 09:20:53AM -0800, Linus Torvalds wrote:
> > On Tue, 21 Nov 2023 at 23:19, Guo Ren <guoren@xxxxxxxxxx> wrote:
> > >
> > > We discussed x86 qspinlock code generation. It didn't look as bad
> > > as I thought, because qspinlock_spin_value_unlocked is much cheaper
> > > than the ticket-lock. But the riscv ticket-lock code generation is
> > > terrible because of the 16-bit shift left & right.
> > > https://lore.kernel.org/all/ZNG2tHFOABSXGCVi@xxxxxxxxx
> >
> > No, it's not the 16-bit shifts in the spin_value_unlocked() check,
> > that just generates simple and straightforward code:
> >
> >   a0:   0107569b     srlw    a3,a4,0x10
> >   a4:   00c77733     and     a4,a4,a2
> >   a8:   04e69063     bne     a3,a4,e8 <.L12>
> >
> > (plus two stupid instructions for generating the immediate in a2 for
> > 0xffff, but hey, that's the usual insane RISC-V encoding thing - you
> > can load a 20-bit U-immediate only shifted up by 12; if it's in the
> > lower bits you're kind of screwed and limited to 12-bit immediates).
> >
> > The *bad* code generation is from the much simpler
> >
> >         new.count++;
> >
> > which sadly neither gcc nor clang is quite smart enough to understand
> > that "hey, I can do that in 64 bits".
> >
> > It's incrementing the higher 32-bit word in a 64-bit union, and with a
> > smarter compiler it *should* basically become
> >
> >         lock_count += 1 << 32;
> >
> > but the compiler isn't that clever, so it splits the 64-bit word into
> > two 32-bit words, increments one of them, and then merges the two
> > words back into 64 bits:
> >
> >   98:   4207d693     sra     a3,a5,0x20
> >   9c:   02079713     sll     a4,a5,0x20
> >   a0:   0016869b     addw    a3,a3,1
> >   a4:   02069693     sll     a3,a3,0x20
> >   a8:   02075713     srl     a4,a4,0x20
> >   ac:   00d76733     or      a4,a4,a3
> >
> > which is pretty sad.
> 9c & a8 are for word zero-extension; riscv will have zext.w in the
> future. Your patch may improve the above to:
>         li      a4,1
>         slli    a4,a4,32
>         add     a4,a5,a4
>
> v.s.
>         sra     a3,a5,0x20
>         zext.w  a4,a5
>         addw    a3,a3,1
>         or      a4,a4,a3
> You win one instruction "or a4,a4,a3", which is less than one cycle.
Sorry, I forgot "sll a3,a3,0x20", so it's 1.5 cycles, but that doesn't
affect my opinion here; local core operations are a lower optimization
priority than memory transactions.

>
> The zext.w instruction is important, and it could replace many sll+srl
> pairs, so I think its absence is a shortcoming of the current ISA
> design.
>
> Here, what I want to improve is preventing the stack frame setup in
> the fast path; that's the main benefit my patch gives. Unnecessary
> memory accesses are the most important performance killer in SMP.
>
> My patch removes the stack frame setup from the fast path:
> void lockref_get(struct lockref *lockref)
> {
>   78:   00053783     ld      a5,0(a0)
>
> 000000000000007c <.LBB212>:
>   7c:   00010637     lui     a2,0x10
>
> 0000000000000080 <.LBE212>:
>   80:   06400593     li      a1,100
>
> 0000000000000084 <.LBB216>:
>   84:   fff60613     add     a2,a2,-1 # ffff <.LLST8+0xf4aa>
>
> 0000000000000088 <.L8>:
>   88:   0007871b     sext.w  a4,a5
>
> 000000000000008c <.LBB217>:
>   8c:   0107d69b     srlw    a3,a5,0x10
>   90:   00c77733     and     a4,a4,a2
>   94:   04e69063     bne     a3,a4,d4 <.L12>  ----------------------+
>                                                                     |
> 0000000000000098 <.LBB218>:                                         |
>   98:   4207d693     sra     a3,a5,0x20                             |
>   9c:   02079713     sll     a4,a5,0x20                             |
>   a0:   0016869b     addw    a3,a3,1                                |
>   a4:   02069693     sll     a3,a3,0x20                             |
>   a8:   02075713     srl     a4,a4,0x20                             |
>   ac:   00d76733     or      a4,a4,a3                               |
>                                                                     |
> 00000000000000b0 <.L0^B1>:                                          |
>   b0:   100536af     lr.d    a3,(a0)                                |
>   b4:   00f69863     bne     a3,a5,c4 <.L1^B1>                      |
>   b8:   1ae5382f     sc.d.rl a6,a4,(a0)                             |
>   bc:   fe081ae3     bnez    a6,b0 <.L0^B1>                         |
>   c0:   0330000f     fence   rw,rw                                  |
>                                                                     |
> 00000000000000c4 <.L1^B1>:                                          |
>   c4:   04d78a63     beq     a5,a3,118 <.L18>                       |
>                                                                     |
> 00000000000000c8 <.LBE228>:                                         |
>   c8:   fff5859b     addw    a1,a1,-1                               |
>                                                                     |
> 00000000000000cc <.LBB229>:                                         |
>   cc:   00068793     mv      a5,a3                                  |
>                                                                     |
> 00000000000000d0 <.LBE229>:                                         |
>   d0:   fa059ce3     bnez    a1,88 <.L8>                            |
>                                                                     |
> 00000000000000d4 <.L12>:   <----------------------------------------+
> {   slow_path
>   d4:   fe010113     add     sp,sp,-32
>   d8:   00113c23     sd      ra,24(sp)
>   dc:   00813823     sd      s0,16(sp)
>   e0:   02010413     add     s0,sp,32
> >
> >
> > If you want to do the optimization that the compiler misses by hand,
> > it would be something like the attached patch.
> >
> > NOTE! Very untested. But that *should* cause the compiler to just
> > generate a single "add" instruction (in addition to generating the
> > constant 0x100000000, of course).
> >
> > Of course, on a LL/SC architecture like RISC-V, in an *optimal*
> > world, the whole sequence would actually be done with one single
> > LL/SC, rather than the "load,add,cmpxchg" thing.
> >
> > But then you'd have to do absolutely everything by hand in assembly.
> No, it's not worth doing that:
>  - There are only atomic primitives in Linux, and no ll/sc primitive
>    in the real world. The world belongs to AMO, and the only use of
>    ll/sc is to implement AMO and CAS.
>  - Using a single ll/sc sequence instead of cmpxchg is similar to your
>    patch; you may win one cycle, or not.
>  - The critical work here is reducing bus transactions, preventing the
>    cache dance, and guaranteeing forward progress.
>
> Here is my optimization advice:
>
> #define CMPXCHG_LOOP(CODE, SUCCESS) do {                                  \
>         int retry = 100;                                                  \
>         struct lockref old;                                               \
>         BUILD_BUG_ON(sizeof(old) != 8);                                   \
> +       prefetchw(lockref);                                               \
>         old.lock_count = READ_ONCE(lockref->lock_count);                  \
>         while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \
>                 struct lockref new = old;                                 \
>                 CODE                                                      \
>                 if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,    \
>                                                  &old.lock_count,         \
>                                                  new.lock_count))) {      \
>                         SUCCESS;                                          \
>                 }                                                         \
>
> The micro-architecture could give prefetchw stronger guarantees:
>  - prefetch.w must guarantee cache-line exclusiveness, even on a hit
>    to a cache line in the shareable state.
>  - Hold the exclusive cache line for several cycles, until the next
>    store or a timeout.
>  - Mask interrupts during the holding cycles (optional).
>
> On such a micro-architecture the lockref slow path is killed, which
> means there is no chance of ever executing the spinlock.
>
> I've written down more details in my ppt:
> https://docs.google.com/presentation/d/1UudBcj4cL_cjJexMpZNF9ppRzYxeYqsdBotIvU7sO2Q/edit?usp=sharing
>
> This type of prefetchw could help large-size atomic operations within
> one cache line. Compared to the transactional memory model, prefetchw
> can give a forward-progress guarantee and land in Linux more easily,
> without requiring any new primitive.
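
Just to illustrate the shape of that fast path outside the kernel, here
is a minimal userspace sketch, assuming C11 atomics and using
__builtin_prefetch(..., 1) as a stand-in for the kernel's prefetchw().
The "unlocked" test and the field layout are simplified, and the names
(fake_lockref, fake_lockref_get) are made up for the example; it is not
the real lockref code.

#include <stdatomic.h>
#include <stdint.h>

struct fake_lockref {
	/* low 32 bits: lock word, high 32 bits: count (little-endian) */
	_Atomic uint64_t lock_count;
};

static int fake_lockref_get(struct fake_lockref *ref)
{
	int retry = 100;

	/* Write-intent prefetch: ask for the cache line in exclusive state
	   before touching it, so the cmpxchg below does not have to upgrade
	   the line from shared to exclusive later. */
	__builtin_prefetch(ref, 1, 3);

	uint64_t old = atomic_load_explicit(&ref->lock_count,
					    memory_order_relaxed);
	/* Treat a zero lock word as "unlocked" in this sketch; the real
	   code calls arch_spin_value_unlocked() on the embedded lock. */
	while ((uint32_t)old == 0) {
		uint64_t new_val = old + (1ull << 32);  /* count++ in the high word */
		if (atomic_compare_exchange_weak_explicit(&ref->lock_count,
							  &old, new_val,
							  memory_order_relaxed,
							  memory_order_relaxed))
			return 1;	/* fast path: spinlock never touched */
		if (--retry == 0)
			break;		/* give up, take the slow path */
	}
	return 0;
}

The only job of the prefetch hint is to get the line into an exclusive
state before the load, which is exactly the bus-transaction saving the
paragraph above argues for.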
> >
> >                   Linus
>
> >  lib/lockref.c | 17 ++++++++++++++---
> >  1 file changed, 14 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/lockref.c b/lib/lockref.c
> > index 2afe4c5d8919..481b102a6476 100644
> > --- a/lib/lockref.c
> > +++ b/lib/lockref.c
> > @@ -26,6 +26,17 @@
> >  	}								\
> >  } while (0)
> >  
> > +/*
> > + * The compiler isn't smart enough to do the count
> > + * increment in the high 32 bits of the 64-bit value,
> > + * so do this optimization by hand.
> > + */
> > +#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64
> > + #define LOCKREF_INC(n) ((n).lock_count += 1ul<<32)
> > +#else
> > + #define LOCKREF_INC(n) ((n).count++)
> > +#endif
> > +
> >  #else
> >  
> >  #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0)
> > @@ -42,7 +53,7 @@
> >  void lockref_get(struct lockref *lockref)
> >  {
> >  	CMPXCHG_LOOP(
> > -		new.count++;
> > +		LOCKREF_INC(new);
> >  	,
> >  		return;
> >  	);
> > @@ -63,7 +74,7 @@ int lockref_get_not_zero(struct lockref *lockref)
> >  	int retval;
> >  
> >  	CMPXCHG_LOOP(
> > -		new.count++;
> > +		LOCKREF_INC(new);
> >  		if (old.count <= 0)
> >  			return 0;
> >  	,
> > @@ -174,7 +185,7 @@ int lockref_get_not_dead(struct lockref *lockref)
> >  	int retval;
> >  
> >  	CMPXCHG_LOOP(
> > -		new.count++;
> > +		LOCKREF_INC(new);
> >  		if (old.count < 0)
> >  			return 0;
> >  	,
> >
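
To see why the little-endian/64-bit LOCKREF_INC() above is the same
operation as the generic new.count++, here is a small standalone demo
(a sketch, not kernel code): the union below only mimics struct
lockref's layout, and the names are made up for the example.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Mimics struct lockref on a 64-bit little-endian target: the 32-bit
   count sits in the high half of the 64-bit lock_count word. */
union fake_lockref {
	uint64_t lock_count;
	struct {
		uint32_t lock;   /* low 32 bits on little-endian  */
		int32_t  count;  /* high 32 bits on little-endian */
	};
};

int main(void)
{
	union fake_lockref a = { .lock_count = 5 };  /* arbitrary lock word */
	union fake_lockref b = a;

	a.count++;                      /* what the generic code does */
	b.lock_count += 1ull << 32;     /* what LOCKREF_INC() does    */

	assert(a.lock_count == b.lock_count);
	printf("both are %#llx\n", (unsigned long long)a.lock_count);
	return 0;
}

For the second form the compiler only has to emit a single 64-bit add
(plus materializing the 0x100000000 constant), which is exactly what
the patch forces by hand.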