On Wed, Nov 22, 2023 at 11:11:38AM -0800, Linus Torvalds wrote: > On Wed, 22 Nov 2023 at 09:52, Linus Torvalds > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > > > Still not actually tested, but the code generation on x86 looks > > reasonable, so it migth be worth looking at whether it helps the > > RISC-V case. > > Doing some more munging, and actually looking at RISC-V code > generation too (I obviously had to enable ARCH_USE_CMPXCHG_LOCKREF for > RISC-V). > > I get this: > > lockref_get: > addi sp,sp,-32 > sd s0,16(sp) > sd s1,8(sp) > sd ra,24(sp) > addi s0,sp,32 > li a1,65536 > ld a5,0(a0) > mv s1,a0 > addi a1,a1,-1 > li a0,100 > .L43: > sext.w a3,a5 > li a4,1 > srliw a2,a5,16 > and a3,a3,a1 > slli a4,a4,32 > bne a2,a3,.L49 > add a4,a5,a4 > 0: > lr.d a3, 0(s1) > bne a3, a5, 1f > sc.d.rl a2, a4, 0(s1) > bnez a2, 0b > fence rw, rw > 1: > bne a5,a3,.L52 > ld ra,24(sp) > ld s0,16(sp) > ld s1,8(sp) > addi sp,sp,32 > jr ra > ... > > so now that single update is indeed just one single instruction: > > add a4,a5,a4 > > is that "increment count in the high 32 bits". > > The ticket lock being unlocked checks are those > > li a1,65536 > sext.w a3,a5 > srliw a2,a5,16 > and a3,a3,a1 > bne a2,a3,.L49 > > instructions if I read it right. > > That actually looks fairly close to optimal, although the frame setup > is kind of sad. > > (The above does not include the "loop if the cmpxchg failed" part of > the code generation) > > Anyway, apart from enabling LOCKREF, the patch to get this for RISC-V > is attached. > > I'm not going to play with this any more, but you might want to check > whether this actually does work on RISC-V. > > Becaue I only looked at the code generation, I didn't actually look at > whether it *worked*. > > Linus > From 168f35850c15468941e597907e33daacd179d54a Mon Sep 17 00:00:00 2001 > From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> > Date: Wed, 22 Nov 2023 09:33:29 -0800 > Subject: [PATCH] lockref: improve code generation for ref updates > > Our lockref data structure is two 32-bit words laid out next to each > other, combining the spinlock and the count into one entity that can be > accessed atomically together. > > In particular, the structure is laid out so that the count is the upper > 32 bit word (on little-endian), so that you can do basic arithmetic on > the count in 64 bits: instead of adding one to the 32-bit word, you can > just add a value shifted by 32 to the full 64-bit word. > > Sadly, neither gcc nor clang are quite clever enough to work that out on > their own, so this does that "manually". > > Also, try to do any compares against zero values, which generally > improves the code generation. So rather than check that the value was > at least 1 before a decrement, check that it's positive or zero after > the decrement. We don't worry about the overflow point in lockrefs. Tested-by: Guo Ren <guoren@xxxxxxxxxx> This patch could help riscv optimize 3 ALU instructions. Before the patch: 000000000000020c <lockref_get>: CMPXCHG_LOOP( 20c: 611c ld a5,0(a0) 000000000000020e <.LBB492>: 20e: 03079713 sll a4,a5,0x30 212: 0107d69b srlw a3,a5,0x10 216: 9341 srl a4,a4,0x30 218: 02e69663 bne a3,a4,244 <.L40> 000000000000021c <.LBB494>: 21c: 4207d693 sra a3,a5,0x20 -------+ 220: 02079713 sll a4,a5,0x20 | 224: 2685 addw a3,a3,1 | 226: 1682 sll a3,a3,0x20 | 228: 9301 srl a4,a4,0x20 | 22a: 8f55 or a4,a4,a3 -------+ 000000000000022c <.L0^B4>: 22c: 100536af lr.d a3,(a0) 230: 00f69763 bne a3,a5,23e <.L1^B5> 234: 1ae5362f sc.d.rl a2,a4,(a0) 238: fa75 bnez a2,22c <.L0^B4> 23a: 0330000f fence rw,rw After the patch: 000000000000020c <lockref_get>: CMPXCHG_LOOP( 20c: 611c ld a5,0(a0) 000000000000020e <.LBB526>: 20e: 03079713 sll a4,a5,0x30 212: 0107d69b srlw a3,a5,0x10 216: 9341 srl a4,a4,0x30 218: 02e69163 bne a3,a4,23a <.L40> 000000000000021c <.LBB528>: 21c: 4705 li a4,1 ------+ 21e: 1702 sll a4,a4,0x20 | 220: 973e add a4,a4,a5 ------+ 0000000000000222 <.L0^B4>: 222: 100536af lr.d a3,(a0) 226: 00f69763 bne a3,a5,234 <.L1^B5> 22a: 1ae5362f sc.d.rl a2,a4,(a0) 22e: fa75 bnez a2,222 <.L0^B4> 230: 0330000f fence rw,rw > > Cc: Guo Ren <guoren@xxxxxxxxxx> > Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> > --- > lib/lockref.c | 29 ++++++++++++++++++++--------- > 1 file changed, 20 insertions(+), 9 deletions(-) > > diff --git a/lib/lockref.c b/lib/lockref.c > index 2afe4c5d8919..f3c30c538af1 100644 > --- a/lib/lockref.c > +++ b/lib/lockref.c > @@ -26,6 +26,17 @@ > } \ > } while (0) > > +/* > + * The compiler isn't smart enough to the the count > + * increment in the high 32 bits of the 64-bit value, > + * so do this optimization by hand. > + */ > +#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64 > + #define LOCKREF_ADD(n,x) ((n).lock_count += (unsigned long)(x)<<32) > +#else > + #define LOCKREF_ADD(n,x) ((n).count += (unsigned long)(x)<<32) > +#endif > + > #else > > #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0) > @@ -42,7 +53,7 @@ > void lockref_get(struct lockref *lockref) > { > CMPXCHG_LOOP( > - new.count++; > + LOCKREF_ADD(new,1); > , > return; > ); > @@ -63,9 +74,9 @@ int lockref_get_not_zero(struct lockref *lockref) > int retval; > > CMPXCHG_LOOP( > - new.count++; > if (old.count <= 0) > return 0; > + LOCKREF_ADD(new,1); > , > return 1; > ); > @@ -91,8 +102,8 @@ int lockref_put_not_zero(struct lockref *lockref) > int retval; > > CMPXCHG_LOOP( > - new.count--; > - if (old.count <= 1) > + LOCKREF_ADD(new,-1); > + if (new.count <= 0) > return 0; > , > return 1; > @@ -119,8 +130,8 @@ EXPORT_SYMBOL(lockref_put_not_zero); > int lockref_put_return(struct lockref *lockref) > { > CMPXCHG_LOOP( > - new.count--; > - if (old.count <= 0) > + LOCKREF_ADD(new,-1); > + if (new.count < 0) > return -1; > , > return new.count; > @@ -137,8 +148,8 @@ EXPORT_SYMBOL(lockref_put_return); > int lockref_put_or_lock(struct lockref *lockref) > { > CMPXCHG_LOOP( > - new.count--; > - if (old.count <= 1) > + LOCKREF_ADD(new,-1); > + if (new.count <= 0) > break; > , > return 1; > @@ -174,9 +185,9 @@ int lockref_get_not_dead(struct lockref *lockref) > int retval; > > CMPXCHG_LOOP( > - new.count++; > if (old.count < 0) > return 0; > + LOCKREF_ADD(new,1); > , > return 1; > ); > -- > 2.43.0.5.g38fb137bdb >