On Tue, 21 Nov 2023 at 23:19, Guo Ren <guoren@xxxxxxxxxx> wrote:
>
> We discussed x86 qspinlock code generation. It looked not too bad as I
> thought because qspinlock_spin_value_unlocked is much cheaper than the
> ticket-lock. But the riscv ticket-lock code generation is terrible
> because of the shift left & right 16-bit.
> https://lore.kernel.org/all/ZNG2tHFOABSXGCVi@xxxxxxxxx

No, it's not the 16-bit shifts in the spin_value_unlocked() check, that
just generates simple and straightforward code:

  a0:	0107569b	srlw	a3,a4,0x10
  a4:	00c77733	and	a4,a4,a2
  a8:	04e69063	bne	a3,a4,e8 <.L12>

(plus two stupid instructions for generating the 0xffff immediate in a2,
but hey, that's the usual insane RISC-V encoding thing - you can load a
20-bit U-immediate only shifted up by 12, and if the value is in the
lower bits you're kind of screwed and limited to 12-bit immediates).

The *bad* code generation is from the much simpler

	new.count++;

which sadly neither gcc nor clang is quite smart enough to realize it
can do as a single 64-bit operation.

It's incrementing the higher 32-bit word in a 64-bit union, and with a
smarter compiler it *should* basically become

	lock_count += 1ul << 32;

but the compiler isn't that clever, so it splits the 64-bit word into
two 32-bit words, increments one of them, and then merges the two words
back into 64 bits:

  98:	4207d693	sra	a3,a5,0x20
  9c:	02079713	sll	a4,a5,0x20
  a0:	0016869b	addw	a3,a3,1
  a4:	02069693	sll	a3,a3,0x20
  a8:	02075713	srl	a4,a4,0x20
  ac:	00d76733	or	a4,a4,a3

which is pretty sad.

If you want to do the optimization that the compiler misses by hand, it
would be something like the attached patch.

NOTE! Very untested. But that *should* cause the compiler to just
generate a single "add" instruction (in addition to generating the
constant 0x100000000, of course).

Of course, on an LL/SC architecture like RISC-V, in an *optimal* world
the whole sequence would actually be done with one single LL/SC, rather
than the "load, add, cmpxchg" thing.

But then you'd have to do absolutely everything by hand in assembly.

		Linus
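For reference, here is a minimal user-space sketch of why the high-word
add is the same thing as count++: on a 64-bit little-endian target the
spinlock occupies the low 32 bits of lock_count and the count the high
32 bits. The struct below is only a simplified stand-in for the real
struct lockref in include/linux/lockref.h (which holds a spinlock_t, not
a bare u32), so treat it as an illustration, not kernel code:

	#include <assert.h>
	#include <stdint.h>
	#include <stdio.h>

	struct fake_lockref {
		union {
			uint64_t lock_count;
			struct {
				uint32_t lock;	/* stand-in for spinlock_t */
				int count;
			};
		};
	};

	int main(void)
	{
		struct fake_lockref a = { 0 };
		struct fake_lockref b;

		a.count = 41;
		b = a;

		a.count++;			/* what the C source says */
		b.lock_count += 1ull << 32;	/* the by-hand version */

		assert(a.lock_count == b.lock_count);
		printf("count=%d lock_count=%#llx\n",
		       a.count, (unsigned long long)a.lock_count);
		return 0;
	}

On a little-endian 64-bit machine the assert holds; on big-endian it
would not, which is why the patch below keys the by-hand version off
__LITTLE_ENDIAN && BITS_PER_LONG == 64.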
 lib/lockref.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/lib/lockref.c b/lib/lockref.c
index 2afe4c5d8919..481b102a6476 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -26,6 +26,17 @@
 	}								\
 } while (0)
 
+/*
+ * The compiler isn't smart enough to do the count
+ * increment in the high 32 bits of the 64-bit value,
+ * so do this optimization by hand.
+ */
+#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64
+ #define LOCKREF_INC(n) ((n).lock_count += 1ul<<32)
+#else
+ #define LOCKREF_INC(n) ((n).count++)
+#endif
+
 #else
 
 #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0)
@@ -42,7 +53,7 @@
 void lockref_get(struct lockref *lockref)
 {
 	CMPXCHG_LOOP(
-		new.count++;
+		LOCKREF_INC(new);
 	,
 		return;
 	);
@@ -63,7 +74,7 @@ int lockref_get_not_zero(struct lockref *lockref)
 	int retval;
 
 	CMPXCHG_LOOP(
-		new.count++;
+		LOCKREF_INC(new);
 		if (old.count <= 0)
 			return 0;
 	,
@@ -174,7 +185,7 @@ int lockref_get_not_dead(struct lockref *lockref)
 	int retval;
 
 	CMPXCHG_LOOP(
-		new.count++;
+		LOCKREF_INC(new);
 		if (old.count < 0)
 			return 0;
 	,
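To make the LL/SC point above concrete, here is a rough sketch of what a
single LR/SC fast path could look like on RV64. Illustrative only: the
register choices are arbitrary, and the "is the spinlock actually
unlocked?" test on the low 32 bits is waved away because it depends on
the spinlock flavour (ticket vs. queued):

	li	a2, 1
	slli	a2, a2, 32		# a2 = 0x100000000, the count increment
1:	lr.d	a4, (a0)		# a0 = &lockref->lock_count
	# ... test the low 32 bits of a4 and branch to the spinlock
	#     slow path if the lock is held ...
	add	a4, a4, a2		# count++ in the high 32 bits
	sc.d	a5, a4, (a0)		# try to store the whole word back
	bnez	a5, 1b			# reservation lost, retry

The load, the increment and the conditional store all sit under one
reservation, so there is no separate cmpxchg loop - at the cost of
writing the whole thing per architecture, by hand, in assembly.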