Re: lockless case of retain_dentry() (was Re: [PATCH 09/15] fold the call of retain_dentry() into fast_dput())

On Tue, 21 Nov 2023 at 23:19, Guo Ren <guoren@xxxxxxxxxx> wrote:
>
> We discussed x86 qspinlock code generation. It looked not too bad as I
> thought because qspinlock_spin_value_unlocked is much cheaper than the
> ticket-lock. But the riscv ticket-lock code generation is terrible
> because of the shift left & right 16-bit.
> https://lore.kernel.org/all/ZNG2tHFOABSXGCVi@xxxxxxxxx

No, it's not the 16-bit shifts in the spin_value_unlocked() check;
that just generates simple and straightforward code:

  a0:   0107569b                srlw    a3,a4,0x10
  a4:   00c77733                and     a4,a4,a2
  a8:   04e69063                bne     a3,a4,e8 <.L12>

(plus two stupid instructions for generating the immediate in a2 for
0xffff, but hey, that's the usual insane RISC-V encoding thing - you
can load a 20-bit U-immediate only shifted up by 12, if it's in the
lower bits you're kind of screwed and limited to 12-bit immediates).

The *bad* code generation is from the much simpler

        new.count++;

which sadly neither gcc nor clang is quite smart enough to understand:
"hey, I can do that in 64 bits".

It's incrementing the higher 32-bit word in a 64-bit union, and with a
smarter compiler it *should* basically become

        lock_count += 1 << 32;

but the compiler isn't that clever, so it splits the 64-bit word into
two 32-bit words, increments one of them, and then merges the two
words back into 64 bits:

  98:   4207d693                sra     a3,a5,0x20
  9c:   02079713                sll     a4,a5,0x20
  a0:   0016869b                addw    a3,a3,1
  a4:   02069693                sll     a3,a3,0x20
  a8:   02075713                srl     a4,a4,0x20
  ac:   00d76733                or      a4,a4,a3

which is pretty sad.

If you want to do the optimization that the compiler misses by hand,
it would be something like the attached patch.

NOTE! Very untested. But that *should* cause the compiler to just
generate a single "add" instruction (in addition to generating the
constant 0x100000000, of course).

Of course, on a LL/SC architecture like RISC-V, in an *optimal* world,
the whole sequence would actually be done with one single LL/SC,
rather than the "load,add,cmpxchg" thing.

But then you'd have to do absolutely everything by hand in assembly.

                  Linus
 lib/lockref.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/lib/lockref.c b/lib/lockref.c
index 2afe4c5d8919..481b102a6476 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -26,6 +26,17 @@
 	}									\
 } while (0)
 
+/*
+ * The compiler isn't smart enough to do the count
+ * increment in the high 32 bits of the 64-bit value,
+ * so do this optimization by hand.
+ */
+#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64
+ #define LOCKREF_INC(n) ((n).lock_count += 1ul<<32)
+#else
+ #define LOCKREF_INC(n) ((n).count++)
+#endif
+
 #else
 
 #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0)
@@ -42,7 +53,7 @@
 void lockref_get(struct lockref *lockref)
 {
 	CMPXCHG_LOOP(
-		new.count++;
+		LOCKREF_INC(new);
 	,
 		return;
 	);
@@ -63,7 +74,7 @@ int lockref_get_not_zero(struct lockref *lockref)
 	int retval;
 
 	CMPXCHG_LOOP(
-		new.count++;
+		LOCKREF_INC(new);
 		if (old.count <= 0)
 			return 0;
 	,
@@ -174,7 +185,7 @@ int lockref_get_not_dead(struct lockref *lockref)
 	int retval;
 
 	CMPXCHG_LOOP(
-		new.count++;
+		LOCKREF_INC(new);
 		if (old.count < 0)
 			return 0;
 	,
