On Wed, 12 Jun 2024 at 17:12, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> While I did not try to figure out who transiently took the lock (it was
> something outside of the benchmark), I devised a trivial reproducer
> which triggers the problem almost every time: merely issue "ls" of the
> directory containing the tested file (in this case: "ls /tmp").

So I have no problem with your patch 2/2 - moving the lockref data
structure away from everything else that can be shared read-only makes
a ton of sense independently of anything else.

Except you also randomly increased a retry count in there, which makes
no sense.

But this patch 1/2 makes me go "Eww, hacky hacky".

We already *have* the retry loop, it's just that currently it only
covers the cmpxchg failures. The natural thing to do is to just make
the "wait for unlocked" be part of the same loop.

In fact, I have this memory of trying this originally, and it not
mattering and just making the code uglier, but that may be me confusing
myself. It's a *loong* time ago.

With the attached patch, lockref_get() (to pick one random case) ends
up looking like this:

        mov    (%rdi),%rax
        mov    $0x64,%ecx
  loop:
        test   %eax,%eax
        jne    locked
        mov    %rax,%rdx
        sar    $0x20,%rdx
        add    $0x1,%edx
        shl    $0x20,%rdx
        lock cmpxchg %rdx,(%rdi)
        jne    fail
        // SUCCESS
        ret
  locked:
        pause
        mov    (%rdi),%rax
  fail:
        sub    $0x1,%ecx
        jne    loop

(with the rest being the "take lock and go slow" case).

It seems much better to me to have *one* retry loop that handles both
causes of failure.

Entirely untested, I only looked at the generated code and it looked
reasonable. The patch may be entirely broken for some random reason I
didn't think of.

And in case you wonder, that 'lockref_locked()' macro I introduce is
purely to make the code more readable. Without it, that one conditional
line ends up being insanely long; the macro is there just to break
things up into slightly more manageable chunks.

Mind testing this approach instead?

               Linus
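To make the control flow easier to follow than in raw assembly, here is
a rough userspace C sketch of the same combined loop. It is purely
illustrative: lockref_get_fast() is a made-up name, the lockref is
modeled as a single 64-bit word (lock in the low half, count in the
high half), and C11 atomics stand in for the kernel's
try_cmpxchg64_relaxed() and cpu_relax().

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Returns false when the caller must fall back to the spinlock. */
  static bool lockref_get_fast(_Atomic uint64_t *lockref)
  {
          uint64_t old = atomic_load_explicit(lockref, memory_order_relaxed);
          int retry = 100;        /* the $0x64 loaded into %ecx above */

          do {
                  if ((uint32_t)old != 0) {
                          /* the "locked:" path: reload and go around
                           * (the kernel would cpu_relax() here) */
                          old = atomic_load_explicit(lockref,
                                                     memory_order_relaxed);
                          continue;
                  }
                  uint64_t new = old + (1ULL << 32);      /* count++ */
                  /* like try_cmpxchg, this reloads 'old' on failure */
                  if (atomic_compare_exchange_weak_explicit(lockref,
                                  &old, new,
                                  memory_order_relaxed,
                                  memory_order_relaxed))
                          return true;    /* SUCCESS */
          } while (--retry);

          return false;
  }

Note that the "continue" on the locked path still falls through to the
"while (--retry)" test, so a contended lock burns down the same retry
budget as cmpxchg failures do.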
 lib/lockref.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/lib/lockref.c b/lib/lockref.c
index 2afe4c5d8919..70f38621901b 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -4,6 +4,9 @@
 
 #if USE_CMPXCHG_LOCKREF
 
+#define lockref_locked(l) \
+        unlikely(!arch_spin_value_unlocked((l).lock.rlock.raw_lock))
+
 /*
  * Note that the "cmpxchg()" reloads the "old" value for the
  * failure case.
@@ -13,7 +16,12 @@
         struct lockref old; \
         BUILD_BUG_ON(sizeof(old) != 8); \
         old.lock_count = READ_ONCE(lockref->lock_count); \
-        while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \
+        do { \
+                if (lockref_locked(old)) { \
+                        cpu_relax(); \
+                        old.lock_count = READ_ONCE(lockref->lock_count); \
+                        continue; \
+                } \
                 struct lockref new = old; \
                 CODE \
                 if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \
@@ -21,9 +29,7 @@
                         new.lock_count))) { \
                         SUCCESS; \
                 } \
-                if (!--retry) \
-                        break; \
-        } \
+        } while (--retry); \
 } while (0)
 
 #else
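For context on how the macro body above gets instantiated, the callers
are untouched by this patch; lockref_get() in lib/lockref.c invokes it
along these lines, with CODE bumping the count and SUCCESS returning
early (that return is the "// SUCCESS" / "ret" pair in the assembly):

  void lockref_get(struct lockref *lockref)
  {
          CMPXCHG_LOOP(
                  new.count++;
          ,
                  return;
          );

          /* slow path: the cmpxchg loop gave up */
          spin_lock(&lockref->lock);
          lockref->count++;
          spin_unlock(&lockref->lock);
  }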