On Thu, 9 Nov 2023 at 20:20, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
> FWIW, on top of current #work.dcache2 the following delta might be worth
> looking into. Not sure if it's less confusing that way, though - I'd been
> staring at that place for too long. Code generation is slightly suboptimal
> with recent gcc, but only marginally so.

I doubt the pure ALU ops and a couple of extra conditional branches (that
_probably_ predict well) matter at all. Especially since this is all after
lockref_put_return() has done that locked cmpxchg, which *is* expensive.

My main reaction is that we use hlist_bl_unhashed() for d_unhashed(), and
we *intentionally* make it separate from the actual unhashing:

 - ___d_drop() does the __hlist_bl_del()

 - but d_unhashed() does hlist_bl_unhashed(), which checks
   d_hash.pprev == NULL, and that's done by __d_drop

We even have a comment about this:

	 * ___d_drop doesn't mark dentry as "unhashed"
	 * (dentry->d_hash.pprev will be LIST_POISON2, not NULL).

and we depend on this in __d_move(), which will unhash things temporarily,
but not mark things unhashed, because they get re-hashed again. Same goes
for __d_add().

Anyway, what I'm actually getting at in a roundabout way is that maybe we
should make D_UNHASHED be another flag in d_flags, and *not* use that
d_hash.pprev field, and that would allow us to combine even more of these
tests in dput(), because now pretty much *all* of those "retain_dentry()"
checks would be about d_flags bits.

Hmm? As it is, it has that odd combination of d_flags and that d_unhashed()
test, so it's testing two different fields.

Anyway, I really don't think it matters much, but since you brought up the
whole suboptimal code generation..

I tried to look at dput() code generation, and it doesn't look horrendous
as-is in your dcache2 branch.

If anything, the thing that hurts is the lockref_put_return() being
out-of-line even though this is basically the only caller, plus people have
pessimized the arch_spin_value_unlocked() implementation *again*, so that it
uses a volatile read, when the *WHOLE* *POINT* of that "VALUE" part of
"value_unlocked()" is that we've already read the value, and we should
*not* re-read it.

Damn.

The bug seems to affect both the generic qspinlock code and the
ticket-based one.

For the ticket-based ones, it's PeterZ and commit 1bce11126d57
("asm-generic: ticket-lock: New generic ticket-based spinlock"), which does

  static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
  {
        return !arch_spin_is_locked(&lock);
  }

where we've got that "lock" value, but then it takes the address of it, and
uses arch_spin_is_locked() on it, so now it will force a flush to memory,
and then a READ_ONCE() on it.

And for the qspinlock code, we had a similar

  static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
  {
        return !atomic_read(&lock.val);
  }

thing, where it does 'atomic_read()' on the value it was passed as an
argument.

Stupid, stupid. It's literally forcing a re-read of a value that is
guaranteed to be on the stack.

I know this worked at some point, but that may have been many years ago,
since I haven't looked at this part of lockref code generation in ages.

Anyway, as a result now all the lockref functions will do silly "store the
old lockref value to memory, in order to read it again" dances in that
CMPXCHG_LOOP() loop.

It literally makes that whole "is this an unlocked value" function
completely pointless.

The *whole* and only point was "look, I already loaded the value from
memory, is this *VALUE* unlocked".
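To make that concrete, here is a stand-alone sketch - not kernel code, the
type and function names below are made up purely for illustration - of the
difference between testing the copy you were handed and forcing a re-read
through its address:

  #include <stdint.h>

  typedef struct { uint32_t counter; } lockval_t;   /* illustrative only */

  /* Bad: takes the address of the by-value argument and does a volatile
   * read through it, so the compiler has to spill 'lock' to the stack and
   * load it back - the pointless dance described above. */
  static inline int value_unlocked_bad(lockval_t lock)
  {
          return !*(volatile uint32_t *)&lock.counter;
  }

  /* Good: just test the value we already have; it can stay in a register. */
  static inline int value_unlocked_good(lockval_t lock)
  {
          return !lock.counter;
  }

The second form is what the "value_unlocked()" helpers were supposed to be.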
Compared to that complete braindamage in the fast-path loop, the small extra
ALU ops in fast_dput() are nothing.

Peter - those functions are done exactly the wrong way around.
arch_spin_is_locked() should be implemented using arch_spin_value_unlocked(),
not the other way around.

And the queued spinlocks should not do an atomic_read() of the argument they
get, they should just do "!lock.val.counter".

So something like this should fix lockref. ENTIRELY UNTESTED, except now the
code generation of lockref_put_return() looks much better, without a
pointless flush to the stack, and now it has no pointless stack frame as a
result.

Of course, it should probably be inlined, since it has only one user (ok,
two, since fast_dput() gets used twice), and that should make the return
value testing much better.

              Linus
 include/asm-generic/qspinlock.h |  2 +-
 include/asm-generic/spinlock.h  | 17 +++++++++--------
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 995513fa2690..0655aa5b57b2 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -70,7 +70,7 @@ static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
  */
 static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
 {
-	return !atomic_read(&lock.val);
+	return !lock.val.counter;
 }
 
 /**
diff --git a/include/asm-generic/spinlock.h b/include/asm-generic/spinlock.h
index fdfebcb050f4..a35eda0ec2a2 100644
--- a/include/asm-generic/spinlock.h
+++ b/include/asm-generic/spinlock.h
@@ -68,11 +68,17 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 	smp_store_release(ptr, (u16)val + 1);
 }
 
+static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
+{
+	u32 val = lock.counter;
+	return ((val >> 16) == (val & 0xffff));
+}
+
 static __always_inline int arch_spin_is_locked(arch_spinlock_t *lock)
 {
-	u32 val = atomic_read(lock);
-
-	return ((val >> 16) != (val & 0xffff));
+	arch_spinlock_t val;
+	val.counter = atomic_read(lock);
+	return !arch_spin_value_unlocked(val);
 }
 
 static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
@@ -82,11 +88,6 @@ static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
 	return (s16)((val >> 16) - (val & 0xffff)) > 1;
 }
 
-static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
-{
-	return !arch_spin_is_locked(&lock);
-}
-
 #include <asm/qrwlock.h>
 
 #endif /* __ASM_GENERIC_SPINLOCK_H */
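For reference, the fast path that benefits from this is the CMPXCHG_LOOP()
in lib/lockref.c: load the 64-bit lock+count word once, test "unlocked" on
that in-register copy, and cmpxchg the whole word. Below is a stand-alone
approximation of that shape using C11 atomics - a sketch only, with made-up
names, not the kernel's actual lockref types or macro:

  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Toy stand-in for struct lockref: low 32 bits "lock", high 32 bits count. */
  struct toy_lockref {
          _Atomic uint64_t lock_count;
  };

  static inline int toy_value_unlocked(uint64_t v)
  {
          return (uint32_t)v == 0;        /* test the copy we already loaded */
  }

  /* Bump the count without taking the lock, as long as the lock half of the
   * already-loaded word says "unlocked".  Returns 1 on fast-path success. */
  static int toy_lockref_get(struct toy_lockref *ref)
  {
          uint64_t old = atomic_load_explicit(&ref->lock_count,
                                              memory_order_relaxed);

          while (toy_value_unlocked(old)) {
                  uint64_t new = old + (1ULL << 32);  /* count is the high half */

                  if (atomic_compare_exchange_weak_explicit(&ref->lock_count,
                                                            &old, new,
                                                            memory_order_relaxed,
                                                            memory_order_relaxed))
                          return 1;       /* cmpxchg won, no lock taken */
                  /* on failure 'old' was refreshed; re-check the lock half */
          }
          return 0;   /* "locked": a real implementation falls back to the spinlock */
  }

  int main(void)
  {
          struct toy_lockref ref = { .lock_count = 0 };

          printf("fast path %s, count = %llu\n",
                 toy_lockref_get(&ref) ? "taken" : "skipped",
                 (unsigned long long)(atomic_load(&ref.lock_count) >> 32));
          return 0;
  }

The point of the patch above is that toy_value_unlocked() gets handed a
value that is already in a register and just tests it there, instead of
spilling it to the stack and reloading it.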