On Thu, Nov 09, 2023 at 09:57:39PM -0800, Linus Torvalds wrote:
> On Thu, 9 Nov 2023 at 20:20, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > FWIW, on top of current #work.dcache2 the following delta might be worth
> > looking into.  Not sure if it's less confusing that way, though - I'd been
> > staring at that place for too long.  Code generation is slightly suboptimal
> > with recent gcc, but only marginally so.
>
> I doubt the pure ALU ops and a couple of extra conditional branches
> (that _probably_ predict well) matter at all.
>
> Especially since this is all after lockref_put_return() has done that
> locked cmpxchg, which *is* expensive.
>
> My main reaction is that we use hlist_bl_unhashed() for d_unhashed(),
> and we *intentionally* make it separate from the actual unhashing:
>
>  - ___d_drop() does the __hlist_bl_del()
>
>  - but d_unhashed() does hlist_bl_unhashed(), which checks
>    d_hash.pprev == NULL, and that's done by __d_drop
>
> We even have a comment about this:
>
>          * ___d_drop doesn't mark dentry as "unhashed"
>          * (dentry->d_hash.pprev will be LIST_POISON2, not NULL).
>
> and we depend on this in __d_move(), which will unhash things
> temporarily, but not mark things unhashed, because they get re-hashed
> again.  Same goes for __d_add().
>
> Anyway, what I'm actually getting at in a roundabout way is that maybe
> we should make D_UNHASHED be another flag in d_flags, and *not* use
> that d_hash.pprev field, and that would allow us to combine even more
> of these tests in dput(), because now pretty much *all* of those
> "retain_dentry()" checks would be about d_flags bits.
>
> Hmm?  As it is, it has that odd combination of d_flags and that
> d_unhashed() test, so it's testing two different fields.
>
> Anyway, I really don't think it matters much, but since you brought up
> the whole suboptimal code generation..
>
> I tried to look at dput() code generation, and it doesn't look
> horrendous as-is in your dcache2 branch.
>
> If anything, the thing that hurts is the lockref_put_return() being
> out-of-line even though this is basically the only caller, plus people
> have pessimized the arch_spin_value_unlocked() implementation *again*,
> so that it uses a volatile read, when the *WHOLE*POINT* of that
> "VALUE" part of "value_unlocked()" is that we've already read the
> value, and we should *not* re-read it.
>
> Damn.
>
> The bug seems to affect both the generic qspinlock code, and the
> ticket-based one.
>
> For the ticket based ones, it's PeterZ and commit 1bce11126d57
> ("asm-generic: ticket-lock: New generic ticket-based spinlock"), which
> does
>
>   static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
>   {
>         return !arch_spin_is_locked(&lock);
>   }
>
> where we've got that "lock" value, but then it takes the address of
> it, and uses arch_spin_is_locked() on it, so now it will force a flush
> to memory, and then a READ_ONCE() on it.
>
> And for the qspinlock code, we had a similar

We discussed the x86 qspinlock code generation. It is not as bad as I
thought, because queued_spin_value_unlocked() is much cheaper than the
ticket-lock version.  But the riscv ticket-lock code generation is
terrible because of the 16-bit shifts left & right:
https://lore.kernel.org/all/ZNG2tHFOABSXGCVi@xxxxxxxxx

>
>   static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
>   {
>         return !atomic_read(&lock.val);
>   }
>
> thing, where it does 'atomic_read()' on the value it was passed as an
> argument.
>
> Stupid, stupid.  It's literally forcing a re-read of a value that is
> guaranteed to be on stack.
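
To see the difference outside the kernel headers, here is a minimal
user-space sketch (the struct and function names are made up for
illustration; this is not kernel code).  The value form only needs ALU
ops on the copy it already has, while the address/volatile form
(standing in for atomic_read()/READ_ONCE()) forces the compiler to
spill that copy to memory and load it back:

#include <stdio.h>

struct fake_lock { int counter; };

/* value form: pure ALU test on the copy we were passed */
static inline int value_unlocked(struct fake_lock lock)
{
	return !lock.counter;
}

/* address form: the volatile read forces a memory access, so the
 * by-value argument has to be stored somewhere addressable first
 * and then re-read.
 */
static inline int addr_unlocked(struct fake_lock lock)
{
	return !*(volatile int *)&lock.counter;
}

int main(void)
{
	struct fake_lock l = { .counter = 0 };

	printf("%d %d\n", value_unlocked(l), addr_unlocked(l));
	return 0;
}

Built with -O2, the value form can stay entirely in registers, which is
exactly the property the lockref fast path wants.
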
>
> I know this worked at some point, but that may have been many years
> ago, since I haven't looked at this part of lockref code generation in
> ages.
>
> Anyway, as a result now all the lockref functions will do silly "store
> the old lockref value to memory, in order to read it again" dances in
> that CMPXCHG_LOOP() loop.
>
> It literally makes that whole "is this an unlocked value" function
> completely pointless.  The *whole* and only point was "look, I already
> loaded the value from memory, is this *VALUE* unlocked".
>
> Compared to that complete braindamage in the fast-path loop, the small
> extra ALU ops in fast_dput() are nothing.
>
> Peter - those functions are done exactly the wrong way around.
> arch_spin_is_locked() should be implemented using
> arch_spin_value_unlocked(), not this way around.
>
> And the queued spinlocks should not do an atomic_read() of the argument
> they get, they should just do "!lock.val.counter"
>
> So something like this should fix lockref.  ENTIRELY UNTESTED, except
> now the code generation of lockref_put_return() looks much better,
> without a pointless flush to the stack, and now it has no pointless
> stack frame as a result.
>
> Of course, it should probably be inlined, since it has only one user
> (ok, two, since fast_dput() gets used twice), and that should make the
> return value testing much better.
>
>                 Linus
>
>  include/asm-generic/qspinlock.h |  2 +-
>  include/asm-generic/spinlock.h  | 17 +++++++++--------
>  2 files changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
> index 995513fa2690..0655aa5b57b2 100644
> --- a/include/asm-generic/qspinlock.h
> +++ b/include/asm-generic/qspinlock.h
> @@ -70,7 +70,7 @@ static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>   */
>  static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
>  {
> -	return !atomic_read(&lock.val);
> +	return !lock.val.counter;
>  }
>
>  /**
> diff --git a/include/asm-generic/spinlock.h b/include/asm-generic/spinlock.h
> index fdfebcb050f4..a35eda0ec2a2 100644
> --- a/include/asm-generic/spinlock.h
> +++ b/include/asm-generic/spinlock.h
> @@ -68,11 +68,17 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
>  	smp_store_release(ptr, (u16)val + 1);
>  }
>
> +static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
> +{
> +	u32 val = lock.counter;
> +	return ((val >> 16) == (val & 0xffff));
> +}
> +
>  static __always_inline int arch_spin_is_locked(arch_spinlock_t *lock)
>  {
> -	u32 val = atomic_read(lock);
> -
> -	return ((val >> 16) != (val & 0xffff));
> +	arch_spinlock_t val;
> +	val.counter = atomic_read(lock);
> +	return !arch_spin_value_unlocked(val);
>  }
>
>  static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
> @@ -82,11 +88,6 @@ static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
>  	return (s16)((val >> 16) - (val & 0xffff)) > 1;
>  }
>
> -static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
> -{
> -	return !arch_spin_is_locked(&lock);
> -}
> -
>  #include <asm/qrwlock.h>
>
>  #endif /* __ASM_GENERIC_SPINLOCK_H */
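
For context, the reason this matters so much for lockref is that the
fast path in lib/lockref.c already holds the whole lockref (spinlock +
count) in a local 8-byte snapshot and only wants a pure value test on
it.  Roughly (a simplified sketch of the CMPXCHG_LOOP() pattern, not
the exact macro; the real code also has a retry limit):

	struct lockref old;

	/* one 8-byte load gets both the lock word and the count */
	old.lock_count = READ_ONCE(lockref->lock_count);
	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {
		struct lockref new = old;

		new.count++;		/* e.g. the lockref_get() case */
		if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,
						 &old.lock_count,
						 new.lock_count)))
			return;		/* fast path done, lock never taken */
		cpu_relax();
	}
	/* otherwise fall back to spin_lock(&lockref->lock) */

With the arch_spin_value_unlocked() change above, that "is the snapshot
unlocked" test becomes plain ALU ops on a value that is already in a
register, instead of bouncing "old" through the stack.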