Re: lockless case of retain_dentry() (was Re: [PATCH 09/15] fold the call of retain_dentry() into fast_dput())

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Fri, 1 Dec 2023 10:09:01 +0900

On Thu, 30 Nov 2023 at 19:01, Guo Ren <guoren@xxxxxxxxxx> wrote:
>
> That needs the expensive mechanism DynAMO [1], but some power-efficient
> cores lack the capability. Yes, powerful OoO hardware could virtually
> satisfy you with a minimum number of retries, but why couldn't we
> explicitly tell the hardware to "prefetchw"?

Because every single time we've had a prefetch in the kernel, it has
caused problems. A bit like cpu_relax() - these things get added for
random hardware where it helps, and then a few years later it turns
out that it hurts almost everywhere else.

We've had particular problems with 'prefetch' because it turns out
that (a) nobody sane uses them so (b) hardware is often buggy. And
here "buggy" may be just performance (ie "prefetch actually stalls on
TLB lookup" etc broken behavior that means that prefetch is not even
remotely like a no-op that just hints to the cache subsystem), but
sometimes it's even actual semantics (ie "prefetch causes spurious
faulting behavior").

> Advanced hardware would treat cmpxchg as an interconnect transaction on
> a cache miss (far atomic), which means the L3 cache wouldn't return a
> unique cacheline even when cmpxchg fails. The cmpxchg loop would continue
> to read data bypassing the L1/L2 cache, which means every failed cmpxchg
> is a cache-miss read.

Honestly, I wouldn't call that "advanced hardware". I would call that
ridiculous.

If the cmpxchg isn't guaranteed to make progress, then the cmpxchg is
broken. It's really that simple.

It does sound like on your hardware, maybe you just want to make the
RISC-V cmpxchg function always do a "prefetchw" if the 'sc.w' (or
'sc.d') fails, something like

                        "0:     lr.w %0, %2\n"                          \
                        "       bne  %0, %z3, 1f\n"                     \
                        "       sc.w %1, %z4, %2\n"                     \
-                       "       bnez %1, 0b\n"                          \
+                       "       beqz %1, 1f\n"                          \
+                       "       prefetchw %2\n"                         \
+                       "       j 0b\n"                                 \
                        "1:\n"                                          \

(quick entirely untested hack, you get the idea). A better
implementation might use "asm goto" and expose the different error
cases to the compiler so that it can move things around, but I'm not
convinced it's worth the effort.
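
For readers who don't speak RISC-V asm, the same control flow can be
sketched in portable C11. This is only an illustration, not kernel code:
cmpxchg32_sketch() is a made-up name, the weak compare-exchange stands in
for the lr/sc pair (it is allowed to fail spuriously, like 'sc'), and
__builtin_prefetch() with rw=1 (a GCC/Clang builtin) stands in for
"prefetchw". Whether the hint helps at all is entirely hardware dependent.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel API. Returns the value that was
 * observed at *addr; the store happened iff that value equals 'old'.
 */
static inline uint32_t cmpxchg32_sketch(_Atomic uint32_t *addr,
					uint32_t old, uint32_t new)
{
	for (;;) {
		uint32_t seen = old;

		/* Weak CAS: may fail even when *addr == old (the "sc" case). */
		if (atomic_compare_exchange_weak_explicit(addr, &seen, new,
							  memory_order_seq_cst,
							  memory_order_relaxed))
			return old;	/* store succeeded */

		if (seen != old)
			return seen;	/* compare failed: the "bne ... 1f" exit */

		/* Spurious failure: hint write intent, then retry ("j 0b"). */
		__builtin_prefetch((const void *)addr, 1, 3);
	}
}
```

The point of the shape is that the prefetch sits only on the failure
path, inside the arch-specific primitive, so callers never see it.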

But no, we're *not* adding a prefetchw to generic code just because
apparently some RISC-V code is doing bad things. You need to keep
workarounds for RISC-V behavior to RISC-V.

And yes, the current "retry count" in our lockref implementation comes
from another "some hardware does bad things for cmpxchg". But that
workaround at most causes a few extra (regular) ALU instructions, and
while not optimal, it's at least not going to cause any bigger
problems.
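
To make "a few extra ALU instructions" concrete, here is a rough
userspace sketch of a bounded cmpxchg loop in the spirit of lockref. It
is not the kernel implementation: the names, the single 64-bit word with
a "locked" bit, and the bound of 100 are assumptions for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_LOCKED	(1ULL << 63)	/* hypothetical "lock held" bit */
#define SKETCH_RETRIES	100		/* assumed retry bound */

/*
 * Try to bump the count without taking the lock. Returns true on
 * success; false means the caller should fall back to the locked path.
 */
static bool lockref_get_sketch(_Atomic uint64_t *lock_count)
{
	int retry = SKETCH_RETRIES;
	uint64_t old = atomic_load_explicit(lock_count, memory_order_relaxed);

	while (!(old & SKETCH_LOCKED)) {
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak_explicit(lock_count, &old,
							  old + 1,
							  memory_order_relaxed,
							  memory_order_relaxed))
			return true;

		/* The whole workaround: one decrement and one branch. */
		if (!--retry)
			break;
	}
	return false;
}
```

On hardware where cmpxchg behaves, the counter never gets anywhere near
zero and the extra cost is a decrement and a not-taken branch per retry.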

           Linus