Re: lockless case of retain_dentry() (was Re: [PATCH 09/15] fold the call of retain_dentry() into fast_dput())

On Sun, Nov 26, 2023 at 08:51:35AM -0800, Linus Torvalds wrote:
> On Sun, 26 Nov 2023 at 08:39, Guo Ren <guoren@xxxxxxxxxx> wrote:
> >
> > Here is my optimization advice:
> >
> > #define CMPXCHG_LOOP(CODE, SUCCESS) do {                                        \
> >         int retry = 100;                                                        \
> >         struct lockref old;                                                     \
> >         BUILD_BUG_ON(sizeof(old) != 8);                                         \
> > +       prefetchw(lockref);                                                     \
> 
> No.
> 
> We're not adding software prefetches to generic code. Been there, done
> that. They *never* improve performance on good hardware. They end up
> helping on some random (usually particularly bad) microarchitecture,
> and then they hurt everybody else.
> 
> And the real optimization advice is: "don't run on crap hardware".
> 
> It really is that simple. Good hardware does OoO and sees the future write.
That requires an expensive mechanism such as DynAMO [1], and some
power-efficient cores lack the capability. Yes, powerful OoO hardware
can keep the number of retries to a minimum, but why couldn't we
explicitly give the hardware that hint with "prefetchw"?
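
To make the hint concrete, here is a minimal sketch in kernel style (an
illustration of the pattern only, not the actual lib/lockref.c code;
the counter_inc() name is made up):

	#include <linux/prefetch.h>	/* prefetchw() */
	#include <linux/atomic.h>

	/*
	 * Ask for the cacheline in exclusive state before the initial
	 * load, so the load does not first fetch it shared and then
	 * have the cmpxchg upgrade (and possibly lose) it.
	 */
	static inline void counter_inc(atomic_t *v)
	{
		int old, new;

		prefetchw(v);			/* write-intent hint */
		old = atomic_read(v);		/* initial load */
		do {
			new = old + 1;		/* the "CODE" step */
		} while (!atomic_try_cmpxchg(v, &old, new));
	}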

Advanced hardware may treat a cmpxchg that misses the cache as an
interconnect transaction (a "far atomic"), which means the L3 cache
does not return a unique cacheline even when the cmpxchg fails. The
cmpxchg loop then keeps reading data past the L1/L2 caches, so every
failed cmpxchg is a cache-miss read. Because of the "new.count++"/CODE
data dependency, each cmpxchg request must wait for the previous one to
finish. That leaves a gap between cmpxchg requests, during which most
CPUs' cmpxchgs keep failing under heavy contention.

   cas: Compare-And-Swap

   L1&L2          L3 cache
 +------+       +-----------
 | CPU1 | wait  |
 | cas2 |------>| CPU1_cas1 --+
 +------+       |             |
 +------+       |             |
 | CPU2 | wait  |             |
 | cas2 |------>| CPU2_cas1 --+--> If CPU1_cas1, CPU2_cas1 and
 +------+       |             |    CPU3_cas1 are queued together,
 +------+       |             |    most CPUs will fail and retry.
 | CPU3 | wait  |             |
 | cas2 |------>| CPU3_cas1 --+
 +------+       +-----------

The entire system makes forward progress, but inefficiently:
 - A large number of wasted CPU->L3 read requests
 - High power consumption
 - Poor performance

However, the "far atomic" is a good fit when contention is not
particularly heavy, so it is reasonable to let software give the
hardware a hint. That hint is "prefetchw":
 - The prefetchw prepares for a "load + cmpxchg" loop.
 - The prefetchw is not for a single AMO, CAS, or store (see the
   contrasting sketch below).
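
For contrast (again only a sketch, with a made-up counter_inc_amo()
name), a single AMO has no local retry loop and can complete "far", at
the L3/interconnect, so a prefetchw in front of it would only drag the
line into L1 and defeat the far atomic:

	/* Single AMO: no local load + retry loop, so no prefetchw. */
	static inline void counter_inc_amo(atomic_t *v)
	{
		atomic_add(1, v);	/* may complete as a far atomic */
	}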

[1] https://dl.acm.org/doi/10.1145/3579371.3589065

> 
> > Micro-arch could give prefetchw more guarantee:
> 
> Well, in practice, they never do, and in fact they are often buggy and
> cause problems because they weren't actually tested very much.
> 
>                  Linus
> 



