On Sun, Nov 26, 2023 at 08:51:35AM -0800, Linus Torvalds wrote:
> On Sun, 26 Nov 2023 at 08:39, Guo Ren <guoren@xxxxxxxxxx> wrote:
> >
> > Here is my optimization advice:
> >
> > #define CMPXCHG_LOOP(CODE, SUCCESS) do {                        \
> >         int retry = 100;                                        \
> >         struct lockref old;                                     \
> >         BUILD_BUG_ON(sizeof(old) != 8);                         \
> > +       prefetchw(lockref);                                     \\
>
> No.
>
> We're not adding software prefetches to generic code. Been there, done
> that. They *never* improve performance on good hardware. They end up
> helping on some random (usually particularly bad) microarchitecture,
> and then they hurt everybody else.
>
> And the real optimization advice is: "don't run on crap hardware".
>
> It really is that simple. Good hardware does OoO and sees the future write.

That needs an expensive mechanism such as DynAMO [1], and some
power-efficient cores lack that capability. Yes, powerful OoO hardware
could practically satisfy you with a minimal number of retries, but why
couldn't we explicitly tell the hardware with "prefetchw"?

Advanced hardware treats a cmpxchg that misses in the cache as an
interconnect transaction (a far atomic), which means the L3 cache won't
return a unique cacheline even when the cmpxchg fails. The cmpxchg loop
then keeps reading data that bypasses the L1/L2 cache, so every failed
cmpxchg is a cache-miss read. Because of the "new.count++"/CODE data
dependency, each subsequent cmpxchg request must wait for the previous
one to finish. This leaves a gap between cmpxchg requests, which causes
most CPUs' cmpxchgs to keep failing under serious contention.

   cas: Compare-And-Swap

   L1&L2          L3 cache
 +------+       +-----------
 | CPU1 | wait  |
 | cas2 |------>| CPU1_cas1 --+
 +------+       |             |
 +------+       |             |
 | CPU2 | wait  |             |
 | cas2 |------>| CPU2_cas1 --+--> If queued as CPU1_cas1, CPU2_cas1,
 +------+       |             |    CPU3_cas1, most of the CPUs would
 +------+       |             |    fail and retry.
 | CPU3 | wait  |             |
 | cas2 |------>| CPU3_cas1---+
 +------+       +----------

The entire system moves forward inefficiently:
 - A large number of invalid read requests CPU->L3
 - High power consumption
 - Poor performance

However, a "far atomic" suits scenarios where contention is not
particularly serious, so it is reasonable to let software give the
hardware a hint. That hint is "prefetchw":
 - The prefetchw is preparation for a "load + cmpxchg loop."
 - The prefetchw is not for a single AMO, CAS, or store.

[1] https://dl.acm.org/doi/10.1145/3579371.3589065

>
> > Micro-arch could give prefetchw more guarantee:
>
> Well, in practice, they never do, and in fact they are often buggy and
> cause problems because they weren't actually tested very much.
>
>                  Linus
>
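
For anyone who wants to see the shape of the "prefetchw + load + cmpxchg loop" pattern discussed above, here is a minimal userspace sketch in C11. It is not the kernel's lockref code. The packed 64-bit layout and the sketch_* names are hypothetical stand-ins for the 8-byte struct lockref, and __builtin_prefetch(addr, 1) (a GCC/Clang builtin) is assumed as the nearest userspace equivalent of prefetchw(). Whether the hint helps or hurts depends on the microarchitecture, which is exactly the point in dispute.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the 8-byte struct lockref:
 * lock word in the low 32 bits, reference count in the high 32 bits. */
static_assert(sizeof(uint64_t) == 8, "sketch assumes an 8-byte lockref");

static inline uint32_t sketch_lock(uint64_t v)  { return (uint32_t)v; }
static inline uint32_t sketch_count(uint64_t v) { return (uint32_t)(v >> 32); }
static inline uint64_t sketch_pack(uint32_t lock, uint32_t count)
{
	return ((uint64_t)count << 32) | lock;
}

/* Try to increment the count without taking the lock.
 * Returns true on success; false means the caller should fall back
 * to the locked slow path (lock held, or retries exhausted). */
static bool sketch_lockref_get(_Atomic uint64_t *ref)
{
	int retry = 100;
	uint64_t old;

	/* Hint that the cacheline will be written soon (rw = 1).
	 * This is the proposed "prefetchw" step; it is only a hint. */
	__builtin_prefetch((const void *)ref, 1);

	old = atomic_load_explicit(ref, memory_order_relaxed);
	while (sketch_lock(old) == 0) {
		/* The "new.count++" step: depends on the value just read. */
		uint64_t next = sketch_pack(0, sketch_count(old) + 1);

		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_strong_explicit(
			    ref, &old, next,
			    memory_order_relaxed, memory_order_relaxed))
			return true;
		if (--retry == 0)
			break;
	}
	return false;
}
```

The data dependency mentioned above is visible in the loop body: each new compare-and-swap cannot be issued until the previous one has returned the refreshed value of old.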