On Thu, 30 Nov 2023 at 19:01, Guo Ren <guoren@xxxxxxxxxx> wrote: > > That needs the expensive mechanism DynAMO [1], but some power-efficient > core lacks the capability. Yes, powerful OoO hardware could virtually > satisfy you by a minimum number of retries, but why couldn't we > explicitly tell hardware for "prefetchw"? Because every single time we've had a prefetch in the kernel, it has caused problems. A bit like cpu_relax() - these things get added for random hardware where it helps, and then a few years later it turns out that it hurts almost everywhere else. We've had particular problems with 'prefetch' because it turns out that (a) nobody sane uses them so (b) hardware is often buggy. And here "buggy" may be just performance (ie "prefetch actually stalls on TLB lookup" etc broken behavior that means that prefetch is not even remotely like a no-op that just hints to the cache subsystem), but sometimes even in actual semantics (ie "prefetch causes spurious faulting behavior") > Advanced hardware would treat cmpxchg as interconnect transactions when > cache miss(far atomic), which means L3 cache wouldn't return a unique > cacheline even when cmpxchg fails. The cmpxchg loop would continue to > read data bypassing the L1/L2 cache, which means every failure cmpxchg > is a cache-miss read. Honestly, I wouldn't call that "advanced hardware". I would call that ridiculous. If the cmpxchg isn't guaranteed to make progress, then the cmpxchg is broken. It's really that simple. It does sound like on your hardware, maybe you just want to make the RISC-V cmpxchg function always do a "prefetchw" if the 'sc.d' fails, something like "0: lr.w %0, %2\n" \ " bne %0, %z3, 1f\n" \ " sc.w %1, %z4, %2\n" \ - " bnez %1, 0b\n" \ + " beqz %1, 1f\n" \ + " prefetchw %2\n" \ + " j 0b\n" \ "1:\n" \ (quick entirely untested hack, you get the idea). A better implementation might use "asm goto" and expose the different error cases to the compiler so that it can move things around, but I'm not convinced it's worth the effort. But no, we're *not* adding a prefetchw to generic code just because apparently some RISC-V code is doing bad things. You need to keep workarounds for RISC-V behavior to RISC-V. And yes, the current "retry count" in our lockref implementation comes from another "some hardware does bad things for cmpxchg". But that workaround at most causes a few extra (regular) ALU instructions, and while not optimal, it's at least not going to cause any bigger problems. Linus