On Fri, Dec 01, 2023 at 10:09:01AM +0900, Linus Torvalds wrote:
> On Thu, 30 Nov 2023 at 19:01, Guo Ren <guoren@xxxxxxxxxx> wrote:
> >
> > That needs the expensive mechanism DynAMO [1], but some power-efficient
> > cores lack that capability. Yes, powerful OoO hardware could virtually
> > satisfy you with a minimal number of retries, but why couldn't we
> > explicitly tell the hardware via "prefetchw"?
>
> Because every single time we've had a prefetch in the kernel, it has
> caused problems. A bit like cpu_relax() - these things get added for
> random hardware where it helps, and then a few years later it turns
> out that it hurts almost everywhere else.
>
> We've had particular problems with 'prefetch' because it turns out
> that (a) nobody sane uses them so (b) hardware is often buggy. And
> here "buggy" may be just performance (ie "prefetch actually stalls on
> TLB lookup" etc broken behavior that means that prefetch is not even
> remotely like a no-op that just hints to the cache subsystem), but
> sometimes even in actual semantics (ie "prefetch causes spurious
> faulting behavior")

Thanks for sharing your experience; now I understand the problem with a
generic prefetchw. But what should we do with this existing code?

➜  linux git:(master) ✗ grep prefetchw mm/ fs/ kernel/ -r
mm/slub.c:	prefetchw(object + s->offset);
mm/slab.c:	prefetchw(objp);
mm/page_alloc.c:	prefetchw(p);
mm/page_alloc.c:	prefetchw(p + 1);
mm/vmscan.c:#define prefetchw_prev_lru_folio(_folio, _base, _field)
mm/vmscan.c:		prefetchw(&prev->_field);
mm/vmscan.c:#define prefetchw_prev_lru_folio(_folio, _base, _field) do
mm/vmscan.c:	prefetchw_prev_lru_folio(folio, src, flags);
fs/mpage.c:	prefetchw(&folio->flags);
fs/f2fs/data.c:	prefetchw(&page->flags);
fs/ext4/readpage.c:	prefetchw(&folio->flags);
kernel/bpf/cpumap.c:	prefetchw(page);
kernel/locking/qspinlock.c:	prefetchw(next);
➜  linux git:(master) ✗ grep prefetchw drivers/ -r | wc -l
80
...
> > Advanced hardware would treat cmpxchg as an interconnect transaction
> > on a cache miss (far atomic), which means the L3 cache wouldn't
> > return a unique cacheline even when the cmpxchg fails. The cmpxchg
> > loop would continue to read data bypassing the L1/L2 cache, which
> > means every failed cmpxchg is a cache-miss read.
>
> Honestly, I wouldn't call that "advanced hardware". I would call that
> ridiculous.

Ridiculous hardware: when CAS fails, the hardware stays in "far atomic"
mode.
Correct hardware: when CAS fails, the hardware should switch to "near
atomic" mode, which means acquiring an exclusive cacheline and making
progress.

> If the cmpxchg isn't guaranteed to make progress, then the cmpxchg is
> broken. It's really that simple.

I totally agree, and that is the correct guidance. Thx.

> It does sound like on your hardware, maybe you just want to make the
> RISC-V cmpxchg function always do a "prefetchw" if the 'sc.d' fails,
> something like
>
>        "0:     lr.w %0, %2\n"          \
>        "       bne  %0, %z3, 1f\n"     \
>        "       sc.w %1, %z4, %2\n"     \
> -      "       bnez %1, 0b\n"          \
> +      "       beqz %1, 1f\n"          \
> +      "       prefetchw %2\n"         \
> +      "       j 0b\n"                 \
>        "1:\n"                          \

I modified your code to also guarantee progress in the
comparison-failure case:

Final version (for easy reading):

       "0:     lr.w %0, %2\n"          \
       "       bne  %0, %z3, 2f\n"     \
       "       sc.w %1, %z4, %2\n"     \
       "       beqz %1, 1f\n"          \
       "       prefetchw %2\n"         \
       "       j 0b\n"                 \
       "2:\n"                          \
       "       prefetchw %2\n"         \
       "1:\n"                          \

Diff version:

       "0:     lr.w %0, %2\n"          \
-      "       bne  %0, %z3, 1f\n"     \
+      "       bne  %0, %z3, 2f\n"     \
       "       sc.w %1, %z4, %2\n"     \
-      "       bnez %1, 0b\n"          \
+      "       beqz %1, 1f\n"          \
+      "       prefetchw %2\n"         \
+      "       j 0b\n"                 \
+      "2:\n"                          \
+      "       prefetchw %2\n"         \
       "1:\n"                          \

> (quick entirely untested hack, you get the idea). A better
> implementation might use "asm goto" and expose the different error
> cases to the compiler so that it can move things around, but I'm not
> convinced it's worth the effort.
>
> But no, we're *not* adding a prefetchw to generic code just because
> apparently some RISC-V code is doing bad things.
> You need to keep workarounds for RISC-V behavior to RISC-V.
>
> And yes, the current "retry count" in our lockref implementation comes
> from another "some hardware does bad things for cmpxchg". But that
> workaround at most causes a few extra (regular) ALU instructions, and
> while not optimal, it's at least not going to cause any bigger
> problems.
>
>              Linus
>