On Fri, Aug 4, 2023 at 2:24 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, Aug 04, 2023 at 10:17:35AM -0400, Guo Ren wrote:
>
> > > See, this is where the ARM64 WFE would come in handy; I don't suppose
> > > RISC-V has anything like that?
> > Em... arm64 smp_cond_load only could save power consumption or release
> > the pipeline resources of an SMT processor. When (Node1 cpu64) is in
> > the WFE state, it still needs (Node0 cpu1) to write the value to give
> > a cross-NUMA signal. So I didn't see what WFE related to reducing
> > cross-Numa transactions, or I missed something. Sorry
>
> The benefit is that WFE significantly reduces the memory traffic. Since
> it 'suspends' the core and waits for a write-notification instead of
> busy polling the memory location you get a ton less loads.

Em... I had a different observation: when a long lock queue built up
because of a store-buffer delay problem in the lock torture test, we
observed all the interconnects go quiet, with no more memory traffic.
All the cores were loop-loading "different" cachelines from their own
L1 caches, which is how queued_spinlock behaves. So I don't see any
memory traffic on the bus.

For LL + WFE, AFAIK, LL is a load instruction that grabs the cacheline
from the bus into the L1 cache and sets the reservation set (Arm may
call it the exclusive monitor). If any cacheline invalidation request
(ReadUnique/CleanUnique/...) comes in, WFE retires and the reservation
set is cleared. So from a cacheline perspective, there is no difference
between "LL+WFE" and "looping loads."

Let's look at two scenarios for LL+WFE: multiple cores, and multiple
threads on one core:

 - In the multi-core case, WFE doesn't give any more benefit than loop
   loading, from my perspective. The only thing WFE can do is "suspend
   the core" (I borrowed your word here), but it can't be a deep sleep,
   because a fast response from WFE is the top priority. As you said,
   we should prevent "terribly contended" situations, so WFE must keep
   a fast reaction in the pipeline, not deep sleep. That's WFI stuff.
   Loop loading can also reduce power consumption with a proper
   micro-arch design: when the pipeline gets into a loop-loading state,
   the loop buffer mechanism starts, no instruction fetch happens, the
   frontend can suspend for a while, and the only working components
   are the "loop buffer" and the "LSU load path." Other components can
   suspend for a while. So loop loading is not as terrible as you
   thought.

 - In the case of multi-threading on one core, introducing an ISA
   instruction (WFE) to solve the loop-loading problem is worthwhile,
   because the thread can release the processor's pipeline resources.

-- 
Best Regards
 Guo Ren
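
For illustration only, here is a minimal user-space sketch of the
"each waiter loop-loads its own cacheline" behaviour described above,
written as an MCS-style queue lock with C11 atomics. It is not the
kernel's queued_spinlock code. Every name in it (mcs_node, mcs_lock,
mcs_acquire, mcs_release, run_contended, the 64-byte CACHELINE) is an
assumption made for the example, and the sched_yield() calls are a
user-space courtesy that a kernel spin loop would not have.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHELINE   64  /* assumed cacheline size */
#define MAX_THREADS 64

/* One node per waiter, aligned so no two waiters share a cacheline. */
struct mcs_node {
    _Alignas(CACHELINE) _Atomic(struct mcs_node *) next;
    atomic_bool locked;
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail;
};

static void mcs_acquire(struct mcs_lock *lock, struct mcs_node *me)
{
    atomic_store_explicit(&me->next, (struct mcs_node *)NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);

    /* The only access to the shared lock word: one swap to join the queue. */
    struct mcs_node *prev =
        atomic_exchange_explicit(&lock->tail, me, memory_order_acq_rel);
    if (prev == NULL)
        return; /* queue was empty, lock taken */
    atomic_store_explicit(&prev->next, me, memory_order_release);

    /* "Looping loads": poll our own node only. The line stays in the local
     * L1 until the predecessor's hand-off store invalidates it. */
    unsigned spins = 0;
    while (atomic_load_explicit(&me->locked, memory_order_acquire)) {
        if (++spins % 1024 == 0)
            sched_yield(); /* user-space only: tolerate oversubscription */
    }
}

static void mcs_release(struct mcs_lock *lock, struct mcs_node *me)
{
    struct mcs_node *next = atomic_load_explicit(&me->next, memory_order_acquire);
    if (next == NULL) {
        struct mcs_node *expected = me;
        /* No successor visible: try to mark the queue empty. */
        if (atomic_compare_exchange_strong_explicit(&lock->tail, &expected,
                (struct mcs_node *)NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor swapped the tail but has not linked itself yet. */
        while ((next = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL)
            sched_yield();
    }
    /* Hand-off: a single store to the successor's private cacheline. */
    atomic_store_explicit(&next->locked, false, memory_order_release);
}

struct worker_arg {
    struct mcs_lock *lock;
    struct mcs_node node;
    long *counter; /* plain variable, protected by the lock */
    int iters;
};

static void *worker(void *p)
{
    struct worker_arg *a = p;
    for (int i = 0; i < a->iters; i++) {
        mcs_acquire(a->lock, &a->node);
        (*a->counter)++;
        mcs_release(a->lock, &a->node);
    }
    return NULL;
}

/* Run nthreads workers, each taking the lock iters times; return the count. */
long run_contended(int nthreads, int iters)
{
    struct worker_arg args[MAX_THREADS];
    pthread_t tid[MAX_THREADS];
    struct mcs_lock lock;
    long counter = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    atomic_init(&lock.tail, NULL);
    for (int i = 0; i < nthreads; i++) {
        args[i].lock = &lock;
        args[i].counter = &counter;
        args[i].iters = iters;
        atomic_init(&args[i].node.next, NULL);
        atomic_init(&args[i].node.locked, false);
        int rc = pthread_create(&tid[i], NULL, worker, &args[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return counter;
}
```

With a correct lock, run_contended(4, 1000) returns 4000. While a
thread waits in the polling loop of mcs_acquire(), its loads target a
line that no other waiter reads, and the only write to that line is
the single hand-off store in mcs_release().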