Ankur Arora <ankur.a.arora@xxxxxxxxxx> writes:

> Marc Zyngier <maz@xxxxxxxxxx> writes:
>
>> On Wed, 16 Oct 2024 22:55:09 +0100,
>> Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>>>
>>>
>>> Marc Zyngier <maz@xxxxxxxxxx> writes:
>>>
>>> > On Thu, 26 Sep 2024 00:24:14 +0100,
>>> > Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>>> >>
>>> >> This patchset enables the cpuidle-haltpoll driver and its namesake
>>> >> governor on arm64. This is particularly interesting for KVM guests,
>>> >> as it reduces IPC latencies.
>>> >>
>>> >> Comparing idle switching latencies on an arm64 KVM guest with
>>> >> perf bench sched pipe:
>>> >>
>>> >>                                          usecs/op     %stdev
>>> >>
>>> >>    no haltpoll (baseline)                   13.48     +-  5.19%
>>> >>    with haltpoll                             6.84     +- 22.07%
>>> >>
>>> >>
>>> >> No change in performance for a similar test on x86:
>>> >>
>>> >>                                          usecs/op     %stdev
>>> >>
>>> >>    haltpoll w/ cpu_relax() (baseline)        4.75     +-  1.76%
>>> >>    haltpoll w/ smp_cond_load_relaxed()       4.78     +-  2.31%
>>> >>
>>> >> Both sets of tests were on otherwise idle systems with guest VCPUs
>>> >> pinned to specific PCPUs. One reason for the higher stdev on arm64
>>> >> is that trapping of the WFE instruction by the host KVM is contingent
>>> >> on the number of tasks on the runqueue.
>>> >
>>> > Sorry to state the obvious, but if the variable trapping of
>>> > WFI/WFE is the cause of your trouble, why don't you simply turn it off
>>> > (see 0b5afe05377d for the details)? Given that you pin your vcpus to
>>> > physical CPUs, there is no need for any trapping.
>>>
>>> Good point. Thanks. That should help reduce the guessing games around
>>> the variance in these tests.
>>
>> I'd be interested to find out whether there is still some benefit in
>> this series once you disable the WFx trapping heuristics.
>
> The benefit of polling in idle is more than just avoiding the cost of
> trapping and re-entering. The other benefit is that remote wakeups
> can now be done just by setting need-resched, instead of sending an
> IPI, and incurring the cost of handling the interrupt on the receiver
> side.
>
> But let me get you some numbers with that.

So, I ran the sched-pipe test with the two processes on VCPUs 4 and 5,
and with kvm-arm.wfi_trap_policy=notrap.

  # perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
        perf bench sched pipe -l 1000000 -c 4

  # No haltpoll (and, no TIF_POLLING_NRFLAG):

  Performance counter stats for 'CPU(s) 4,5' (5 runs):

          25,229.57 msec task-clock                         #    2.000 CPUs utilized    ( +-  7.75% )
     45,821,250,284      cycles                             #    1.816 GHz              ( +- 10.07% )
     26,557,496,665      instructions                       #    0.58  insn per cycle   ( +-  0.21% )
                  0      sched:sched_wake_idle_without_ipi  #    0.000 /sec

             12.615 +- 0.977 seconds time elapsed  ( +-  7.75% )

  # Haltpoll:

  Performance counter stats for 'CPU(s) 4,5' (5 runs):

          15,131.58 msec task-clock                         #    2.000 CPUs utilized    ( +- 10.00% )
     34,158,188,839      cycles                             #    2.257 GHz              ( +-  6.91% )
     20,824,950,916      instructions                       #    0.61  insn per cycle   ( +-  0.09% )
          1,983,822      sched:sched_wake_idle_without_ipi  #  131.105 K/sec            ( +-  0.78% )

              7.566 +- 0.756 seconds time elapsed  ( +- 10.00% )

We get a decent boost just because we are executing ~20% fewer
instructions. I'm not sure how CPU frequency scaling works in a VM, but
we also run at a higher frequency.

--
ankur
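
P.S. To make the "set need-resched instead of sending an IPI" point
above a bit more concrete, here is a rough, simplified sketch of the
two sides of that handshake. poll_idle_sketch() and
remote_wake_sketch() are made-up names for illustration;
TIF_POLLING_NRFLAG, smp_cond_load_relaxed(), set_tsk_need_resched()
and smp_send_reschedule() are the real kernel symbols, and the real
waker path does the flag check atomically in set_nr_if_polling()
rather than with the naive test below.

  #include <linux/sched.h>
  #include <linux/thread_info.h>
  #include <linux/smp.h>
  #include <asm/barrier.h>

  /* Idle side: advertise that we are polling, then wait on the flags word. */
  static void poll_idle_sketch(void)
  {
          unsigned long *flagsp = &current_thread_info()->flags;

          set_bit(TIF_POLLING_NRFLAG, flagsp);

          /*
           * Wait for TIF_NEED_RESCHED to show up in the flags word.
           * On x86 this amounts to a cpu_relax() loop; on arm64
           * smp_cond_load_relaxed() can wait in WFE via the exclusive
           * monitor instead of spinning hard.
           */
          smp_cond_load_relaxed(flagsp, VAL & _TIF_NEED_RESCHED);

          clear_bit(TIF_POLLING_NRFLAG, flagsp);
  }

  /*
   * Waker side: the store to the flags word is the wakeup; only send
   * the reschedule IPI if the target did not advertise polling.
   */
  static void remote_wake_sketch(struct task_struct *idle, int cpu)
  {
          set_tsk_need_resched(idle);

          if (!test_bit(TIF_POLLING_NRFLAG, &task_thread_info(idle)->flags))
                  smp_send_reschedule(cpu);
  }

The sched:sched_wake_idle_without_ipi count in the haltpoll run above
is, roughly, the number of times the waker was able to take that
no-IPI branch.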