Okanovic, Haris <harisokn@xxxxxxxxxx> writes: > On Thu, 2024-02-15 at 09:41 +0200, Mihai Carabas wrote: >> cpu_relax on ARM64 does a simple "yield". Thus we replace it with >> smp_cond_load_relaxed which basically does a "wfe". >> >> Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx> >> Signed-off-by: Mihai Carabas <mihai.carabas@xxxxxxxxxx> >> --- >> drivers/cpuidle/poll_state.c | 15 ++++++++++----- >> 1 file changed, 10 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c >> index 9b6d90a72601..1e45be906e72 100644 >> --- a/drivers/cpuidle/poll_state.c >> +++ b/drivers/cpuidle/poll_state.c >> @@ -13,6 +13,7 @@ >> static int __cpuidle poll_idle(struct cpuidle_device *dev, >> struct cpuidle_driver *drv, int index) >> { >> + unsigned long ret; >> u64 time_start; >> >> time_start = local_clock_noinstr(); >> @@ -26,12 +27,16 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, >> >> limit = cpuidle_poll_time(drv, dev); >> >> - while (!need_resched()) { >> - cpu_relax(); >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT) >> - continue; >> - >> + for (;;) { >> loop_count = 0; >> + >> + ret = smp_cond_load_relaxed(¤t_thread_info()->flags, >> + VAL & _TIF_NEED_RESCHED || >> + loop_count++ >= POLL_IDLE_RELAX_COUNT); > > Is it necessary to repeat this 200 times with a wfe poll? The POLL_IDLE_RELAX_COUNT is there because on x86 each cpu_relax() iteration is much shorter. With WFE, it makes less sense. > Does kvm not implement a timeout period? Not yet, but it does become more useful after a WFE haltpoll is available on ARM64. Haltpoll does have a timeout, which you should be able to tune via /sys/module/haltpoll/parameters/ but that, of course, won't help here. > Could you make it configurable? This patch improves certain workloads > on AWS Graviton instances as well, but blocks up to 6ms in 200 * 30us > increments before going to wfi, which is a bit excessive. Yeah, this looks like a problem. We could solve it by making it an architectural parameter. Though I worry about ARM platforms with much smaller default timeouts. The other possibility is using WFET in the primitive, but then we have that dependency and that's a bigger change. Will address this in the next version. Thanks for pointing this out. -- ankur