On Mon, 2024-04-08 at 11:46 -0700, Ankur Arora wrote:
> Okanovic, Haris <harisokn@xxxxxxxxxx> writes:
>
> > On Fri, 2024-04-05 at 16:14 -0700, Ankur Arora wrote:
> > > Okanovic, Haris <harisokn@xxxxxxxxxx> writes:
> > >
> > > > On Thu, 2024-02-15 at 09:41 +0200, Mihai Carabas wrote:
> > > > > cpu_relax on ARM64 does a simple "yield". Thus we replace it with
> > > > > smp_cond_load_relaxed which basically does a "wfe".
> > > > >
> > > > > Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > > > > Signed-off-by: Mihai Carabas <mihai.carabas@xxxxxxxxxx>
> > > > > ---
> > > > >  drivers/cpuidle/poll_state.c | 15 ++++++++++-----
> > > > >  1 file changed, 10 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> > > > > index 9b6d90a72601..1e45be906e72 100644
> > > > > --- a/drivers/cpuidle/poll_state.c
> > > > > +++ b/drivers/cpuidle/poll_state.c
> > > > > @@ -13,6 +13,7 @@
> > > > >  static int __cpuidle poll_idle(struct cpuidle_device *dev,
> > > > >                                struct cpuidle_driver *drv, int index)
> > > > >  {
> > > > > +       unsigned long ret;
> > > > >         u64 time_start;
> > > > >
> > > > >         time_start = local_clock_noinstr();
> > > > > @@ -26,12 +27,16 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
> > > > >
> > > > >                 limit = cpuidle_poll_time(drv, dev);
> > > > >
> > > > > -               while (!need_resched()) {
> > > > > -                       cpu_relax();
> > > > > -                       if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> > > > > -                               continue;
> > > > > -
> > > > > +               for (;;) {
> > > > >                         loop_count = 0;
> > > > > +
> > > > > +                       ret = smp_cond_load_relaxed(&current_thread_info()->flags,
> > > > > +                                       VAL & _TIF_NEED_RESCHED ||
> > > > > +                                       loop_count++ >= POLL_IDLE_RELAX_COUNT);
> > > >
> > > > Is it necessary to repeat this 200 times with a wfe poll?
> > >
> > > The POLL_IDLE_RELAX_COUNT is there because on x86 each cpu_relax()
> > > iteration is much shorter.
> > >
> > > With WFE, it makes less sense.
> > >
> > > > Does kvm not implement a timeout period?
> > >
> > > Not yet, but it does become more useful after a WFE haltpoll is
> > > available on ARM64.
> >
> > Note that kvm conditionally traps WFE and WFI based on number of host
> > CPU tasks. VMs will sometimes see hardware behavior - potentially
> > polling for a long time before entering WFI.
> >
> > https://elixir.bootlin.com/linux/latest/source/arch/arm64/kvm/arm.c#L459
>
> Yeah. There was a discussion on this
> https://lore.kernel.org/lkml/871qc6qufy.fsf@xxxxxxxxxx/.
>
> > > Haltpoll does have a timeout, which you should be able to tune via
> > > /sys/module/haltpoll/parameters/ but that, of course, won't help here.
> > >
> > > > Could you make it configurable? This patch improves certain workloads
> > > > on AWS Graviton instances as well, but blocks up to 6ms in 200 * 30us
> > > > increments before going to wfi, which is a bit excessive.
> > >
> > > Yeah, this looks like a problem. We could solve it by making it an
> > > architectural parameter. Though I worry about ARM platforms with
> > > much smaller default timeouts.
> > > The other possibility is using WFET in the primitive, but then we
> > > have that dependency and that's a bigger change.
> >
> > See arm64's delay() for inspiration:
> > https://elixir.bootlin.com/linux/v6.9-rc2/source/arch/arm64/lib/delay.c#L26
>
> Sure, that part is straight-forward enough. However, this will need a
> fallback for the case when WFET is not available. And, because this
> path is also used on x86, we need a cross-platform smp_cond*timeout().
> Though, given that the x86 version is based on cpu_relax(), it could
> just fold the sched_clock() check in.

I was trying to point out how delay() handles different configurations:
it prefers WFET when available, falls back to WFE when the event stream
is available, and finally falls back to cpu_relax() as a last resort.
The same logic can apply here; see the (untested) sketch at the end of
this mail. The x86 case can always use the cpu_relax() fallback, giving
the same behavior as smp_cond_load_relaxed() today.

Re your concern about "ARM platforms with much smaller default
timeouts": you could do something different when arch_timer_get_rate()
is too small. Although I'm not sure this is a huge concern, given that
delay() doesn't seem to care in the WFE case.

> Maybe another place to do this would be by KVM forcing a WFE timeout.
> Arguably that is needed regardless of whether we use a
> smp_cond*timeout() or not.
>
> --
> ankur
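To make the delay()-style fallback concrete, here is a rough, untested
sketch of the arm64 side, mirroring the cascade in
arch/arm64/lib/delay.c. Everything below is illustrative only:
poll_need_resched_timeout() is a made-up name, not an existing
interface, and a real implementation would want to arm the exclusive
monitor before WFET (the way __cmpwait_relaxed() does for WFE) so a
store to the flags word wakes it immediately rather than only at the
deadline.

#include <linux/thread_info.h>          /* current_thread_info() */
#include <linux/timex.h>                /* get_cycles() */
#include <asm/barrier.h>                /* wfet() */
#include <asm/cmpxchg.h>                /* __cmpwait_relaxed() */
#include <asm/alternative-macros.h>     /* alternative_has_cap_unlikely() */
#include <clocksource/arm_arch_timer.h> /* arch_timer_evtstrm_available() */

/*
 * Poll current_thread_info()->flags for _TIF_NEED_RESCHED, giving up
 * after timeout_cycles of the arch counter. Returns the flags last
 * observed, so the caller can distinguish wakeup from timeout.
 */
static unsigned long poll_need_resched_timeout(u64 timeout_cycles)
{
        unsigned long *flagsp = &current_thread_info()->flags;
        u64 start = get_cycles();
        unsigned long flags;

        while (!((flags = READ_ONCE(*flagsp)) & _TIF_NEED_RESCHED) &&
               (get_cycles() - start) < timeout_cycles) {
                if (alternative_has_cap_unlikely(ARM64_HAS_WFXT))
                        /* WFET: wakes on an event, or at the deadline */
                        wfet(start + timeout_cycles);
                else if (arch_timer_evtstrm_available())
                        /* LDXR+WFE: wakes on a flags store, or on the
                         * next event-stream tick (~100us by default) */
                        __cmpwait_relaxed(flagsp, flags);
                else
                        /* no bounded wait in hardware: plain polling,
                         * same shape as the x86/generic fallback */
                        cpu_relax();
        }

        return flags;
}

poll_idle() could then derive timeout_cycles from cpuidle_poll_time()
and arch_timer_get_rate(), instead of counting POLL_IDLE_RELAX_COUNT
iterations.

--
Haris Okanovic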