Okanovic, Haris <harisokn@xxxxxxxxxx> writes: > On Tue, 2024-04-02 at 16:17 -0700, Ankur Arora wrote: >> CAUTION: This email originated from outside of the organization. Do >> not click links or open attachments unless you can confirm the sender >> and know the content is safe. >> >> >> >> Mark Rutland <mark.rutland@xxxxxxx> writes: >> >> > On Mon, Apr 01, 2024 at 08:47:06PM -0500, Haris Okanovic wrote: >> > > An arm64 cpuidle driver with two states: (1) First polls for new >> > > runable >> > > tasks up to 100 us (by default) before (2) a wfi idle and awoken >> > > by >> > > interrupt (the current arm64 behavior). It allows CPUs to return >> > > from >> > > idle more quickly by avoiding the longer interrupt wakeup path, >> > > which >> > > may require EL1/EL2 transition in certain VM scenarios. >> > >> > Please start off with an explanation of the problem you're trying >> > to solve >> > (which IIUC is to wake up more quickly in certain cases), before >> > describing the >> > solution. That makes it *significantly* easier for people to review >> > this, since >> > once you have the problem statement in mind it's much easier to >> > understand how >> > the solution space follows from that. >> > >> > > Poll duration is optionally configured at load time via the >> > > poll_limit >> > > module parameter. >> > >> > Why should this be a configurable parameter? >> > >> > (note, at this point you haven't introduced any of the data below, >> > so the >> > trade-off isn't clear to anyone). >> > >> > > The default 100 us duration was experimentally chosen, by >> > > measuring QPS >> > > (queries per sec) of the MLPerf bert inference benchmark, which >> > > seems >> > > particularly susceptible to this change; see procedure below. 100 >> > > us is >> > > the inflection point where QPS stopped growing in a range of >> > > tested >> > > values. All results are from AWS m7g.16xlarge instances >> > > (Graviton3 SoC) >> > > with dedicated tenancy (dedicated hardware). >> > > >> > > > before | 10us | 25us | 50us | 100us | 125us | 150us | 200us | >> > > > 300us | >> > > > 5.87 | 5.91 | 5.96 | 6.01 | 6.06 | 6.07 | 6.06 | 6.06 | >> > > > 6.06 | >> > > >> > > Perf's scheduler benchmarks also improve with a range of >> > > poll_limit >> > > values >= 10 us. Higher limits produce near identical results >> > > within a >> > > 3% noise margin. The following tables are `perf bench sched` >> > > results, >> > > run times in seconds. >> > > >> > > `perf bench sched messaging -l 80000` >> > > > AWS instance | SoC | Before | After | % Change | >> > > > c6g.16xl (VM) | Graviton2 | 18.974 | 18.400 | none | >> > > > c7g.16xl (VM) | Graviton3 | 13.852 | 13.859 | none | >> > > > c6g.metal | Graviton2 | 17.621 | 16.744 | none | >> > > > c7g.metal | Graviton3 | 13.430 | 13.404 | none | >> > > >> > > `perf bench sched pipe -l 2500000` >> > > > AWS instance | SoC | Before | After | % Change | >> > > > c6g.16xl (VM) | Graviton2 | 30.158 | 15.181 | -50% | >> > > > c7g.16xl (VM) | Graviton3 | 18.289 | 12.067 | -34% | >> > > > c6g.metal | Graviton2 | 17.609 | 15.170 | -14% | >> > > > c7g.metal | Graviton3 | 14.103 | 12.304 | -13% | >> > > >> > > `perf bench sched seccomp-notify -l 2500000` >> > > > AWS instance | SoC | Before | After | % Change | >> > > > c6g.16xl (VM) | Graviton2 | 28.784 | 13.754 | -52% | >> > > > c7g.16xl (VM) | Graviton3 | 16.964 | 11.430 | -33% | >> > > > c6g.metal | Graviton2 | 15.717 | 13.536 | -14% | >> > > > c7g.metal | Graviton3 | 13.301 | 11.491 | -14% | >> > >> > Ok, so perf numbers for a busy workload go up. >> > >> > What happens for idle state residency on a mostly idle system? >> > >> > > Steps to run MLPerf bert inference on Ubuntu 22.04: >> > > sudo apt install build-essential python3 python3-pip >> > > pip install "pybind11[global]" tensorflow transformers >> > > export TF_ENABLE_ONEDNN_OPTS=1 >> > > export DNNL_DEFAULT_FPMATH_MODE=BF16 >> > > git clone https://github.com/mlcommons/inference.git --recursive >> > > cd inference >> > > git checkout v2.0 >> > > cd loadgen >> > > CFLAGS="-std=c++14" python3 setup.py bdist_wheel >> > > pip install dist/*.whl >> > > cd ../language/bert >> > > make setup >> > > python3 run.py --backend=tf --scenario=SingleStream >> > > >> > > Suggested-by: Ali Saidi <alisaidi@xxxxxxxxxx> >> > > Reviewed-by: Ali Saidi <alisaidi@xxxxxxxxxx> >> > > Reviewed-by: Geoff Blake <blakgeof@xxxxxxxxxx> >> > > Cc: Brian Silver <silverbr@xxxxxxxxxx> >> > > Signed-off-by: Haris Okanovic <harisokn@xxxxxxxxxx> >> > > --- >> > > drivers/cpuidle/Kconfig.arm | 13 ++ >> > > drivers/cpuidle/Makefile | 1 + >> > > drivers/cpuidle/cpuidle-arm-polling.c | 171 >> > > ++++++++++++++++++++++++++ >> > > 3 files changed, 185 insertions(+) >> > > create mode 100644 drivers/cpuidle/cpuidle-arm-polling.c >> > > >> > > diff --git a/drivers/cpuidle/Kconfig.arm >> > > b/drivers/cpuidle/Kconfig.arm >> > > index a1ee475d180d..484666dda38d 100644 >> > > --- a/drivers/cpuidle/Kconfig.arm >> > > +++ b/drivers/cpuidle/Kconfig.arm >> > > @@ -14,6 +14,19 @@ config ARM_CPUIDLE >> > > initialized by calling the CPU operations init idle hook >> > > provided by architecture code. >> > > >> > > +config ARM_POLL_CPUIDLE >> > > + bool "ARM64 CPU idle Driver with polling" >> > > + depends on ARM64 >> > > + depends on ARM_ARCH_TIMER_EVTSTREAM >> > > + select CPU_IDLE_MULTIPLE_DRIVERS >> > > + help >> > > + Select this to enable a polling cpuidle driver for ARM64: >> > > + The first state polls TIF_NEED_RESCHED for best latency on >> > > short >> > > + sleep intervals. The second state falls back to >> > > arch_cpu_idle() to >> > > + wait for interrupt. This is can be helpful in workloads >> > > that >> > > + frequently block/wake at short intervals or VMs where >> > > wakeup IPIs >> > > + are more expensive. >> > >> > Why is this a separate driver rather than an optional feature in >> > the existing >> > driver? >> > >> > The fact that this duplicates a bunch of code indicates to me that >> > this should >> > not be a separate driver. >> >> Also, the cpuidle-haltpoll driver is meant to do something quite >> similar. >> That driver polls adaptively based on the haltpoll governor's tuning >> of >> the polling period. >> >> However, cpuidle-haltpoll is currently x86 only. Mihai (also from >> Oracle) >> posted patches [1] adding support for ARM64. >> >> Haris, could you take a look at it and see if it does what you are >> looking for? The polling path in the linked version also uses >> smp_cond_load_relaxed() so even the mechanisms for both of these >> are fairly similar. > > Hi Ankur, > > I agree, except for that small bug in exit condition, your haltpoll > changes fundamentally do the same thing: Yup. Will address that bug and a few other things in the next version. >> @ int __cpuidle poll_idle(... >> - if (!(ret & _TIF_NEED_RESCHED)) >> + if (ret & _TIF_NEED_RESCHE > > I'll follow up with another patch for AWS Graviton when your team is > finished. > > Do you have a rough ETA of when your changes will land in master? That I guess would be determined by the maintainers, but I should be able to send it out the coming week. Thanks Ankur >> >> (I'll be sending out the next version shortly. Happy to Cc you if you >> would like to try that out.) > > Yes, please do! > > Thanks, > Haris Okanovic > >> >> Thanks >> Ankur >> >> [1] >> https://lore.kernel.org/lkml/1707982910-27680-1-git-send-email-mihai.carabas@xxxxxxxxxx/ >> >> > >> > > + >> > > config ARM_PSCI_CPUIDLE >> > > bool "PSCI CPU idle Driver" >> > > depends on ARM_PSCI_FW >> > > diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile >> > > index d103342b7cfc..23c21422792d 100644 >> > > --- a/drivers/cpuidle/Makefile >> > > +++ b/drivers/cpuidle/Makefile >> > > @@ -22,6 +22,7 @@ obj-$(CONFIG_ARM_U8500_CPUIDLE) += >> > > cpuidle-ux500.o >> > > obj-$(CONFIG_ARM_AT91_CPUIDLE) += cpuidle-at91.o >> > > obj-$(CONFIG_ARM_EXYNOS_CPUIDLE) += cpuidle-exynos.o >> > > obj-$(CONFIG_ARM_CPUIDLE) += cpuidle-arm.o >> > > +obj-$(CONFIG_ARM_POLL_CPUIDLE) += cpuidle-arm- >> > > polling.o >> > > obj-$(CONFIG_ARM_PSCI_CPUIDLE) += cpuidle-psci.o >> > > obj-$(CONFIG_ARM_PSCI_CPUIDLE_DOMAIN) += cpuidle-psci- >> > > domain.o >> > > obj-$(CONFIG_ARM_TEGRA_CPUIDLE) += cpuidle-tegra.o >> > > diff --git a/drivers/cpuidle/cpuidle-arm-polling.c >> > > b/drivers/cpuidle/cpuidle-arm-polling.c >> > > new file mode 100644 >> > > index 000000000000..bca128568114 >> > > --- /dev/null >> > > +++ b/drivers/cpuidle/cpuidle-arm-polling.c >> > > @@ -0,0 +1,171 @@ >> > > +// SPDX-License-Identifier: GPL-2.0 >> > > +/* >> > > + * ARM64 CPU idle driver using wfe polling >> > > + * >> > > + * Copyright 2024 Amazon.com, Inc. or its affiliates. All rights >> > > reserved. >> > > + * >> > > + * Authors: >> > > + * Haris Okanovic <harisokn@xxxxxxxxxx> >> > > + * Brian Silver <silverbr@xxxxxxxxxx> >> > > + * >> > > + * Based on cpuidle-arm.c >> > > + * Copyright (C) 2014 ARM Ltd. >> > > + * Author: Lorenzo Pieralisi <lorenzo.pieralisi@xxxxxxx> >> > > + */ >> > > + >> > > +#include <linux/cpu.h> >> > > +#include <linux/cpu_cooling.h> >> > > +#include <linux/cpuidle.h> >> > > +#include <linux/sched/clock.h> >> > > + >> > > +#include <asm/cpuidle.h> >> > > +#include <asm/readex.h> >> > > + >> > > +#include "dt_idle_states.h" >> > > + >> > > +/* Max duration of the wfe() poll loop in us, before >> > > transitioning to >> > > + * arch_cpu_idle()/wfi() sleep. >> > > + */ >> > >> > /* >> > * Comments should have the leading '/*' on a separate line. >> > * See >> > https://www.kernel.org/doc/html/v6.8/process/coding-style.html#commenting >> > */ >> > >> > > +#define DEFAULT_POLL_LIMIT_US 100 >> > > +static unsigned int poll_limit __read_mostly = >> > > DEFAULT_POLL_LIMIT_US; >> > > + >> > > +/* >> > > + * arm_idle_wfe_poll - Polls state in wfe loop until reschedule >> > > is >> > > + * needed or timeout >> > > + */ >> > > +static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device >> > > *dev, >> > > + struct cpuidle_driver *drv, int idx) >> > > +{ >> > > + u64 time_start, time_limit; >> > > + >> > > + time_start = local_clock(); >> > > + dev->poll_time_limit = false; >> > > + >> > > + local_irq_enable(); >> > >> > Why enable IRQs here? We don't do that in the regular cpuidle-arm >> > driver, nor >> > the cpuidle-psci driver, and there's no explanation here or in the >> > commit message. >> > >> > How does this interact with RCU? Is that still watching or are we >> > in an >> > extended quiescent state? For PSCI idle states we enter an EQS, and >> > it seems >> > like we probably should here... >> > >> > > + >> > > + if (current_set_polling_and_test()) >> > > + goto end; >> > > + >> > > + time_limit = cpuidle_poll_time(drv, dev); >> > > + >> > > + do { >> > > + // exclusive read arms the monitor for wfe >> > > + if (__READ_ONCE_EX(current_thread_info()->flags) & >> > > _TIF_NEED_RESCHED) >> > > + goto end; >> > > + >> > > + // may exit prematurely, see >> > > ARM_ARCH_TIMER_EVTSTREAM >> > > + wfe(); >> > > + } while (local_clock() - time_start < time_limit); >> > >> > .. and if the EVTSTREAM is disabled, we'll sit in WFE forever >> > rather than >> > entering a deeper idle state, which doesn't seem desirable. >> > >> > It's worth noting that now that we have WFET, we'll probably want >> > to disable >> > the EVTSTREAM by default at some point, at least in some >> > configurations, since >> > that'll be able to sit in a WFE state for longer while also >> > reliably waking up >> > when required. >> > >> > I suspect we want something like an smp_load_acquire_timeout() here >> > to do the >> > wait in arch code (allowing us to use WFET), and enabling this >> > state will >> > depend on either having WFET or EVTSTREAM. >> > >> > > + >> > > + dev->poll_time_limit = true; >> > > + >> > > +end: >> > > + current_clr_polling(); >> > > + return idx; >> > > +} >> > > + >> > > +/* >> > > + * arm_idle_wfi - Places cpu in lower power state until >> > > interrupt, >> > > + * a fallback to polling >> > > + */ >> > > +static int __cpuidle arm_idle_wfi(struct cpuidle_device *dev, >> > > + struct cpuidle_driver *drv, int idx) >> > > +{ >> > > + if (current_clr_polling_and_test()) { >> > > + local_irq_enable(); >> > > + return idx; >> > > + } >> > >> > Same as above, why enable IRQs here? >> > >> > > + arch_cpu_idle(); >> > > + return idx; >> > >> > .. and if we need to enable IRQs in the other cases above, why do >> > we *not* >> > need to enable them here? >> > >> > > +} >> > > + >> > > +static struct cpuidle_driver arm_poll_idle_driver __initdata = { >> > > + .name = "arm_poll_idle", >> > > + .owner = THIS_MODULE, >> > > + .states = { >> > > + { >> > > + .enter = arm_idle_wfe_poll, >> > > + .exit_latency = 0, >> > > + .target_residency = 0, >> > > + .exit_latency_ns = 0, >> > > + .power_usage = UINT_MAX, >> > > + .flags = >> > > CPUIDLE_FLAG_POLLING, >> > > + .name = "WFE", >> > > + .desc = "ARM WFE", >> > > + }, >> > > + { >> > > + .enter = arm_idle_wfi, >> > > + .exit_latency = >> > > DEFAULT_POLL_LIMIT_US, >> > > + .target_residency = >> > > DEFAULT_POLL_LIMIT_US, >> > > + .power_usage = UINT_MAX, >> > > + .name = "WFI", >> > > + .desc = "ARM WFI", >> > > + }, >> > > + }, >> > > + .state_count = 2, >> > > +}; >> > >> > How does this interact with the existing driver? >> > >> > How does DEFAULT_POLL_LIMIT_US compare with PSCI idle states? >> > >> > > + >> > > +/* >> > > + * arm_poll_init_cpu - Initializes arm cpuidle polling driver >> > > for one cpu >> > > + */ >> > > +static int __init arm_poll_init_cpu(int cpu) >> > > +{ >> > > + int ret; >> > > + struct cpuidle_driver *drv; >> > > + >> > > + drv = kmemdup(&arm_poll_idle_driver, sizeof(*drv), >> > > GFP_KERNEL); >> > > + if (!drv) >> > > + return -ENOMEM; >> > > + >> > > + drv->cpumask = (struct cpumask *)cpumask_of(cpu); >> > > + drv->states[1].exit_latency = poll_limit; >> > > + drv->states[1].target_residency = poll_limit; >> > > + >> > > + ret = cpuidle_register(drv, NULL); >> > > + if (ret) { >> > > + pr_err("failed to register driver: %d, cpu %d\n", >> > > ret, cpu); >> > > + goto out_kfree_drv; >> > > + } >> > > + >> > > + pr_info("registered driver cpu %d\n", cpu); >> > >> > This does not need to be printed for each CPU. >> > >> > Mark. >> > >> > > + >> > > + cpuidle_cooling_register(drv); >> > > + >> > > + return 0; >> > > + >> > > +out_kfree_drv: >> > > + kfree(drv); >> > > + return ret; >> > > +} >> > > + >> > > +/* >> > > + * arm_poll_init - Initializes arm cpuidle polling driver >> > > + */ >> > > +static int __init arm_poll_init(void) >> > > +{ >> > > + int cpu, ret; >> > > + struct cpuidle_driver *drv; >> > > + struct cpuidle_device *dev; >> > > + >> > > + for_each_possible_cpu(cpu) { >> > > + ret = arm_poll_init_cpu(cpu); >> > > + if (ret) >> > > + goto out_fail; >> > > + } >> > > + >> > > + return 0; >> > > + >> > > +out_fail: >> > > + pr_info("de-register all"); >> > > + while (--cpu >= 0) { >> > > + dev = per_cpu(cpuidle_devices, cpu); >> > > + drv = cpuidle_get_cpu_driver(dev); >> > > + cpuidle_unregister(drv); >> > > + kfree(drv); >> > > + } >> > > + >> > > + return ret; >> > > +} >> > > + >> > > +module_param(poll_limit, uint, 0444); >> > > +device_initcall(arm_poll_init); >> > > -- >> > > 2.34.1 >> > > >> > > >> >> >> -- >> ankur -- ankur
- References:
- [PATCH 1/3] arm64: Add TIF_POLLING_NRFLAG
- From: Haris Okanovic
- [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- From: Haris Okanovic
- Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- From: Mark Rutland
- Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- From: Ankur Arora
- Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- From: Okanovic, Haris
- [PATCH 1/3] arm64: Add TIF_POLLING_NRFLAG
- Prev by Date: Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- Next by Date: RE: [PATCH 2/3] arm64: add __READ_ONCE_EX()
- Previous by thread: Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- Next by thread: Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- Index(es):
![]() |