On Tue, 2024-04-02 at 16:17 -0700, Ankur Arora wrote: > CAUTION: This email originated from outside of the organization. Do > not click links or open attachments unless you can confirm the sender > and know the content is safe. > > > > Mark Rutland <mark.rutland@xxxxxxx> writes: > > > On Mon, Apr 01, 2024 at 08:47:06PM -0500, Haris Okanovic wrote: > > > An arm64 cpuidle driver with two states: (1) First polls for new > > > runable > > > tasks up to 100 us (by default) before (2) a wfi idle and awoken > > > by > > > interrupt (the current arm64 behavior). It allows CPUs to return > > > from > > > idle more quickly by avoiding the longer interrupt wakeup path, > > > which > > > may require EL1/EL2 transition in certain VM scenarios. > > > > Please start off with an explanation of the problem you're trying > > to solve > > (which IIUC is to wake up more quickly in certain cases), before > > describing the > > solution. That makes it *significantly* easier for people to review > > this, since > > once you have the problem statement in mind it's much easier to > > understand how > > the solution space follows from that. > > > > > Poll duration is optionally configured at load time via the > > > poll_limit > > > module parameter. > > > > Why should this be a configurable parameter? > > > > (note, at this point you haven't introduced any of the data below, > > so the > > trade-off isn't clear to anyone). > > > > > The default 100 us duration was experimentally chosen, by > > > measuring QPS > > > (queries per sec) of the MLPerf bert inference benchmark, which > > > seems > > > particularly susceptible to this change; see procedure below. 100 > > > us is > > > the inflection point where QPS stopped growing in a range of > > > tested > > > values. All results are from AWS m7g.16xlarge instances > > > (Graviton3 SoC) > > > with dedicated tenancy (dedicated hardware). > > > > > > > before | 10us | 25us | 50us | 100us | 125us | 150us | 200us | > > > > 300us | > > > > 5.87 | 5.91 | 5.96 | 6.01 | 6.06 | 6.07 | 6.06 | 6.06 | > > > > 6.06 | > > > > > > Perf's scheduler benchmarks also improve with a range of > > > poll_limit > > > values >= 10 us. Higher limits produce near identical results > > > within a > > > 3% noise margin. The following tables are `perf bench sched` > > > results, > > > run times in seconds. > > > > > > `perf bench sched messaging -l 80000` > > > > AWS instance | SoC | Before | After | % Change | > > > > c6g.16xl (VM) | Graviton2 | 18.974 | 18.400 | none | > > > > c7g.16xl (VM) | Graviton3 | 13.852 | 13.859 | none | > > > > c6g.metal | Graviton2 | 17.621 | 16.744 | none | > > > > c7g.metal | Graviton3 | 13.430 | 13.404 | none | > > > > > > `perf bench sched pipe -l 2500000` > > > > AWS instance | SoC | Before | After | % Change | > > > > c6g.16xl (VM) | Graviton2 | 30.158 | 15.181 | -50% | > > > > c7g.16xl (VM) | Graviton3 | 18.289 | 12.067 | -34% | > > > > c6g.metal | Graviton2 | 17.609 | 15.170 | -14% | > > > > c7g.metal | Graviton3 | 14.103 | 12.304 | -13% | > > > > > > `perf bench sched seccomp-notify -l 2500000` > > > > AWS instance | SoC | Before | After | % Change | > > > > c6g.16xl (VM) | Graviton2 | 28.784 | 13.754 | -52% | > > > > c7g.16xl (VM) | Graviton3 | 16.964 | 11.430 | -33% | > > > > c6g.metal | Graviton2 | 15.717 | 13.536 | -14% | > > > > c7g.metal | Graviton3 | 13.301 | 11.491 | -14% | > > > > Ok, so perf numbers for a busy workload go up. > > > > What happens for idle state residency on a mostly idle system? > > > > > Steps to run MLPerf bert inference on Ubuntu 22.04: > > > sudo apt install build-essential python3 python3-pip > > > pip install "pybind11[global]" tensorflow transformers > > > export TF_ENABLE_ONEDNN_OPTS=1 > > > export DNNL_DEFAULT_FPMATH_MODE=BF16 > > > git clone https://github.com/mlcommons/inference.git --recursive > > > cd inference > > > git checkout v2.0 > > > cd loadgen > > > CFLAGS="-std=c++14" python3 setup.py bdist_wheel > > > pip install dist/*.whl > > > cd ../language/bert > > > make setup > > > python3 run.py --backend=tf --scenario=SingleStream > > > > > > Suggested-by: Ali Saidi <alisaidi@xxxxxxxxxx> > > > Reviewed-by: Ali Saidi <alisaidi@xxxxxxxxxx> > > > Reviewed-by: Geoff Blake <blakgeof@xxxxxxxxxx> > > > Cc: Brian Silver <silverbr@xxxxxxxxxx> > > > Signed-off-by: Haris Okanovic <harisokn@xxxxxxxxxx> > > > --- > > > drivers/cpuidle/Kconfig.arm | 13 ++ > > > drivers/cpuidle/Makefile | 1 + > > > drivers/cpuidle/cpuidle-arm-polling.c | 171 > > > ++++++++++++++++++++++++++ > > > 3 files changed, 185 insertions(+) > > > create mode 100644 drivers/cpuidle/cpuidle-arm-polling.c > > > > > > diff --git a/drivers/cpuidle/Kconfig.arm > > > b/drivers/cpuidle/Kconfig.arm > > > index a1ee475d180d..484666dda38d 100644 > > > --- a/drivers/cpuidle/Kconfig.arm > > > +++ b/drivers/cpuidle/Kconfig.arm > > > @@ -14,6 +14,19 @@ config ARM_CPUIDLE > > > initialized by calling the CPU operations init idle hook > > > provided by architecture code. > > > > > > +config ARM_POLL_CPUIDLE > > > + bool "ARM64 CPU idle Driver with polling" > > > + depends on ARM64 > > > + depends on ARM_ARCH_TIMER_EVTSTREAM > > > + select CPU_IDLE_MULTIPLE_DRIVERS > > > + help > > > + Select this to enable a polling cpuidle driver for ARM64: > > > + The first state polls TIF_NEED_RESCHED for best latency on > > > short > > > + sleep intervals. The second state falls back to > > > arch_cpu_idle() to > > > + wait for interrupt. This is can be helpful in workloads > > > that > > > + frequently block/wake at short intervals or VMs where > > > wakeup IPIs > > > + are more expensive. > > > > Why is this a separate driver rather than an optional feature in > > the existing > > driver? > > > > The fact that this duplicates a bunch of code indicates to me that > > this should > > not be a separate driver. > > Also, the cpuidle-haltpoll driver is meant to do something quite > similar. > That driver polls adaptively based on the haltpoll governor's tuning > of > the polling period. > > However, cpuidle-haltpoll is currently x86 only. Mihai (also from > Oracle) > posted patches [1] adding support for ARM64. > > Haris, could you take a look at it and see if it does what you are > looking for? The polling path in the linked version also uses > smp_cond_load_relaxed() so even the mechanisms for both of these > are fairly similar. Hi Ankur, I agree, except for that small bug in exit condition, your haltpoll changes fundamentally do the same thing: > @ int __cpuidle poll_idle(... > - if (!(ret & _TIF_NEED_RESCHED)) > + if (ret & _TIF_NEED_RESCHE I'll follow up with another patch for AWS Graviton when your team is finished. Do you have a rough ETA of when your changes will land in master? > > (I'll be sending out the next version shortly. Happy to Cc you if you > would like to try that out.) Yes, please do! Thanks, Haris Okanovic > > Thanks > Ankur > > [1] > https://lore.kernel.org/lkml/1707982910-27680-1-git-send-email-mihai.carabas@xxxxxxxxxx/ > > > > > > + > > > config ARM_PSCI_CPUIDLE > > > bool "PSCI CPU idle Driver" > > > depends on ARM_PSCI_FW > > > diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile > > > index d103342b7cfc..23c21422792d 100644 > > > --- a/drivers/cpuidle/Makefile > > > +++ b/drivers/cpuidle/Makefile > > > @@ -22,6 +22,7 @@ obj-$(CONFIG_ARM_U8500_CPUIDLE) += > > > cpuidle-ux500.o > > > obj-$(CONFIG_ARM_AT91_CPUIDLE) += cpuidle-at91.o > > > obj-$(CONFIG_ARM_EXYNOS_CPUIDLE) += cpuidle-exynos.o > > > obj-$(CONFIG_ARM_CPUIDLE) += cpuidle-arm.o > > > +obj-$(CONFIG_ARM_POLL_CPUIDLE) += cpuidle-arm- > > > polling.o > > > obj-$(CONFIG_ARM_PSCI_CPUIDLE) += cpuidle-psci.o > > > obj-$(CONFIG_ARM_PSCI_CPUIDLE_DOMAIN) += cpuidle-psci- > > > domain.o > > > obj-$(CONFIG_ARM_TEGRA_CPUIDLE) += cpuidle-tegra.o > > > diff --git a/drivers/cpuidle/cpuidle-arm-polling.c > > > b/drivers/cpuidle/cpuidle-arm-polling.c > > > new file mode 100644 > > > index 000000000000..bca128568114 > > > --- /dev/null > > > +++ b/drivers/cpuidle/cpuidle-arm-polling.c > > > @@ -0,0 +1,171 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > +/* > > > + * ARM64 CPU idle driver using wfe polling > > > + * > > > + * Copyright 2024 Amazon.com, Inc. or its affiliates. All rights > > > reserved. > > > + * > > > + * Authors: > > > + * Haris Okanovic <harisokn@xxxxxxxxxx> > > > + * Brian Silver <silverbr@xxxxxxxxxx> > > > + * > > > + * Based on cpuidle-arm.c > > > + * Copyright (C) 2014 ARM Ltd. > > > + * Author: Lorenzo Pieralisi <lorenzo.pieralisi@xxxxxxx> > > > + */ > > > + > > > +#include <linux/cpu.h> > > > +#include <linux/cpu_cooling.h> > > > +#include <linux/cpuidle.h> > > > +#include <linux/sched/clock.h> > > > + > > > +#include <asm/cpuidle.h> > > > +#include <asm/readex.h> > > > + > > > +#include "dt_idle_states.h" > > > + > > > +/* Max duration of the wfe() poll loop in us, before > > > transitioning to > > > + * arch_cpu_idle()/wfi() sleep. > > > + */ > > > > /* > > * Comments should have the leading '/*' on a separate line. > > * See > > https://www.kernel.org/doc/html/v6.8/process/coding-style.html#commenting > > */ > > > > > +#define DEFAULT_POLL_LIMIT_US 100 > > > +static unsigned int poll_limit __read_mostly = > > > DEFAULT_POLL_LIMIT_US; > > > + > > > +/* > > > + * arm_idle_wfe_poll - Polls state in wfe loop until reschedule > > > is > > > + * needed or timeout > > > + */ > > > +static int __cpuidle arm_idle_wfe_poll(struct cpuidle_device > > > *dev, > > > + struct cpuidle_driver *drv, int idx) > > > +{ > > > + u64 time_start, time_limit; > > > + > > > + time_start = local_clock(); > > > + dev->poll_time_limit = false; > > > + > > > + local_irq_enable(); > > > > Why enable IRQs here? We don't do that in the regular cpuidle-arm > > driver, nor > > the cpuidle-psci driver, and there's no explanation here or in the > > commit message. > > > > How does this interact with RCU? Is that still watching or are we > > in an > > extended quiescent state? For PSCI idle states we enter an EQS, and > > it seems > > like we probably should here... > > > > > + > > > + if (current_set_polling_and_test()) > > > + goto end; > > > + > > > + time_limit = cpuidle_poll_time(drv, dev); > > > + > > > + do { > > > + // exclusive read arms the monitor for wfe > > > + if (__READ_ONCE_EX(current_thread_info()->flags) & > > > _TIF_NEED_RESCHED) > > > + goto end; > > > + > > > + // may exit prematurely, see > > > ARM_ARCH_TIMER_EVTSTREAM > > > + wfe(); > > > + } while (local_clock() - time_start < time_limit); > > > > .. and if the EVTSTREAM is disabled, we'll sit in WFE forever > > rather than > > entering a deeper idle state, which doesn't seem desirable. > > > > It's worth noting that now that we have WFET, we'll probably want > > to disable > > the EVTSTREAM by default at some point, at least in some > > configurations, since > > that'll be able to sit in a WFE state for longer while also > > reliably waking up > > when required. > > > > I suspect we want something like an smp_load_acquire_timeout() here > > to do the > > wait in arch code (allowing us to use WFET), and enabling this > > state will > > depend on either having WFET or EVTSTREAM. > > > > > + > > > + dev->poll_time_limit = true; > > > + > > > +end: > > > + current_clr_polling(); > > > + return idx; > > > +} > > > + > > > +/* > > > + * arm_idle_wfi - Places cpu in lower power state until > > > interrupt, > > > + * a fallback to polling > > > + */ > > > +static int __cpuidle arm_idle_wfi(struct cpuidle_device *dev, > > > + struct cpuidle_driver *drv, int idx) > > > +{ > > > + if (current_clr_polling_and_test()) { > > > + local_irq_enable(); > > > + return idx; > > > + } > > > > Same as above, why enable IRQs here? > > > > > + arch_cpu_idle(); > > > + return idx; > > > > .. and if we need to enable IRQs in the other cases above, why do > > we *not* > > need to enable them here? > > > > > +} > > > + > > > +static struct cpuidle_driver arm_poll_idle_driver __initdata = { > > > + .name = "arm_poll_idle", > > > + .owner = THIS_MODULE, > > > + .states = { > > > + { > > > + .enter = arm_idle_wfe_poll, > > > + .exit_latency = 0, > > > + .target_residency = 0, > > > + .exit_latency_ns = 0, > > > + .power_usage = UINT_MAX, > > > + .flags = > > > CPUIDLE_FLAG_POLLING, > > > + .name = "WFE", > > > + .desc = "ARM WFE", > > > + }, > > > + { > > > + .enter = arm_idle_wfi, > > > + .exit_latency = > > > DEFAULT_POLL_LIMIT_US, > > > + .target_residency = > > > DEFAULT_POLL_LIMIT_US, > > > + .power_usage = UINT_MAX, > > > + .name = "WFI", > > > + .desc = "ARM WFI", > > > + }, > > > + }, > > > + .state_count = 2, > > > +}; > > > > How does this interact with the existing driver? > > > > How does DEFAULT_POLL_LIMIT_US compare with PSCI idle states? > > > > > + > > > +/* > > > + * arm_poll_init_cpu - Initializes arm cpuidle polling driver > > > for one cpu > > > + */ > > > +static int __init arm_poll_init_cpu(int cpu) > > > +{ > > > + int ret; > > > + struct cpuidle_driver *drv; > > > + > > > + drv = kmemdup(&arm_poll_idle_driver, sizeof(*drv), > > > GFP_KERNEL); > > > + if (!drv) > > > + return -ENOMEM; > > > + > > > + drv->cpumask = (struct cpumask *)cpumask_of(cpu); > > > + drv->states[1].exit_latency = poll_limit; > > > + drv->states[1].target_residency = poll_limit; > > > + > > > + ret = cpuidle_register(drv, NULL); > > > + if (ret) { > > > + pr_err("failed to register driver: %d, cpu %d\n", > > > ret, cpu); > > > + goto out_kfree_drv; > > > + } > > > + > > > + pr_info("registered driver cpu %d\n", cpu); > > > > This does not need to be printed for each CPU. > > > > Mark. > > > > > + > > > + cpuidle_cooling_register(drv); > > > + > > > + return 0; > > > + > > > +out_kfree_drv: > > > + kfree(drv); > > > + return ret; > > > +} > > > + > > > +/* > > > + * arm_poll_init - Initializes arm cpuidle polling driver > > > + */ > > > +static int __init arm_poll_init(void) > > > +{ > > > + int cpu, ret; > > > + struct cpuidle_driver *drv; > > > + struct cpuidle_device *dev; > > > + > > > + for_each_possible_cpu(cpu) { > > > + ret = arm_poll_init_cpu(cpu); > > > + if (ret) > > > + goto out_fail; > > > + } > > > + > > > + return 0; > > > + > > > +out_fail: > > > + pr_info("de-register all"); > > > + while (--cpu >= 0) { > > > + dev = per_cpu(cpuidle_devices, cpu); > > > + drv = cpuidle_get_cpu_driver(dev); > > > + cpuidle_unregister(drv); > > > + kfree(drv); > > > + } > > > + > > > + return ret; > > > +} > > > + > > > +module_param(poll_limit, uint, 0444); > > > +device_initcall(arm_poll_init); > > > -- > > > 2.34.1 > > > > > > > > > -- > ankur
- Follow-Ups:
- Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- From: Ankur Arora
- Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- References:
- [PATCH 1/3] arm64: Add TIF_POLLING_NRFLAG
- From: Haris Okanovic
- [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- From: Haris Okanovic
- Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- From: Mark Rutland
- Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- From: Ankur Arora
- [PATCH 1/3] arm64: Add TIF_POLLING_NRFLAG
- Prev by Date: Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- Next by Date: Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- Previous by thread: Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- Next by thread: Re: [PATCH 3/3] arm64: cpuidle: Add arm_poll_idle
- Index(es):