zhenglifeng (A) <zhenglifeng1@xxxxxxxxxx> writes:

> On 2024/9/26 7:24, Ankur Arora wrote:
>> This patchset enables the cpuidle-haltpoll driver and its namesake
>> governor on arm64. This is specifically interesting for KVM guests,
>> where it reduces IPC latencies.
>>
>> Comparing idle switching latencies on an arm64 KVM guest with
>> perf bench sched pipe:
>>
>>                                          usecs/op    %stdev
>>
>>   no haltpoll (baseline)                    13.48    +-  5.19%
>>   with haltpoll                              6.84    +- 22.07%
>>
>> No change in performance for a similar test on x86:
>>
>>                                          usecs/op    %stdev
>>
>>   haltpoll w/ cpu_relax() (baseline)         4.75    +-  1.76%
>>   haltpoll w/ smp_cond_load_relaxed()        4.78    +-  2.31%
>>
>> Both sets of tests were on otherwise idle systems with guest VCPUs
>> pinned to specific PCPUs. One reason for the higher stdev on arm64
>> is that trapping of the WFE instruction by the host KVM is contingent
>> on the number of tasks on the runqueue.
>>
>> Tomohiro Misono and Haris Okanovic also report similar latency
>> improvements on Grace and Graviton systems (for v7) [1] [2].
>>
>> The patch series is organized in three parts:
>>
>>  - patch 1 reorganizes the poll_idle() loop, switching to
>>    smp_cond_load_relaxed() in the polling loop. Relatedly, patches
>>    2 and 3 mangle the config option ARCH_HAS_CPU_RELAX, renaming it
>>    to ARCH_HAS_OPTIMIZED_POLL.
>>
>>  - patches 4-6 reorganize the haltpoll selection and init logic
>>    to allow architecture code to select it.
>>
>>  - and finally, patches 7-11 add the bits for arm64 support.
>>
>> What is still missing: this series largely completes the haltpoll
>> side of functionality for arm64. There are, however, a few related
>> areas that still need to be threshed out:
>>
>>  - WFET support: WFE on arm64 does not guarantee that poll_idle()
>>    terminates within halt_poll_ns. Using WFET would address this.
>>  - KVM_NO_POLL support on arm64.
>>  - KVM TWED support on arm64: allow the host to limit time spent
>>    in WFE.
>>
>> Changelog:
>>
>> v8: No logic changes. Largely a respin of v7, with the changes
>> noted below:
>>
>>  - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
>>    own patch.
>>    (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
>>  - address comments simplifying arm64 support (Will Deacon)
>>    (patch-11 "arm64: support cpuidle-haltpoll")
>>
>> v7: No significant logic changes. Mostly a respin of v6.
>>
>>  - minor cleanup in poll_idle() (Christoph Lameter)
>>  - fix conflicts due to code movement in arch/arm64/kernel/cpuidle.c
>>    (Tomohiro Misono)
>>
>> v6:
>>
>>  - reordered the patches to keep the poll_idle() and
>>    ARCH_HAS_OPTIMIZED_POLL changes together (comment from
>>    Christoph Lameter)
>>  - threshed out the commit messages a bit more (comments from
>>    Christoph Lameter, Sudeep Holla)
>>  - also reworked the selection of cpuidle-haltpoll. Now selected
>>    based on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>>  - moved back to arch_haltpoll_want() (comment from Joao Martins).
>>    Also, arch_haltpoll_want() now takes the force parameter and is
>>    now responsible for the complete selection (or not) of haltpoll.
>>  - fixed the build breakage on i386
>>  - fixed the cpuidle-haltpoll module breakage on arm64 (comments
>>    from Tomohiro Misono, Haris Okanovic)
>>
>> v5:
>>
>>  - reworked the poll_idle() loop around smp_cond_load_relaxed()
>>    (review comment from Tomohiro Misono)
>>  - also reworked the selection of cpuidle-haltpoll. Now selected
>>    based on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>>  - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
>>    arm64 now depends on the event stream being enabled.
>>  - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris
>>    Okanovic)
>>  - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.
>>
>> v4 changes from v3:
>>
>>  - change 7/8 per Rafael's input: drop the parens and use ret for
>>    the final check
>>  - add 8/8, which renames the guard for building poll_state
>>
>> v3 changes from v2:
>>
>>  - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from
>>    arch/x86/Kconfig
>>  - add Ack-by from Rafael Wysocki on 2/7
>>
>> v2 changes from v1:
>>
>>  - added patch 7, where we replace cpu_relax() with
>>    smp_cond_load_relaxed() per PeterZ (this improves the CPU cycles
>>    consumed in the tests above by at least 50%: 10,716,881,137 now
>>    vs 14,503,014,257 before)
>>  - removed the ifdef from patch 1 per RafaelW
>>
>> Please review.
>>
>> [1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>> [2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@xxxxxxxxxx/
>>
>> Ankur Arora (6):
>>   cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
>>   cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
>>   arm64: idle: export arch_cpu_idle
>>   arm64: select ARCH_HAS_OPTIMIZED_POLL
>>   cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64
>>   arm64: support cpuidle-haltpoll
>>
>> Joao Martins (4):
>>   Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
>>   cpuidle-haltpoll: define arch_haltpoll_want()
>>   governors/haltpoll: drop kvm_para_available() check
>>   arm64: define TIF_POLLING_NRFLAG
>>
>> Mihai Carabas (1):
>>   cpuidle/poll_state: poll via smp_cond_load_relaxed()
>>
>>  arch/Kconfig                              |  3 +++
>>  arch/arm64/Kconfig                        |  7 +++++++
>>  arch/arm64/include/asm/cpuidle_haltpoll.h | 24 ++++++++++++++++++++++++
>>  arch/arm64/include/asm/thread_info.h      |  2 ++
>>  arch/arm64/kernel/idle.c                  |  1 +
>>  arch/x86/Kconfig                          |  5 ++---
>>  arch/x86/include/asm/cpuidle_haltpoll.h   |  1 +
>>  arch/x86/kernel/kvm.c                     | 13 +++++++++++++
>>  drivers/acpi/processor_idle.c             |  4 ++--
>>  drivers/cpuidle/Kconfig                   |  5 ++---
>>  drivers/cpuidle/Makefile                  |  2 +-
>>  drivers/cpuidle/cpuidle-haltpoll.c        | 12 +-----------
>>  drivers/cpuidle/governors/haltpoll.c      |  6 +-----
>>  drivers/cpuidle/poll_state.c              | 22 ++++++++++++++------
>>  drivers/idle/Kconfig                      |  1 +
>>  include/linux/cpuidle.h                   |  2 +-
>>  include/linux/cpuidle_haltpoll.h          |  5 +++++
>>  17 files changed, 83 insertions(+), 32 deletions(-)
>>  create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
>>
>
> Hi Ankur,
>
> Thanks for the patches!
>
> We have tested these patches on our machine, with an adaptation for
> ACPI LPI states rather than C-states.
>
> Including the polling state, there are three states to enter. Comparing
> the idle switching latencies of the different states with perf bench
> sched pipe:
>
>                            usecs/op    %stdev
>
>   state0 (polling state)       7.36    +- 0.35%
>   state1                       8.78    +- 0.46%
>   state2                      77.32    +- 5.50%
>
> It turns out that this works on our machine.
>
> Tested-by: Lifeng Zheng <zhenglifeng1@xxxxxxxxxx>

Great. Thanks Lifeng.

> The adaptation for ACPI LPI states is shown below as a patch. Feel free
> to include this patch as part of your series, or I can also send it out
> after your series is merged.

Ah, so polling for the regular ACPI driver. From a quick look the patch
looks good, but this series is mostly focused on haltpoll, so I think
this patch can go in after. Please Cc me when you send it.
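As an aside, for anyone trying the series against other drivers: the
heart of patch 1 is that the poll_idle() loop now waits via
smp_cond_load_relaxed() on the thread flags instead of an open-coded
cpu_relax() spin. The resulting loop is roughly of this shape (a
simplified sketch, not the exact patch; time_start, limit and
loop_count follow the existing poll_state.c locals):

	u64 time_start = local_clock_noinstr();
	u64 limit = cpuidle_poll_time(drv, dev);
	unsigned int loop_count = 0;

	while (!need_resched()) {
		unsigned long flags;

		/*
		 * Wait until TIF_NEED_RESCHED is set or we have spun
		 * through POLL_IDLE_RELAX_COUNT iterations. On arm64
		 * this waits in WFE on the flags cacheline instead of
		 * hammering cpu_relax().
		 */
		flags = smp_cond_load_relaxed(&current_thread_info()->flags,
					      VAL & _TIF_NEED_RESCHED ||
					      loop_count++ >= POLL_IDLE_RELAX_COUNT);
		if (flags & _TIF_NEED_RESCHED)
			break;

		/* Relax budget exhausted; recheck the poll time limit. */
		loop_count = 0;
		if (local_clock_noinstr() - time_start > limit) {
			dev->poll_time_limit = true;
			break;
		}
	}

On arm64, smp_cond_load_relaxed() compiles down to an LDXR/WFE wait, so
the CPU sleeps until the flags cacheline is written (or the event
stream fires) rather than spinning. That is where the latency
improvement above comes from, and also why the WFET item in the TODO
list matters: it would put a hard bound on each wait.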
Thanks
Ankur

> From: Lifeng Zheng <zhenglifeng1@xxxxxxxxxx>
>
> ACPI: processor_idle: Support polling state for LPI
>
> Initialize an optional polling state besides LPI states.
>
> Wrap up a new enter method to correctly reflect the actual entered state
> when the polling state is enabled.
>
> Signed-off-by: Lifeng Zheng <zhenglifeng1@xxxxxxxxxx>
> Reviewed-by: Jie Zhan <zhanjie9@xxxxxxxxxxxxx>
> ---
>  drivers/acpi/processor_idle.c | 39 ++++++++++++++++++++++++++++++-----
>  1 file changed, 34 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 44096406d65d..d154b5d77328 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -1194,20 +1194,46 @@ static int acpi_idle_lpi_enter(struct cpuidle_device *dev,
>  		return -EINVAL;
>  }
>
> +/* To correctly reflect the entered state if the poll state is enabled. */
> +static int acpi_idle_lpi_enter_with_poll_state(struct cpuidle_device *dev,
> +					       struct cpuidle_driver *drv, int index)
> +{
> +	int entered_state;
> +
> +	if (unlikely(index < 1))
> +		return -EINVAL;
> +
> +	entered_state = acpi_idle_lpi_enter(dev, drv, index - 1);
> +	if (entered_state < 0)
> +		return entered_state;
> +
> +	return entered_state + 1;
> +}
> +
>  static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
>  {
> -	int i;
> +	int i, count;
>  	struct acpi_lpi_state *lpi;
>  	struct cpuidle_state *state;
>  	struct cpuidle_driver *drv = &acpi_idle_driver;
> +	typeof(state->enter) enter_method;
>
>  	if (!pr->flags.has_lpi)
>  		return -EOPNOTSUPP;
>
> +	if (IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL)) {
> +		cpuidle_poll_state_init(drv);
> +		count = 1;
> +		enter_method = acpi_idle_lpi_enter_with_poll_state;
> +	} else {
> +		count = 0;
> +		enter_method = acpi_idle_lpi_enter;
> +	}
> +
>  	for (i = 0; i < pr->power.count && i < CPUIDLE_STATE_MAX; i++) {
>  		lpi = &pr->power.lpi_states[i];
>
> -		state = &drv->states[i];
> +		state = &drv->states[count];
>  		snprintf(state->name, CPUIDLE_NAME_LEN, "LPI-%d", i);
>  		strscpy(state->desc, lpi->desc, CPUIDLE_DESC_LEN);
>  		state->exit_latency = lpi->wake_latency;
> @@ -1215,11 +1241,14 @@ static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
>  		state->flags |= arch_get_idle_state_flags(lpi->arch_flags);
>  		if (i != 0 && lpi->entry_method == ACPI_CSTATE_FFH)
>  			state->flags |= CPUIDLE_FLAG_RCU_IDLE;
> -		state->enter = acpi_idle_lpi_enter;
> -		drv->safe_state_index = i;
> +		state->enter = enter_method;
> +		drv->safe_state_index = count;
> +		count++;
> +		if (count == CPUIDLE_STATE_MAX)
> +			break;
>  	}
>
> -	drv->state_count = i;
> +	drv->state_count = count;
>
>  	return 0;
>  }

--
ankur