[ Added Anna-Maria who is doing some timer work as well ] On Sun, 7 May 2023 11:07:00 +0200 Andrea Righi <andrea.righi@xxxxxxxxxxxxx> wrote: > Overview: > > nohz_full is a feature that allows to reduce the number of CPU tick > interrupts, thereby improving energy efficiency and reducing kernel > jitter. Hmm, I never thought of NOHZ_FULL used for energy efficiency, as the CPU is still running user space code, and there's really nothing inherently more power consuming with the tick. > > This works by stopping the tick interrupts on the CPUs that are either > idle or that have only one runnable task on them (there is no reason to > periodically interrupt the execution of a single running task if none > else is waiting to acquire the same CPU). > > It is not possible to configure all the available CPUs to work in the > nohz_full mode, at least one non-adaptive-tick CPU must be periodically > interrupted to properly handle timekeeping tasks in the system (such as > the gettimeofday() syscall returning accurate values). Do we really need nohz_full, instead, I think you want to look at what Anna-Maria is doing with moving the timer "manager" around to make sure that the tick stays on busy CPUs. Again, nohz_full is not for power consumption savings, but instead to reduce kernel interruption in user space. > > However, under certain conditions, we may want to relax this constraint, > accepting potential time inaccuracies in the system, in order to provide > additional benefits in terms of power consumption, performance and/or > reduce kernel jitter even more. > > For this reason introduce the new parameter nohz_full_aggressive. > > This option allows to enforce nozh_full across all the CPUs (even the > timekeeping CPU) at the cost of having potential timer inaccuracies in > the system. > > Test: > > - Hardware: Dell XPS 13 7390 w/ 8 cores > > - Kernel is using CONFIG_HZ=1000 (worst case scenario in terms of > power consumption and kernel jitter) and nohz_full=all > > - Measure interrupts and power consumption when the system is idle and > with 2, 4 and 8 cpu hogs > > Result: > > The following numbers have been collected using turbostat and dstat > measuring the average over a 5min run for each test. > > irqs/sec idle 1 CPU hog 2 CPU hogs 4 CPU hogs 8 CPU hogs > ------------------------------------------------------ > nohz_full 1036.679 1047.522 1046.203 1048.590 1074.867 > nohz_full_aggressive 98.685 106.296 127.587 146.586 1062.277 > > Power (Watt) idle 1 CPU hog 2 CPU hogs 4 CPU hogs 8 CPU hogs > ------------------------------------------------------ > nohz_full 0.502 W 3.436 W 3.755 W 6.187 W 6.019 W > nohz_full_aggressive 0.301 W 2.372 W 2.372 W 6.005 W 6.016 W > > % power reduction 40.04% 30.97% 36.83% 2.94% 0.05% > Nice. Now I doubt this is acceptable considering the side effects that the timer inaccuracy can cause. I think this breaks some basic assumptions in both the kernel and user space. Now, I think what is really happening here is that you are somewhat simulating the results that Anna-Maria has indirectly. That is, you just prevent an idle CPU from waking up to handle interrupts when not needed. Anna-Maria, Do you have some patches that Andrea could test with? Thanks, -- Steve > Conclusion: > > nohz_full_aggressive used together with nohz_full=all allows to save > some energy when the system is idle or under low CPU usage (e.g., when > less than half of the CPUs are used). > > Under high CPU load conditions power consumption is pretty much > identical to nohz_full=all because the impact of the saved power/irqs on > the timekeeping CPU doesn't contribute very much to the total energy > consumption. > > However, enabling nohz_full_aggressive can lead to timing inaccuracies > in the system, because periodic ticks can be disabled also on the > timekeeping CPU. > > Note: > > I wrote this patch while I was stuck in the airport, because my flight > was delayed and I was trying to optimize the battery usage of my laptop > in more creative ways. Ultimately I ended up wasting a lot more energy > to test this patch, but at least the long wait wasn't too boring. > > Signed-off-by: Andrea Righi <andrea.righi@xxxxxxxxxxxxx> > --- > .../ABI/testing/sysfs-devices-system-cpu | 12 ++++++++++++ > .../admin-guide/kernel-parameters.txt | 7 +++++++ > Documentation/timers/no_hz.rst | 5 +++++ > drivers/base/cpu.c | 19 +++++++++++++++++++ > include/linux/tick.h | 7 +++++++ > kernel/time/hrtimer.c | 7 ++++++- > kernel/time/tick-sched.c | 16 +++++++++++++--- > 7 files changed, 69 insertions(+), 4 deletions(-) > > diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu > index f54867cadb0f..aa620e154d54 100644 > --- a/Documentation/ABI/testing/sysfs-devices-system-cpu > +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu > @@ -679,6 +679,18 @@ Description: > (RO) the list of CPUs that are in nohz_full mode. > These CPUs are set by boot parameter "nohz_full=". > > +What: /sys/devices/system/cpu/nohz_full_aggressive > +Date: Apr 2023 > +Contact: Linux kernel mailing list <linux-kernel@xxxxxxxxxxxxxxx> > +Description: > + (RW) enable/disable nohz_full also for the timekeeping CPU. > + > + WARNING: enabling this option can cause potential > + high-resolution timer inaccuracies in the system. > + > + This option can be set by boot parameter > + "nohz_full_aggressive". > + > What: /sys/devices/system/cpu/isolated > Date: Apr 2015 > Contact: Linux kernel mailing list <linux-kernel@xxxxxxxxxxxxxxx> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index 9e5bab29685f..23c6fe20e067 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -3732,6 +3732,13 @@ > Note that this argument takes precedence over > the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option. > > + nohz_full_aggressive > + [KNL,BOOT,SMP,ISOL] allow to enable nohz_full also for > + the timekeeping CPU. > + > + WARNING: enabling this option can cause potential > + high-resolution timer inaccuracies in the system. > + > noinitrd [RAM] Tells the kernel not to load any configured > initial RAM disk. > > diff --git a/Documentation/timers/no_hz.rst b/Documentation/timers/no_hz.rst > index f8786be15183..aa9f79297d77 100644 > --- a/Documentation/timers/no_hz.rst > +++ b/Documentation/timers/no_hz.rst > @@ -136,6 +136,11 @@ error message, and the boot CPU will be removed from the mask. Note that > this means that your system must have at least two CPUs in order for > CONFIG_NO_HZ_FULL=y to do anything for you. > > +This constraint can be relaxed passing the parameter "nohz_full_aggressive". > +With this option enabled the timekeeping CPU can be also configured to use > +non-adaptive ticks, at the cost of having potential high-resolution timer > +inaccuracies and in the system. > + > Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded. > This is covered in the "RCU IMPLICATIONS" section below. > > diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c > index c1815b9dae68..b55d6111a733 100644 > --- a/drivers/base/cpu.c > +++ b/drivers/base/cpu.c > @@ -280,6 +280,24 @@ static ssize_t print_cpus_nohz_full(struct device *dev, > return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(tick_nohz_full_mask)); > } > static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL); > + > +static ssize_t > +nohz_full_aggressive_show(struct device *dev, struct device_attribute *attr, > + char *buf) > +{ > + return sysfs_emit(buf, "%d\n", tick_nohz_full_aggressive); > +} > + > +static ssize_t nohz_full_aggressive_store(struct device *dev, > + struct device_attribute *attr, > + const char *buf, size_t count) > +{ > + if (kstrtobool(buf, &tick_nohz_full_aggressive)) > + return -EINVAL; > + return count; > +} > + > +static DEVICE_ATTR_RW(nohz_full_aggressive); > #endif > > static void cpu_device_release(struct device *dev) > @@ -468,6 +486,7 @@ static struct attribute *cpu_root_attrs[] = { > &dev_attr_isolated.attr, > #ifdef CONFIG_NO_HZ_FULL > &dev_attr_nohz_full.attr, > + &dev_attr_nohz_full_aggressive.attr, > #endif > #ifdef CONFIG_GENERIC_CPU_AUTOPROBE > &dev_attr_modalias.attr, > diff --git a/include/linux/tick.h b/include/linux/tick.h > index 9459fef5b857..8d557838b3f6 100644 > --- a/include/linux/tick.h > +++ b/include/linux/tick.h > @@ -176,6 +176,7 @@ static inline void tick_nohz_idle_stop_tick_protected(void) { } > > #ifdef CONFIG_NO_HZ_FULL > extern bool tick_nohz_full_running; > +extern bool tick_nohz_full_aggressive; > extern cpumask_var_t tick_nohz_full_mask; > > static inline bool tick_nohz_full_enabled(void) > @@ -186,6 +187,11 @@ static inline bool tick_nohz_full_enabled(void) > return tick_nohz_full_running; > } > > +static inline bool tick_nohz_full_aggressive_enabled(void) > +{ > + return !!tick_nohz_full_aggressive; > +} > + > /* > * Check if a CPU is part of the nohz_full subset. Arrange for evaluating > * the cpu expression (typically smp_processor_id()) _after_ the static > @@ -276,6 +282,7 @@ extern void __tick_nohz_task_switch(void); > extern void __init tick_nohz_full_setup(cpumask_var_t cpumask); > #else > static inline bool tick_nohz_full_enabled(void) { return false; } > +static inline bool tick_nohz_full_aggressive_enabled(void) { return false; } > static inline bool tick_nohz_full_cpu(int cpu) { return false; } > static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { } > > diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c > index e8c08292defc..b3f27c6c8475 100644 > --- a/kernel/time/hrtimer.c > +++ b/kernel/time/hrtimer.c > @@ -1866,7 +1866,12 @@ void hrtimer_interrupt(struct clock_event_device *dev) > else > expires_next = ktime_add(now, delta); > tick_program_event(expires_next, 1); > - pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta)); > + /* > + * This is a "normal" condition when nohz_full_aggressive mode is > + * enabled, so avoid printing this warning in this case. > + */ > + if (!tick_nohz_full_aggressive_enabled()) > + pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta)); > } > > /* called with interrupts disabled */ > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > index 52254679ec48..8864066e4746 100644 > --- a/kernel/time/tick-sched.c > +++ b/kernel/time/tick-sched.c > @@ -188,7 +188,8 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now) > */ > if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)) { > #ifdef CONFIG_NO_HZ_FULL > - WARN_ON_ONCE(tick_nohz_full_running); > + if (!tick_nohz_full_aggressive_enabled()) > + WARN_ON_ONCE(tick_nohz_full_running); > #endif > tick_do_timer_cpu = cpu; > } > @@ -250,6 +251,8 @@ cpumask_var_t tick_nohz_full_mask; > EXPORT_SYMBOL_GPL(tick_nohz_full_mask); > bool tick_nohz_full_running; > EXPORT_SYMBOL_GPL(tick_nohz_full_running); > +bool tick_nohz_full_aggressive; > +EXPORT_SYMBOL_GPL(tick_nohz_full_aggressive); > static atomic_t tick_dep_mask; > > static bool check_tick_dependency(atomic_t *dep) > @@ -524,6 +527,13 @@ void __tick_nohz_task_switch(void) > } > } > > +static int __init tick_nohz_full_aggressive_setup(char *str) > +{ > + tick_nohz_full_aggressive = true; > + return 1; > +} > +__setup("nohz_full_aggressive", tick_nohz_full_aggressive_setup); > + > /* Get the boot-time nohz CPU list from the kernel parameters. */ > void __init tick_nohz_full_setup(cpumask_var_t cpumask) > { > @@ -854,7 +864,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu) > * Otherwise we can sleep as long as we want. > */ > delta = timekeeping_max_deferment(); > - if (cpu != tick_do_timer_cpu && > + if ((tick_nohz_full_aggressive_enabled() || cpu != tick_do_timer_cpu) && > (tick_do_timer_cpu != TICK_DO_TIMER_NONE || !ts->do_timer_last)) > delta = KTIME_MAX; > > @@ -1073,7 +1083,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts) > if (unlikely(report_idle_softirq())) > return false; > > - if (tick_nohz_full_enabled()) { > + if (tick_nohz_full_enabled() && !tick_nohz_full_aggressive_enabled()) { > /* > * Keep the tick alive to guarantee timekeeping progression > * if there are full dynticks CPUs around