On Fri, Mar 04, 2022 at 01:13:44PM +0800, Feng Tang wrote:
> On Thu, Mar 03, 2022 at 01:02:01PM +0100, Rafael J. Wysocki wrote:
> > On Thu, Mar 3, 2022 at 6:27 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> > >
> > > On Tue, Mar 01, 2022 at 08:06:24PM -0800, Doug Smythies wrote:
> > > > On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> > > > >
> > > > > I guess the numbers above could be reduced still by using a P-state
> > > > > below the max non-turbo one as a limit.
> > > >
> > > > Yes, and for a test I did "rjw-3".
> > > >
> > > > > > overruns: 1042.
> > > > > > max overrun time: 9,769 uSec.
> > > > >
> > > > > This would probably get worse then, though.
> > > >
> > > > Yes, that was my expectation, but not what happened.
> > > >
> > > > rjw-3:
> > > > ave: 3.09 watts
> > > > min: 3.01 watts
> > > > max: 31.7 watts
> > > > ave freq: 2.42 GHz.
> > > > overruns: 12. (I did not expect this.)
> > > > Max overrun time: 621 uSec.
> > > >
> > > > Note 1: IRQs increased by 74%, i.e. it was going in
> > > > and out of idle a lot more.
> > > >
> > > > Note 2: We know that processor package power
> > > > is highly temperature dependent. I forgot to let my
> > > > coolant cool adequately after the kernel compile,
> > > > and so had to throw out the first 4 power samples
> > > > (20 minutes).
> > > >
> > > > I retested both rjw-2 and rjw-3, but with shorter tests,
> > > > and got 0 overruns in both cases.
> > >
> > > One thought: could we consider trying the previous debug patch,
> > > which calls the util update when entering idle (time limited)?
> > >
> > > In the current code, the RT/CFS/Deadline classes all have places
> > > that call cpufreq_update_util(); the patch makes sure it is called
> > > in all four classes (idle included), and it also follows the
> > > 'schedutil' principle of not introducing more system cost. And
> > > surely I could be missing some details here.
> > >
> > > Following is a cleaner version of the patch. The code could also
> > > be moved down into the internal loop of
> > >
> > >	while (!need_resched()) {
> > >
> > >	}
> > >
> > > which would make it get called more frequently.
> >
> > It will, but it's not necessary in all cases. It only is necessary if
> > the tick is going to be stopped (because the tick will update the
> > P-state governor anyway if it runs). However, at this point you don't
> > know what's going to happen to the tick.
> >
> > Moreover, updating the P-state governor before going idle doesn't
> > really help,
>
> From Doug's previous test, both the power consumption and the delay
> improved with the debug patch.
>
> > because the P-state programmed by it at this point may
> > very well be stale after getting out of the idle state, so instead of
> > doing a full update at this point, it should force a low P-state on
> > the way from idle.
>
> Makes sense.
>
> Paul asked about the timer interrupts, so here is some more info: when
> there is no active load in the system, the longest interval of 4
> seconds between two timer interrupts comes from the kernel watchdog
> for the hardware/software lockup detector. Every CPU has this timer,
> which limits the maximum CPU idle duration to 4 seconds.

And thank you for the info!

One way to reduce this overhead is to boot with the watchdog_thresh
kernel-boot parameter set to some value greater than 10. The downside
is that HW/FW/NMI catastrophes will then need to last longer than that
threshold before the system complains.

One approach is to run with a small watchdog_thresh during testing and
a larger one in production.
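For concreteness, here is a rough userspace sketch (not the kernel code
itself) of how the per-CPU lockup-detector timer period follows from
watchdog_thresh, assuming the current kernel/watchdog.c relation of
roughly 2 * watchdog_thresh / 5 seconds per timer expiration:

#include <stdio.h>

#define NSEC_PER_SEC 1000000000ULL

/* Period of the per-CPU soft-lockup hrtimer, in nanoseconds. */
static unsigned long long sample_period_ns(unsigned int watchdog_thresh)
{
	/*
	 * The soft-lockup threshold is 2 * watchdog_thresh, and the
	 * timer fires 5 times per soft-lockup threshold interval.
	 */
	return 2ULL * watchdog_thresh * (NSEC_PER_SEC / 5);
}

int main(void)
{
	/*
	 * Default watchdog_thresh of 10 gives 4 s, matching the
	 * interval Feng measured; 30 would give 12 s, and so on.
	 */
	printf("thresh=10: %llu s\n", sample_period_ns(10) / NSEC_PER_SEC);
	printf("thresh=30: %llu s\n", sample_period_ns(30) / NSEC_PER_SEC);
	return 0;
}

So booting with, say, watchdog_thresh=30 (or raising the
kernel.watchdog_thresh sysctl at run time) stretches that per-CPU
timer interval accordingly.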
							Thanx, Paul

> > > ---
> > >
> > > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > > index d17b0a5ce6ac..e12688036725 100644
> > > --- a/kernel/sched/idle.c
> > > +++ b/kernel/sched/idle.c
> > > @@ -258,15 +258,23 @@ static void cpuidle_idle_call(void)
> > >   *
> > >   * Called with polling cleared.
> > >   */
> > > +DEFINE_PER_CPU(u64, last_util_update_time);	/* in jiffies */
> > >  static void do_idle(void)
> > >  {
> > >  	int cpu = smp_processor_id();
> > > +	u64 expire;
> > >
> > >  	/*
> > >  	 * Check if we need to update blocked load
> > >  	 */
> > >  	nohz_run_idle_balance(cpu);
> > >
> > > +	expire = __this_cpu_read(last_util_update_time) + HZ * 3;
> > > +	if (unlikely(time_is_before_jiffies((unsigned long)expire))) {
> > > +		cpufreq_update_util(this_rq(), 0);
> >
> > And quite frankly I'm not sure if running cpufreq_update_util() from
> > here is safe.
>
> I had that concern too :). Do you mean that this is called with local
> irqs enabled, and so could be interrupted, causing some issue?
>
> Thanks,
> Feng
>
> > > +		__this_cpu_write(last_util_update_time, get_jiffies_64());
> > > +	}
> > > +
> > >  	/*
> > >  	 * If the arch has a polling bit, we maintain an invariant:
> > >  	 *
> > >
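As a side note, here is a minimal userspace sketch (simplified, not the
kernel's actual jiffies macros) of the wraparound-safe comparison that
time_is_before_jiffies() builds on, which is what lets the debug patch
above rate-limit cpufreq_update_util() to at most roughly once every
3 * HZ ticks per CPU:

#include <stdio.h>

/*
 * Mirrors the kernel's time_after(a, b): true if a is later than b,
 * even if the tick counter has wrapped around in between.
 */
static int time_after_ul(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

int main(void)
{
	unsigned long hz = 250;			/* assume HZ == 250 */
	unsigned long jiffies = 100000;		/* pretend current tick count */
	unsigned long expire = jiffies + 3 * hz;	/* last update + 3 s */

	/* time_is_before_jiffies(expire) is time_after(jiffies, expire) */
	printf("%d\n", time_after_ul(jiffies + 1, expire));		/* 0: too soon */
	printf("%d\n", time_after_ul(jiffies + 4 * hz, expire));	/* 1: update due */
	return 0;
}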