On Fri, Mar 04, 2022 at 01:13:44PM +0800, Feng Tang wrote:
> On Thu, Mar 03, 2022 at 01:02:01PM +0100, Rafael J. Wysocki wrote:
> > On Thu, Mar 3, 2022 at 6:27 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> > >
> > > On Tue, Mar 01, 2022 at 08:06:24PM -0800, Doug Smythies wrote:
> > > > On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> > > > >
> > > > > I guess the numbers above could be reduced still by using a P-state
> > > > > below the max non-turbo one as a limit.
> > > >
> > > > Yes, and for a test I did "rjw-3".
> > > >
> > > > > > overruns: 1042.
> > > > > > max overrun time: 9,769 uSec.
> > > > >
> > > > > This would probably get worse then, though.
> > > >
> > > > Yes, that was my expectation, but not what happened.
> > > >
> > > > rjw-3:
> > > > ave: 3.09 watts
> > > > min: 3.01 watts
> > > > max: 31.7 watts
> > > > ave freq: 2.42 GHz.
> > > > overruns: 12. (I did not expect this.)
> > > > Max overrun time: 621 uSec.
> > > >
> > > > Note 1: IRQs increased by 74%, i.e. it was going in
> > > > and out of idle a lot more.
> > > >
> > > > Note 2: We know that processor package power
> > > > is highly temperature dependent. I forgot to let my
> > > > coolant cool adequately after the kernel compile,
> > > > and so had to throw out the first 4 power samples
> > > > (20 minutes).
> > > >
> > > > I retested both rjw-2 and rjw-3, but with shorter tests,
> > > > and got 0 overruns in both cases.
> > >
> > > One thought: could we consider trying the previous debug patch,
> > > which calls the util update when entering idle (time limited)?
> > >
> > > In the current code, the RT/CFS/Deadline classes all have places
> > > that call cpufreq_update_util(); the patch makes sure it is called
> > > in all four classes (idle included), and it also follows the
> > > 'schedutil' principle of not introducing more system cost. And
> > > surely I could be missing some details here.
> > >
> > > Following is a cleaner version of the patch. The code could also
> > > be moved down into the internal loop of
> > >
> > >	while (!need_resched()) {
> > >
> > >	}
> > >
> > > which would make it get called more frequently.
> >
> > It will, but it's not necessary in all cases. It only is necessary if
> > the tick is going to be stopped (because the tick will update the
> > P-state governor anyway if it runs). However, at this point you don't
> > know what's going to happen to the tick.
> >
> > Moreover, updating the P-state governor before going idle doesn't
> > really help,
>
> From Doug's previous test, both the power consumption and the delay
> improved with the debug patch.
>
> > because the P-state programmed by it at this point may
> > very well be stale after getting out of the idle state, so instead of
> > doing a full update at this point, it should force a low P-state on
> > the way from idle.
>
> Makes sense.
>
> Paul asked about the timer interrupts, so here is some more info: when
> there is no active load in the system, the longest interval of 4
> seconds between two timer interrupts comes from the kernel watchdog
> for the hardware/software lockup detector. Every CPU has this timer,
> which limits the maximum CPU idle duration to 4 seconds.

And thank you for the info!

One way to reduce this overhead is to boot with the watchdog_thresh
kernel-boot parameter set to some value greater than 10. The downside
is that HW/FW/NMI catastrophes will then need to last longer than that
threshold before the system complains.

One approach is to run with a small watchdog_thresh during testing and
a larger one in production.
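For concreteness, here is a rough userspace sketch (not the kernel code
itself) of how the per-CPU lockup-detector timer period follows from
watchdog_thresh, assuming the current kernel/watchdog.c relation of
roughly 2 * watchdog_thresh / 5 seconds per timer expiration:

#include <stdio.h>

#define NSEC_PER_SEC 1000000000ULL

/* Period of the per-CPU soft-lockup hrtimer, in nanoseconds. */
static unsigned long long sample_period_ns(unsigned int watchdog_thresh)
{
	/*
	 * The soft-lockup threshold is 2 * watchdog_thresh, and the
	 * timer fires 5 times per soft-lockup threshold interval.
	 */
	return 2ULL * watchdog_thresh * (NSEC_PER_SEC / 5);
}

int main(void)
{
	/*
	 * Default watchdog_thresh of 10 gives 4 s, matching the
	 * interval Feng measured; 30 would give 12 s, and so on.
	 */
	printf("thresh=10: %llu s\n", sample_period_ns(10) / NSEC_PER_SEC);
	printf("thresh=30: %llu s\n", sample_period_ns(30) / NSEC_PER_SEC);
	return 0;
}

So booting with, say, watchdog_thresh=30 (or raising the
kernel.watchdog_thresh sysctl at run time) stretches that per-CPU
timer interval accordingly.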
							Thanx, Paul

> > > ---
> > >
> > > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > > index d17b0a5ce6ac..e12688036725 100644
> > > --- a/kernel/sched/idle.c
> > > +++ b/kernel/sched/idle.c
> > > @@ -258,15 +258,23 @@ static void cpuidle_idle_call(void)
> > >   *
> > >   * Called with polling cleared.
> > >   */
> > > +DEFINE_PER_CPU(u64, last_util_update_time);	/* in jiffies */
> > >  static void do_idle(void)
> > >  {
> > >  	int cpu = smp_processor_id();
> > > +	u64 expire;
> > >
> > >  	/*
> > >  	 * Check if we need to update blocked load
> > >  	 */
> > >  	nohz_run_idle_balance(cpu);
> > >
> > > +	expire = __this_cpu_read(last_util_update_time) + HZ * 3;
> > > +	if (unlikely(time_is_before_jiffies((unsigned long)expire))) {
> > > +		cpufreq_update_util(this_rq(), 0);
> >
> > And quite frankly I'm not sure if running cpufreq_update_util() from
> > here is safe.
>
> I had that concern too :). Do you mean that this is called with local
> irqs enabled, and so could be interrupted, causing some issue?
>
> Thanks,
> Feng
>
> > > +		__this_cpu_write(last_util_update_time, get_jiffies_64());
> > > +	}
> > > +
> > >  	/*
> > >  	 * If the arch has a polling bit, we maintain an invariant:
> > >  	 *
> > >
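As a side note, here is a minimal userspace sketch (simplified, not the
kernel's actual jiffies macros) of the wraparound-safe comparison that
time_is_before_jiffies() builds on, which is what lets the debug patch
above rate-limit cpufreq_update_util() to at most roughly once every
3 * HZ ticks per CPU:

#include <stdio.h>

/*
 * Mirrors the kernel's time_after(a, b): true if a is later than b,
 * even if the tick counter has wrapped around in between.
 */
static int time_after_ul(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

int main(void)
{
	unsigned long hz = 250;			/* assume HZ == 250 */
	unsigned long jiffies = 100000;		/* pretend current tick count */
	unsigned long expire = jiffies + 3 * hz;	/* last update + 3 s */

	/* time_is_before_jiffies(expire) is time_after(jiffies, expire) */
	printf("%d\n", time_after_ul(jiffies + 1, expire));		/* 0: too soon */
	printf("%d\n", time_after_ul(jiffies + 4 * hz, expire));	/* 1: update due */
	return 0;
}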