On Tue, Mar 01, 2022 at 08:06:24PM -0800, Doug Smythies wrote:
> On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> >
> > I guess the numbers above could be reduced still by using a P-state
> > below the max non-turbo one as a limit.
>
> Yes, and for a test I did "rjw-3".
>
> > > overruns: 1042.
> > > max overrun time: 9,769 uSec.
> >
> > This would probably get worse then, though.
>
> Yes, that was my expectation, but not what happened.
>
> rjw-3:
> ave: 3.09 watts
> min: 3.01 watts
> max: 31.7 watts
> ave freq: 2.42 GHz.
> overruns: 12. (I did not expect this.)
> Max overrun time: 621 uSec.
>
> Note 1: IRQs increased by 74%, i.e. the CPU was going in
> and out of idle a lot more.
>
> Note 2: We know that processor package power
> is highly temperature dependent. I forgot to let my
> coolant cool adequately after the kernel compile,
> and so had to throw out the first 4 power samples
> (20 minutes).
>
> I retested both rjw-2 and rjw-3, but with shorter tests,
> and got 0 overruns in both cases.

One thought: could we consider trying the previous debug patch, which
calls the util update when entering idle (time limited)?

In the current code, the RT/CFS/Deadline classes all have places that
call cpufreq_update_util(); this patch makes sure it is called in all
four classes (including idle), and it follows schedutil's principle of
not introducing more system cost. Of course, I could be missing some
details here.

Following is a cleaner version of the patch. The code could also be
moved down into the inner

	while (!need_resched()) {
	}

loop, which would make it get called more frequently.

---
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d17b0a5ce6ac..e12688036725 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -258,15 +258,23 @@ static void cpuidle_idle_call(void)
  *
  * Called with polling cleared.
  */
+DEFINE_PER_CPU(u64, last_util_update_time); /* in jiffies */
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	u64 expire;
 
 	/*
 	 * Check if we need to update blocked load
 	 */
 	nohz_run_idle_balance(cpu);
 
+	expire = __this_cpu_read(last_util_update_time) + HZ * 3;
+	if (unlikely(time_is_before_jiffies((unsigned long)expire))) {
+		cpufreq_update_util(this_rq(), 0);
+		__this_cpu_write(last_util_update_time, get_jiffies_64());
+	}
+
 	/*
 	 * If the arch has a polling bit, we maintain an invariant:
 	 *

Thanks,
Feng

> > ATM I'm not quite sure why this happens, but you seem to have some
> > insight into it, so it would help if you shared it.
>
> My insight seems questionable.
>
> My thinking was that one cannot decide whether the pstate needs to go
> down or not based on such a localized look, the risk being that the
> higher periodic load might suffer overruns. Since my first test did
> exactly that, I violated my own "repeat all tests 3 times before
> reporting" rule. Now, I am not sure what is going on.
> I will need more time to acquire traces and dig into it.
>
> I also did a 1 hour intel_pstate_tracer test, with rjw-2, on an idle
> system and saw several long durations. This was expected, as this
> patch set wouldn't change durations by more than a few jiffies.
> 755 long durations (>6.1 seconds), and 327.7 seconds longest.
>
> ... Doug