Re: CPU excessively long times between frequency scaling driver calls - bisected

Feng Tang <feng.tang@xxxxxxxxx> · Wed, 23 Feb 2022 10:49:30 +0800

Hi Rafael,

On Tue, Feb 22, 2022 at 07:04:32PM +0100, Rafael J. Wysocki wrote:
> On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> >
> > On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
> > > Hi Doug,
> > >
> > > I think you use CONFIG_NO_HZ_FULL.
> > > Here we are getting callback from scheduler. Can we check that if
> > > scheduler woke up on those CPUs?
> > > We can run "trace-cmd -e sched" and check in kernel shark if there is
> > > similar gaps in activity.
> >
> > Srinivas analyzed the scheduler trace data from trace-cmd, and thought is
> > related with the cpufreq callback is not called timeley from scheduling
> > events:
> >
> > "
> > I mean we ignore the callback when the target CPU is not a local CPU as
> > we have to do IPI to adjust MSRs.
> > This will happen many times when sched_wake will wake up a new CPU for
> > the thread (we will get a callack for the target) but once the remote
> > thread start executing "sched_switch", we will get a callback on local
> > CPU, so we will adjust frequencies (provided 10ms interval from the
> > last call).
> >
> > >From the trace file I see the scenario where it took 72sec between two
> > updates:
> > CPU 2
> > 34412.597161    busy=78         freq=3232653
> > 34484.450725    busy=63         freq=2606793
> >
> > There is periodic activity in between, related to active load balancing
> > in scheduler (since last frequency was higher these small work will
> > also run at higher frequency). But those threads are not CFS class, so
> > scheduler callback will not be called for them.
> >
> > So removing the patch removed a trigger which would have caused a
> > sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
> 
> And so this behavior needs to be restored for the time being which
> means reverting the problematic commit for 5.17 if possible.
> 
> It is unlikely that we'll get a proper fix before -rc7 and we still
> need to test it properly.

Thanks for checking this! 

I understand the time limit as we are approaching the close of 5.17,
but still I don't think reverting commit b50db7095fe0 is the best
solution as:

* b50db7095fe0 is not just an optimization, but solves some real
  problems found in servers from big CSP (Cloud Service Provider)
  and data center's server room.

* IMHO, b50db7095fe0 itself hasn't done anything wrong.

* This problem found by Doug is a rarely happened case, though
  it is an expected thing as shown in existing comments of
  cfs_rq_util_change():

	/*
	 * There are a few boundary cases this might miss but it should
	 * get called often enough that that should (hopefully) not be
	 * a real problem.
	 *
	 * It will not get called when we go idle, because the idle
	 * thread is a different class (!fair), nor will the utilization
	 * number include things like RT tasks.
	 *
	 * As is, the util number is not freq-invariant (we'd have to
	 * implement arch_scale_freq_capacity() for that).
	 *
	 * See cpu_util_cfs().
	 */
	cpufreq_update_util(rq, flags);

As this happens with HWP-disabled case and a very calm system, can we
find a proper solution in 5.17/5.18 or later, and cc stable?

> > But calling for every class, will be too many callbacks and not sure we
> > can even call for "stop" class, which these migration threads are
> > using.
> > "
> 
> Calling it for RT/deadline may not be a bad idea.
> 
> schedutil takes these classes into account when computing the
> utilization now (see effective_cpu_util()), so doing callbacks only
> for CFS seems insufficient.
> 
> Another way to avoid the issue at hand may be to prevent entering deep
> idle via PM QoS if the CPUs are running at high frequencies.
> 
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
> >   *
> >   * Called with polling cleared.
> >   */
> > +DEFINE_PER_CPU(u64, last_util_update_time);    /* in jiffies */
> >  static void do_idle(void)
> >  {
> >         int cpu = smp_processor_id();
> > +       u64 expire;
> >
> >         /*
> >          * Check if we need to update blocked load
> >          */
> >         nohz_run_idle_balance(cpu);
> >
> > +#ifdef CONFIG_X86_INTEL_PSTATE
> 
> Why?  Doesn't this affect the other ccpufreq governors?

You are right, this should be a generic thing.

I tried to limit its affect to other cases, though it's not necessary
for a debug patch.

Thanks,
Feng