Re: CPU excessively long times between frequency scaling driver calls - bisected

"Rafael J. Wysocki" <rafael@xxxxxxxxxx> · Wed, 23 Feb 2022 15:11:37 +0100

On Wed, Feb 23, 2022 at 3:49 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
>
> Hi Rafael,
>
> On Tue, Feb 22, 2022 at 07:04:32PM +0100, Rafael J. Wysocki wrote:
> > On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> > >
> > > On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
> > > > Hi Doug,
> > > >
> > > > I think you use CONFIG_NO_HZ_FULL.
> > > > Here we are getting callback from scheduler. Can we check that if
> > > > scheduler woke up on those CPUs?
> > > > We can run "trace-cmd -e sched" and check in kernel shark if there is
> > > > similar gaps in activity.
> > >
> > > Srinivas analyzed the scheduler trace data from trace-cmd, and thought is
> > > related with the cpufreq callback is not called timeley from scheduling
> > > events:
> > >
> > > "
> > > I mean we ignore the callback when the target CPU is not a local CPU as
> > > we have to do IPI to adjust MSRs.
> > > This will happen many times when sched_wake will wake up a new CPU for
> > > the thread (we will get a callack for the target) but once the remote
> > > thread start executing "sched_switch", we will get a callback on local
> > > CPU, so we will adjust frequencies (provided 10ms interval from the
> > > last call).
> > >
> > > >From the trace file I see the scenario where it took 72sec between two
> > > updates:
> > > CPU 2
> > > 34412.597161    busy=78         freq=3232653
> > > 34484.450725    busy=63         freq=2606793
> > >
> > > There is periodic activity in between, related to active load balancing
> > > in scheduler (since last frequency was higher these small work will
> > > also run at higher frequency). But those threads are not CFS class, so
> > > scheduler callback will not be called for them.
> > >
> > > So removing the patch removed a trigger which would have caused a
> > > sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
> >
> > And so this behavior needs to be restored for the time being which
> > means reverting the problematic commit for 5.17 if possible.
> >
> > It is unlikely that we'll get a proper fix before -rc7 and we still
> > need to test it properly.
>
> Thanks for checking this!
>
> I understand the time limit as we are approaching the close of 5.17,
> but still I don't think reverting commit b50db7095fe0 is the best
> solution as:
>
> * b50db7095fe0 is not just an optimization, but solves some real
>   problems found in servers from big CSP (Cloud Service Provider)
>   and data center's server room.
>
> * IMHO, b50db7095fe0 itself hasn't done anything wrong.
>
> * This problem found by Doug is a rarely happened case, though
>   it is an expected thing as shown in existing comments of
>   cfs_rq_util_change():
>
>         /*
>          * There are a few boundary cases this might miss but it should
>          * get called often enough that that should (hopefully) not be
>          * a real problem.
>          *
>          * It will not get called when we go idle, because the idle
>          * thread is a different class (!fair), nor will the utilization
>          * number include things like RT tasks.
>          *
>          * As is, the util number is not freq-invariant (we'd have to
>          * implement arch_scale_freq_capacity() for that).
>          *
>          * See cpu_util_cfs().
>          */
>         cpufreq_update_util(rq, flags);
>
> As this happens with HWP-disabled case and a very calm system, can we
> find a proper solution in 5.17/5.18 or later, and cc stable?

We can.