Re: CPU excessively long times between frequency scaling driver calls - bisected

On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
>
> On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
> > Hi Doug,
> >
> > I think you use CONFIG_NO_HZ_FULL.
> > Here we are getting a callback from the scheduler. Can we check
> > whether the scheduler woke up on those CPUs?
> > We can run "trace-cmd record -e sched" and check in KernelShark if
> > there are similar gaps in activity.
>
> Srinivas analyzed the scheduler trace data from trace-cmd, and thinks it
> is related to the cpufreq callback not being called timely from
> scheduling events:
>
> "
> I mean we ignore the callback when the target CPU is not a local CPU as
> we have to do IPI to adjust MSRs.
> This will happen many times when sched_wakeup wakes up a new CPU for
> the thread (we will get a callback for the target), but once the remote
> thread starts executing "sched_switch", we will get a callback on the
> local CPU, so we will adjust frequencies (provided a 10 ms interval has
> passed since the last call).
>
> From the trace file I see the scenario where it took 72 sec between two
> updates:
> CPU 2
> 34412.597161    busy=78         freq=3232653
> 34484.450725    busy=63         freq=2606793
>
> There is periodic activity in between, related to active load balancing
> in the scheduler (since the last frequency was higher, these small bits
> of work will also run at the higher frequency). But those threads are
> not in the CFS class, so the scheduler callback will not be called for
> them.
>
> So removing the patch removed a trigger which would have caused a
> sched_switch to a CFS task and a call to a cpufreq/intel_pstate
> callback.

And so this behavior needs to be restored for the time being, which
means reverting the problematic commit for 5.17 if possible.

It is unlikely that we'll get a proper fix before -rc7, and we still
need to test it properly.

> But calling it for every class will mean too many callbacks, and I'm
> not sure we can even call it for the "stop" class, which these
> migration threads are using.
> "

Calling it for RT/deadline may not be a bad idea.

schedutil takes these classes into account when computing the
utilization now (see effective_cpu_util()), so doing callbacks only
for CFS seems insufficient.
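
Roughly, the accounting there looks like this (a simplified sketch of
effective_cpu_util() in kernel/sched/core.c; the scale-invariance and
clamping details are omitted):

static unsigned long effective_cpu_util_sketch(struct rq *rq,
					       unsigned long util_cfs,
					       unsigned long max)
{
	unsigned long util;

	/*
	 * RT and deadline bandwidth feed the frequency request too,
	 * not just CFS utilization.
	 */
	util = util_cfs + cpu_util_rt(rq);
	util += cpu_util_dl(rq);

	/*
	 * IRQ time is accounted on top (the real code also scales the
	 * task-class contributions by the non-IRQ capacity).
	 */
	util += cpu_util_irq(rq);

	return min(util, max);
}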

Another way to avoid the issue at hand may be to prevent entering deep
idle via PM QoS if the CPUs are running at high frequencies.
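
Something along these lines, using the CPU latency QoS interface (a
hypothetical sketch; the 20 us threshold and the trigger policy are
made up here):

#include <linux/pm_qos.h>

static struct pm_qos_request high_freq_idle_req;

/*
 * While the CPUs run at high frequencies, cap the allowed idle-state
 * exit latency so deep idle states are not entered.
 */
static void block_deep_idle(void)
{
	cpu_latency_qos_add_request(&high_freq_idle_req, 20);
}

static void allow_deep_idle(void)
{
	cpu_latency_qos_remove_request(&high_freq_idle_req);
}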

> Following this direction, I made a hacky debug patch which should help
> to restore the previous behavior.
>
> Doug, could you help to try it? thanks
>
> It basically tries to make sure the cpufreq update-util callback is
> called timely even on a silent system with very few interrupts (even
> from the tick).
>
> Thanks,
> Feng
>
> From 6be5f5da66a847860b0b9924fbb09f93b2e2d6e6 Mon Sep 17 00:00:00 2001
> From: Feng Tang <feng.tang@xxxxxxxxx>
> Date: Tue, 22 Feb 2022 22:59:00 +0800
> Subject: [PATCH] idle/intel-pstate: hacky debug patch to make sure the
>  cpufreq_update_util callback being called timely in silent system
>
> ---
>  kernel/sched/idle.c  | 10 ++++++++++
>  kernel/sched/sched.h | 13 +++++++++++++
>  2 files changed, 23 insertions(+)
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index d17b0a5ce6ac..cc538acb3f1a 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
>   *
>   * Called with polling cleared.
>   */
> +DEFINE_PER_CPU(u64, last_util_update_time);    /* in jiffies */
>  static void do_idle(void)
>  {
>         int cpu = smp_processor_id();
> +       u64 expire;
>
>         /*
>          * Check if we need to update blocked load
>          */
>         nohz_run_idle_balance(cpu);
>
> +#ifdef CONFIG_X86_INTEL_PSTATE

Why?  Doesn't this affect the other cpufreq governors?

> +       expire = __this_cpu_read(last_util_update_time) + HZ * 3;
> +       if (unlikely(time_is_before_jiffies(expire))) {
> +               idle_update_util();
> +               __this_cpu_write(last_util_update_time, get_jiffies_64());
> +       }
> +#endif
> +
>         /*
>          * If the arch has a polling bit, we maintain an invariant:
>          *
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 0e66749486e7..2a8d87988d1f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2809,6 +2809,19 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
>         if (data)
>                 data->func(data, rq_clock(rq), flags);
>  }
> +
> +static inline void idle_update_util(void)
> +{
> +       struct update_util_data *data;
> +       struct rq *rq = cpu_rq(raw_smp_processor_id());
> +
> +       data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
> +                                                 cpu_of(rq)));
> +       if (data)
> +               data->func(data, rq_clock(rq), 0);
> +}
> +
> +
>  #else
>  static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>  #endif /* CONFIG_CPU_FREQ */
> --


