Re: CPU excessively long times between frequency scaling driver calls - bisected

Hi All,

I am about halfway through testing Feng's "hacky debug patch";
let me know if I am wasting my time, and I'll abort. So far, it
works fine.

The compiler complains:

kernel/sched/idle.c: In function ‘do_idle’:
./include/linux/typecheck.h:12:18: warning: comparison of distinct
pointer types lacks a cast
   12 |  (void)(&__dummy == &__dummy2); \
      |                  ^~
./include/linux/compiler.h:78:42: note: in definition of macro ‘unlikely’
   78 | # define unlikely(x) __builtin_expect(!!(x), 0)
      |                                          ^
./include/linux/jiffies.h:106:3: note: in expansion of macro ‘typecheck’
  106 |   typecheck(unsigned long, b) && \
      |   ^~~~~~~~~
./include/linux/jiffies.h:154:35: note: in expansion of macro ‘time_after’
  154 | #define time_is_before_jiffies(a) time_after(jiffies, a)
      |                                   ^~~~~~~~~~
kernel/sched/idle.c:274:15: note: in expansion of macro ‘time_is_before_jiffies’
  274 |  if (unlikely(time_is_before_jiffies(expire))) {
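If I am reading the warning right, it is just the typecheck() inside
time_is_before_jiffies() objecting to "expire" being u64 while jiffies is
unsigned long. A guess at a silencing tweak (untested; the per-CPU variable
would need the same type change, and then plain jiffies suffices for the
write):

```c
/* Hypothetical fix sketch, not part of Feng's patch: use unsigned long
 * so the jiffies type-checking macros are satisfied.  The per-CPU
 * last_util_update_time would be declared unsigned long to match. */
unsigned long expire = __this_cpu_read(last_util_update_time) + HZ * 3;

if (unlikely(time_is_before_jiffies(expire))) {
        idle_update_util();
        __this_cpu_write(last_util_update_time, jiffies);
}
```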

Test 1: a 347 hertz work/sleep cycle on one CPU while the other
CPUs carry virtually no load (around 0.02 hertz aggregate).
Processor package power (Watts):

Kernel 5.17-rc3 + Feng patch (6 samples at 300 sec per):
Average: 3.2
Min: 3.1
Max: 3.3

Kernel 5.17-rc3 (Stock) re-stated:
> Stock: 5 samples @ 300 seconds per sample:
> average: 4.2 watts +31%
> minimum: 3.5 watts
> maximum: 4.9 watts

Kernel 5.17-rc3 with b50db7095fe0 reverted (Revert), re-stated:
> Revert: 5 samples @ 300 seconds per sample:
> average: 3.2 watts
> minimum: 3.1 watts
> maximum: 3.2 watts

I also ran intel_pstate_tracer.py for 5 hours 33 minutes
(20,000 seconds) on an idle system, looking for long durations
(just being thorough). There were 12 occurrences of durations
longer than 6.1 seconds.
Max: 6.8 seconds. (O.K.)
I had expected a maximum of about 3 seconds, based on my
understanding of the patch code.

Old results restated, but corrected for a stupid mistake:

Stock:
1712 occurrences of durations longer than 6.1 seconds
Max: 303.6 seconds. (Bad)

Revert:
3 occurrences of durations longer than 6.1 seconds
Max: 6.5 seconds (O.K.)

On Tue, Feb 22, 2022 at 10:04 AM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
>
> On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> >
> > On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
> > > Hi Doug,
> > >
> > > I think you use CONFIG_NO_HZ_FULL.
> > > Here we are getting callback from scheduler. Can we check that if
> > > scheduler woke up on those CPUs?
> > > We can run "trace-cmd -e sched" and check in kernel shark if there is
> > > similar gaps in activity.
> >
> > Srinivas analyzed the scheduler trace data from trace-cmd, and thinks it
> > is related to the cpufreq callback not being called in a timely way from
> > scheduling events:
> >
> > "
> > I mean we ignore the callback when the target CPU is not a local CPU as
> > we have to do IPI to adjust MSRs.
> > This will happen many times when sched_wake will wake up a new CPU for
> > the thread (we will get a callback for the target) but once the remote
> > thread starts executing "sched_switch", we will get a callback on the
> > local CPU, so we will adjust frequencies (provided a 10ms interval from
> > the last call).
> >
> > From the trace file I see the scenario where it took 72 sec between two
> > updates:
> > CPU 2
> > 34412.597161    busy=78         freq=3232653
> > 34484.450725    busy=63         freq=2606793
> >
> > There is periodic activity in between, related to active load balancing
> > in the scheduler (since the last frequency was higher, these small bits
> > of work will also run at the higher frequency). But those threads are
> > not CFS class, so the scheduler callback will not be called for them.
> >
> > So removing the patch removed a trigger which would have caused a
> > sched_switch to a CFS task and call a cpufreq/intel_pstate callback.
>
> And so this behavior needs to be restored for the time being which
> means reverting the problematic commit for 5.17 if possible.
>
> It is unlikely that we'll get a proper fix before -rc7 and we still
> need to test it properly.
>
> > But calling it for every class will be too many callbacks, and I'm not
> > sure we can even call it for the "stop" class, which these migration
> > threads are using.
> > "
>
> Calling it for RT/deadline may not be a bad idea.
>
> schedutil takes these classes into account when computing the
> utilization now (see effective_cpu_util()), so doing callbacks only
> for CFS seems insufficient.
>
> Another way to avoid the issue at hand may be to prevent entering deep
> idle via PM QoS if the CPUs are running at high frequencies.
>
> > Following this direction, I made a hacky debug patch which should help
> > to restore the previous behavior.
> >
> > Doug, could you help to try it? thanks
> >
> > It basically tries to make sure the cpufreq-update-util callback is
> > called in a timely way even on a silent system with very few interrupts
> > (even from the tick).
> >
> > Thanks,
> > Feng
> >
> > From 6be5f5da66a847860b0b9924fbb09f93b2e2d6e6 Mon Sep 17 00:00:00 2001
> > From: Feng Tang <feng.tang@xxxxxxxxx>
> > Date: Tue, 22 Feb 2022 22:59:00 +0800
> > Subject: [PATCH] idle/intel-pstate: hacky debug patch to make sure the
> >  cpufreq_update_util callback being called timely in silent system
> >
> > ---
> >  kernel/sched/idle.c  | 10 ++++++++++
> >  kernel/sched/sched.h | 13 +++++++++++++
> >  2 files changed, 23 insertions(+)
> >
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index d17b0a5ce6ac..cc538acb3f1a 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
> >   *
> >   * Called with polling cleared.
> >   */
> > +DEFINE_PER_CPU(u64, last_util_update_time);    /* in jiffies */
> >  static void do_idle(void)
> >  {
> >         int cpu = smp_processor_id();
> > +       u64 expire;
> >
> >         /*
> >          * Check if we need to update blocked load
> >          */
> >         nohz_run_idle_balance(cpu);
> >
> > +#ifdef CONFIG_X86_INTEL_PSTATE
>
> Why?  Doesn't this affect the other cpufreq governors?

Yes, I have the same question.

>
> > +       expire = __this_cpu_read(last_util_update_time) + HZ * 3;
> > +       if (unlikely(time_is_before_jiffies(expire))) {
> > +               idle_update_util();
> > +               __this_cpu_write(last_util_update_time, get_jiffies_64());
> > +       }
> > +#endif
> > +
> >         /*
> >          * If the arch has a polling bit, we maintain an invariant:
> >          *
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 0e66749486e7..2a8d87988d1f 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2809,6 +2809,19 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
> >         if (data)
> >                 data->func(data, rq_clock(rq), flags);
> >  }
> > +
> > +static inline void idle_update_util(void)
> > +{
> > +       struct update_util_data *data;
> > +       struct rq *rq = cpu_rq(raw_smp_processor_id());
> > +
> > +       data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
> > +                                                 cpu_of(rq)));
> > +       if (data)
> > +               data->func(data, rq_clock(rq), 0);
> > +}
> > +
> > +
> >  #else
> >  static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> >  #endif /* CONFIG_CPU_FREQ */
> > --



