Hi Doug,

On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
> Hi All,
>
> I am about 1/2 way through testing Feng's "hacky debug patch",
> let me know if I am wasting my time, and I'll abort. So far, it
> works fine.

This just proves that if you add some callback during long idle, you
will reach a less aggressive P-state. I think you already proved that
with your results below, showing 1 W less average power ("Kernel
5.17-rc3 + Feng patch (6 samples at 300 sec per)").

Rafael replied with one possible option. Alternatively, when planning
to enter deep idle, set the P-state to the minimum with a callback,
like we do in the offline callback.

So we need to think about a proper solution for this.

Thanks,
Srinivas

>
> The compiler complains:
>
> kernel/sched/idle.c: In function ‘do_idle’:
> ./include/linux/typecheck.h:12:18: warning: comparison of distinct pointer types lacks a cast
>    12 |  (void)(&__dummy == &__dummy2); \
>       |                  ^~
> ./include/linux/compiler.h:78:42: note: in definition of macro ‘unlikely’
>    78 | # define unlikely(x) __builtin_expect(!!(x), 0)
>       |                                          ^
> ./include/linux/jiffies.h:106:3: note: in expansion of macro ‘typecheck’
>   106 |   typecheck(unsigned long, b) && \
>       |   ^~~~~~~~~
> ./include/linux/jiffies.h:154:35: note: in expansion of macro ‘time_after’
>   154 | #define time_is_before_jiffies(a) time_after(jiffies, a)
>       |                                   ^~~~~~~~~~
> kernel/sched/idle.c:274:15: note: in expansion of macro ‘time_is_before_jiffies’
>   274 |  if (unlikely(time_is_before_jiffies(expire))) {
>
> Test 1: 347 hertz work/sleep frequency on one CPU while others do
> virtually no load, but at around 0.02 hertz aggregate.
> Processor package power (Watts):
>
> Kernel 5.17-rc3 + Feng patch (6 samples at 300 sec per):
> Average: 3.2
> Min: 3.1
> Max: 3.3
>
> Kernel 5.17-rc3 (Stock) re-stated:
> > Stock: 5 samples @ 300 seconds per sample:
> > average: 4.2 watts +31%
> > minimum: 3.5 watts
> > maximum: 4.9 watts
>
> Kernel 5.17-rc3 with b50db7095fe0 reverted.
> (Revert) re-stated:
> > Revert: 5 samples @ 300 seconds per sample:
> > average: 3.2 watts
> > minimum: 3.1 watts
> > maximum: 3.2 watts
>
> I also ran intel_pstate_tracer.py for 5 hours 33 minutes
> (20,000 seconds) on an idle system looking for long durations
> (just being thorough). There were 12 occurrences of durations
> longer than 6.1 seconds.
> Max: 6.8 seconds. (O.K.)
> I had expected about 3 seconds max, based on my
> understanding of the patch code.
>
> Old results re-stated, but corrected for a stupid mistake:
>
> Stock:
> 1712 occurrences of durations longer than 6.1 seconds
> Max: 303.6 seconds. (Bad)
>
> Revert:
> 3 occurrences of durations longer than 6.1 seconds
> Max: 6.5 seconds (O.K.)
>
> On Tue, Feb 22, 2022 at 10:04 AM Rafael J. Wysocki
> <rafael@xxxxxxxxxx> wrote:
> >
> > On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> > >
> > > On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote:
> > > > Hi Doug,
> > > >
> > > > I think you use CONFIG_NO_HZ_FULL.
> > > > Here we are getting the callback from the scheduler. Can we check
> > > > whether the scheduler woke up on those CPUs?
> > > > We can run "trace-cmd -e sched" and check in KernelShark if there
> > > > are similar gaps in activity.
> > >
> > > Srinivas analyzed the scheduler trace data from trace-cmd, and
> > > thought it is related to the cpufreq callback not being called
> > > timely from scheduling events:
> > >
> > > "
> > > I mean we ignore the callback when the target CPU is not a local
> > > CPU, as we have to do an IPI to adjust MSRs.
> > > This will happen many times when sched_wake wakes up a new CPU for
> > > the thread (we will get a callback for the target), but once the
> > > remote thread starts executing "sched_switch", we will get a
> > > callback on the local CPU, so we will adjust frequencies (provided
> > > a 10 ms interval from the last call).
> > >
> > > From the trace file I see a scenario where it took 72 sec between
> > > two updates:
> > > CPU 2
> > > 34412.597161    busy=78    freq=3232653
> > > 34484.450725    busy=63    freq=2606793
> > >
> > > There is periodic activity in between, related to active load
> > > balancing in the scheduler (since the last frequency was higher,
> > > these small pieces of work will also run at the higher frequency).
> > > But those threads are not CFS class, so the scheduler callback will
> > > not be called for them.
> > >
> > > So removing the patch removed a trigger which would have caused a
> > > sched_switch to a CFS task and a call to a cpufreq/intel_pstate
> > > callback.
> >
> > And so this behavior needs to be restored for the time being, which
> > means reverting the problematic commit for 5.17 if possible.
> >
> > It is unlikely that we'll get a proper fix before -rc7, and we still
> > need to test it properly.
> >
> > > But calling it for every class will be too many callbacks, and I am
> > > not sure we can even call it for the "stop" class, which these
> > > migration threads are using.
> > > "
> >
> > Calling it for RT/deadline may not be a bad idea.
> >
> > schedutil takes these classes into account when computing the
> > utilization now (see effective_cpu_util()), so doing callbacks only
> > for CFS seems insufficient.
> >
> > Another way to avoid the issue at hand may be to prevent entering
> > deep idle via PM QoS if the CPUs are running at high frequencies.
> >
> > > Following this direction, I made a hacky debug patch which should
> > > help to restore the previous behavior.
> > >
> > > Doug, could you help to try it? Thanks.
> > >
> > > It basically tries to make sure the cpufreq-update-util callback is
> > > called timely even on a silent system with very few interrupts
> > > (even from the tick).
> > >
> > > Thanks,
> > > Feng
> > >
> > > From 6be5f5da66a847860b0b9924fbb09f93b2e2d6e6 Mon Sep 17 00:00:00 2001
> > > From: Feng Tang <feng.tang@xxxxxxxxx>
> > > Date: Tue, 22 Feb 2022 22:59:00 +0800
> > > Subject: [PATCH] idle/intel-pstate: hacky debug patch to make sure the
> > >  cpufreq_update_util callback being called timely in silent system
> > >
> > > ---
> > >  kernel/sched/idle.c  | 10 ++++++++++
> > >  kernel/sched/sched.h | 13 +++++++++++++
> > >  2 files changed, 23 insertions(+)
> > >
> > > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > > index d17b0a5ce6ac..cc538acb3f1a 100644
> > > --- a/kernel/sched/idle.c
> > > +++ b/kernel/sched/idle.c
> > > @@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
> > >   *
> > >   * Called with polling cleared.
> > >   */
> > > +DEFINE_PER_CPU(u64, last_util_update_time);	/* in jiffies */
> > >  static void do_idle(void)
> > >  {
> > >  	int cpu = smp_processor_id();
> > > +	u64 expire;
> > >
> > >  	/*
> > >  	 * Check if we need to update blocked load
> > >  	 */
> > >  	nohz_run_idle_balance(cpu);
> > >
> > > +#ifdef CONFIG_X86_INTEL_PSTATE
> >
> > Why? Doesn't this affect the other cpufreq governors?
>
> Yes, I have the same question.
> > >
> > > +	expire = __this_cpu_read(last_util_update_time) + HZ * 3;
> > > +	if (unlikely(time_is_before_jiffies(expire))) {
> > > +		idle_update_util();
> > > +		__this_cpu_write(last_util_update_time, get_jiffies_64());
> > > +	}
> > > +#endif
> > > +
> > >  	/*
> > >  	 * If the arch has a polling bit, we maintain an invariant:
> > >  	 *
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index 0e66749486e7..2a8d87988d1f 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -2809,6 +2809,19 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
> > >  	if (data)
> > >  		data->func(data, rq_clock(rq), flags);
> > >  }
> > > +
> > > +static inline void idle_update_util(void)
> > > +{
> > > +	struct update_util_data *data;
> > > +	struct rq *rq = cpu_rq(raw_smp_processor_id());
> > > +
> > > +	data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
> > > +						  cpu_of(rq)));
> > > +	if (data)
> > > +		data->func(data, rq_clock(rq), 0);
> > > +}
> > > +
> > > +
> > >  #else
> > >  static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> > >  #endif /* CONFIG_CPU_FREQ */
> > > --