Hi Rafael, On Tue, Feb 22, 2022 at 07:04:32PM +0100, Rafael J. Wysocki wrote: > On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@xxxxxxxxx> wrote: > > > > On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote: > > > Hi Doug, > > > > > > I think you use CONFIG_NO_HZ_FULL. > > > Here we are getting callback from scheduler. Can we check that if > > > scheduler woke up on those CPUs? > > > We can run "trace-cmd -e sched" and check in kernel shark if there is > > > similar gaps in activity. > > > > Srinivas analyzed the scheduler trace data from trace-cmd, and thought is > > related with the cpufreq callback is not called timeley from scheduling > > events: > > > > " > > I mean we ignore the callback when the target CPU is not a local CPU as > > we have to do IPI to adjust MSRs. > > This will happen many times when sched_wake will wake up a new CPU for > > the thread (we will get a callack for the target) but once the remote > > thread start executing "sched_switch", we will get a callback on local > > CPU, so we will adjust frequencies (provided 10ms interval from the > > last call). > > > > >From the trace file I see the scenario where it took 72sec between two > > updates: > > CPU 2 > > 34412.597161 busy=78 freq=3232653 > > 34484.450725 busy=63 freq=2606793 > > > > There is periodic activity in between, related to active load balancing > > in scheduler (since last frequency was higher these small work will > > also run at higher frequency). But those threads are not CFS class, so > > scheduler callback will not be called for them. > > > > So removing the patch removed a trigger which would have caused a > > sched_switch to a CFS task and call a cpufreq/intel_pstate callback. > > And so this behavior needs to be restored for the time being which > means reverting the problematic commit for 5.17 if possible. > > It is unlikely that we'll get a proper fix before -rc7 and we still > need to test it properly. Thanks for checking this! I understand the time limit as we are approaching the close of 5.17, but still I don't think reverting commit b50db7095fe0 is the best solution as: * b50db7095fe0 is not just an optimization, but solves some real problems found in servers from big CSP (Cloud Service Provider) and data center's server room. * IMHO, b50db7095fe0 itself hasn't done anything wrong. * This problem found by Doug is a rarely happened case, though it is an expected thing as shown in existing comments of cfs_rq_util_change(): /* * There are a few boundary cases this might miss but it should * get called often enough that that should (hopefully) not be * a real problem. * * It will not get called when we go idle, because the idle * thread is a different class (!fair), nor will the utilization * number include things like RT tasks. * * As is, the util number is not freq-invariant (we'd have to * implement arch_scale_freq_capacity() for that). * * See cpu_util_cfs(). */ cpufreq_update_util(rq, flags); As this happens with HWP-disabled case and a very calm system, can we find a proper solution in 5.17/5.18 or later, and cc stable? > > But calling for every class, will be too many callbacks and not sure we > > can even call for "stop" class, which these migration threads are > > using. > > " > > Calling it for RT/deadline may not be a bad idea. > > schedutil takes these classes into account when computing the > utilization now (see effective_cpu_util()), so doing callbacks only > for CFS seems insufficient. > > Another way to avoid the issue at hand may be to prevent entering deep > idle via PM QoS if the CPUs are running at high frequencies. > > > --- a/kernel/sched/idle.c > > +++ b/kernel/sched/idle.c > > @@ -258,15 +258,25 @@ static void cpuidle_idle_call(void) > > * > > * Called with polling cleared. > > */ > > +DEFINE_PER_CPU(u64, last_util_update_time); /* in jiffies */ > > static void do_idle(void) > > { > > int cpu = smp_processor_id(); > > + u64 expire; > > > > /* > > * Check if we need to update blocked load > > */ > > nohz_run_idle_balance(cpu); > > > > +#ifdef CONFIG_X86_INTEL_PSTATE > > Why? Doesn't this affect the other ccpufreq governors? You are right, this should be a generic thing. I tried to limit its affect to other cases, though it's not necessary for a debug patch. Thanks, Feng