On Wed, Feb 23, 2022 at 3:49 AM Feng Tang <feng.tang@xxxxxxxxx> wrote: > > Hi Rafael, > > On Tue, Feb 22, 2022 at 07:04:32PM +0100, Rafael J. Wysocki wrote: > > On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@xxxxxxxxx> wrote: > > > > > > On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada wrote: > > > > Hi Doug, > > > > > > > > I think you use CONFIG_NO_HZ_FULL. > > > > Here we are getting callback from scheduler. Can we check that if > > > > scheduler woke up on those CPUs? > > > > We can run "trace-cmd -e sched" and check in kernel shark if there is > > > > similar gaps in activity. > > > > > > Srinivas analyzed the scheduler trace data from trace-cmd, and thought is > > > related with the cpufreq callback is not called timeley from scheduling > > > events: > > > > > > " > > > I mean we ignore the callback when the target CPU is not a local CPU as > > > we have to do IPI to adjust MSRs. > > > This will happen many times when sched_wake will wake up a new CPU for > > > the thread (we will get a callack for the target) but once the remote > > > thread start executing "sched_switch", we will get a callback on local > > > CPU, so we will adjust frequencies (provided 10ms interval from the > > > last call). > > > > > > >From the trace file I see the scenario where it took 72sec between two > > > updates: > > > CPU 2 > > > 34412.597161 busy=78 freq=3232653 > > > 34484.450725 busy=63 freq=2606793 > > > > > > There is periodic activity in between, related to active load balancing > > > in scheduler (since last frequency was higher these small work will > > > also run at higher frequency). But those threads are not CFS class, so > > > scheduler callback will not be called for them. > > > > > > So removing the patch removed a trigger which would have caused a > > > sched_switch to a CFS task and call a cpufreq/intel_pstate callback. > > > > And so this behavior needs to be restored for the time being which > > means reverting the problematic commit for 5.17 if possible. > > > > It is unlikely that we'll get a proper fix before -rc7 and we still > > need to test it properly. > > Thanks for checking this! > > I understand the time limit as we are approaching the close of 5.17, > but still I don't think reverting commit b50db7095fe0 is the best > solution as: > > * b50db7095fe0 is not just an optimization, but solves some real > problems found in servers from big CSP (Cloud Service Provider) > and data center's server room. > > * IMHO, b50db7095fe0 itself hasn't done anything wrong. > > * This problem found by Doug is a rarely happened case, though > it is an expected thing as shown in existing comments of > cfs_rq_util_change(): > > /* > * There are a few boundary cases this might miss but it should > * get called often enough that that should (hopefully) not be > * a real problem. > * > * It will not get called when we go idle, because the idle > * thread is a different class (!fair), nor will the utilization > * number include things like RT tasks. > * > * As is, the util number is not freq-invariant (we'd have to > * implement arch_scale_freq_capacity() for that). > * > * See cpu_util_cfs(). > */ > cpufreq_update_util(rq, flags); > > As this happens with HWP-disabled case and a very calm system, can we > find a proper solution in 5.17/5.18 or later, and cc stable? We can.