On Wed, Feb 23, 2022 at 03:23:20PM +0100, Rafael J. Wysocki wrote:
> On Wed, Feb 23, 2022 at 1:40 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> >
> > On Tue, Feb 22, 2022 at 04:32:29PM -0800, srinivas pandruvada wrote:
> > > Hi Doug,
> > >
> > > On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
> > > > Hi All,
> > > >
> > > > I am about 1/2 way through testing Feng's "hacky debug patch",
> > > > let me know if I am wasting my time, and I'll abort. So far, it
> > > > works fine.
> > > This just proves that if you add some callback during long idle, you
> > > will reach a less aggressive p-state. I think you already proved that
> > > with your results below showing 1W less average power ("Kernel 5.17-rc3
> > > + Feng patch (6 samples at 300 sec per").
> > >
> > > Rafael replied with one possible option. Alternatively when planing to
> > > enter deep idle, set P-state to min with a callback like we do in
> > > offline callback.
> >
> > Yes, if the system is going to idle, it makes sense to goto a lower
> > cpufreq first (also what my debug patch will essentially lead to).
> >
> > Given cprfreq-util's normal running frequency is every 10ms, doing
> > this before entering idle is not a big extra burden.
>
> But this is not related to idle as such, but to the fact that idle
> sometimes stops the scheduler tick which otherwise would run the
> cpufreq governor callback on a regular basis.
>
> It is stopping the tick that gets us into trouble, so I would avoid
> doing it if the current performance state is too aggressive.

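To make the two points quoted above concrete, here is a rough user-space
toy model. It is not kernel code and does not reflect how intel_pstate or
the scheduler are implemented: the class, the function names, and the
frequency limits are all hypothetical. It only shows that once the tick
stops, nothing lowers a stale high frequency, unless one extra governor
callback runs on idle entry.

```python
# Toy user-space model, NOT kernel code. All names and values are hypothetical.

MIN_FREQ = 800    # MHz, assumed lower limit
MAX_FREQ = 4000   # MHz, assumed upper limit


class ToyCpu:
    def __init__(self, update_on_idle_entry):
        self.freq = MIN_FREQ
        self.update_on_idle_entry = update_on_idle_entry

    def governor_callback(self, util):
        """Stand-in for the governor hook: map utilization (0.0..1.0) to a frequency."""
        self.freq = int(MIN_FREQ + util * (MAX_FREQ - MIN_FREQ))

    def run_busy(self, ticks, util):
        # The tick is running, so the callback fires on every tick.
        for _ in range(ticks):
            self.governor_callback(util)

    def enter_long_idle(self):
        # Optional behaviour: one last callback before the tick stops.
        if self.update_on_idle_entry:
            self.governor_callback(0.0)
        # From here on the tick is stopped: no callbacks, self.freq stays as is.


def freq_during_idle(update_on_idle_entry):
    """Return the frequency the toy CPU is left at after going idle."""
    cpu = ToyCpu(update_on_idle_entry)
    cpu.run_busy(ticks=10, util=1.0)
    cpu.enter_long_idle()
    return cpu.freq


if __name__ == "__main__":
    print("no update on idle entry:", freq_during_idle(False), "MHz")
    print("update on idle entry:   ", freq_during_idle(True), "MHz")
```

Without the extra callback the toy CPU sits at the last busy frequency for
the whole idle period; with it, the frequency drops to the minimum first.
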
I've tried to simulate Doug's environment by using his kconfig and
offlining CPUs on my 36-CPU desktop to leave 12 CPUs online. On it I
can still see Local timer interrupts when there is no active load,
with the longest interval between two timer interrupts being 4 seconds.
The idle class's task_tick_idle() does nothing, while CFS's
task_tick_fair() will in turn call cfs_rq_util_change().

I searched the cfs/deadline/rt code, and these three classes all have
places that call cpufreq_update_util(), either in enqueue/dequeue or
when changing the running bandwidth. So I think entering idle also means
the system load is undergoing a big change, and it is worth calling the
cpufreq callback.

> In principle, PM QoS can be used for that from intel_pstate, but there
> is a problem with that approach, because it is not obvious what value
> to give to it and it is not always guaranteed to work (say when all of
> the C-states except for C1 are disabled).
>
> So it looks like a new mechanism is needed for that.

If you think the idle class is not the right place to solve it, I can
also help test new patches.

Thanks,
Feng