On Tue, Feb 8, 2022 at 1:15 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
> > > > Since kernel 5.16-rc4 and commit: b50db7095fe002fa3e16605546cba66bf1b68a3e
> > > > "x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"
> > > >
> > > > There are now occasions where times between calls to the driver can be
> > > > over 100's of seconds and can result in the CPU frequency being left
> > > > unnecessarily high for extended periods.
> > > >
> > > > From the number of clock cycles executed between these long
> > > > durations one can tell that the CPU has been running code, but
> > > > the driver never got called.
> > > >
> > > > Attached are some graphs from some trace data acquired using
> > > > intel_pstate_tracer.py where one can observe an idle system between
> > > > about 42 and well over 200 seconds elapsed time, yet CPU10 never gets
> > > > called, which would have resulted in reducing its pstate request, until
> > > > an elapsed time of 167.616 seconds, 126 seconds since the last call. The
> > > > CPU frequency never does go to minimum.
> > > >
> > > > For reference, a similar CPU frequency graph is also attached, with
> > > > the commit reverted. The CPU frequency drops to minimum over
> > > > about 10 or 15 seconds.
> > >
> > > commit b50db7095fe0 essentially disables the clocksource watchdog,
> > > which literally doesn't have much to do with cpufreq code.
> > >
> > > One thing I can think of is, without the patch, there is a periodic
> > > clocksource timer running every 500 ms, and it loops to run on
> > > all CPUs in turn. For your HW, it has 12 CPUs (from the graph),
> > > so each CPU will get a timer (HW timer interrupt backed) every 6
> > > seconds. Could this affect the cpufreq governor's work flow? (I just
> > > quickly read some cpufreq code, and it seems there is irq_work/workqueue
> > > involved.)
> >
> > 6 seconds is the longest duration I have ever seen on this
> > processor before commit b50db7095fe0.
> >
> > I said "the times between calls to the driver have never
> > exceeded 10 seconds" originally, but that involved other processors.
> >
> > I also did longer, 9000 second tests:
> >
> > For a reverted kernel the driver was called 131,743 times,
> > and 0 times the duration was longer than 6.1 seconds.
> >
> > For a non-reverted kernel the driver was called 110,241 times,
> > and 1397 times the duration was longer than 6.1 seconds,
> > and the maximum duration was 303.6 seconds.
>
> Thanks for the data, which shows it is related to the removal of
> the clocksource watchdog timers. And under this specific configuration,
> the cpufreq work flow has some dependence on those watchdog timers.
>
> Also could you share your kernel config, boot messages and some
> system settings, like the tickless mode, so that other people can
> try to reproduce? thanks

I steal the kernel configuration file from the Ubuntu mainline PPA [1],
what they call "lowlatency", or 1000Hz tick. I make these changes
before compiling:

scripts/config --disable DEBUG_INFO
scripts/config --disable SYSTEM_TRUSTED_KEYS
scripts/config --disable SYSTEM_REVOCATION_KEYS

I will also send you the config and dmesg files in an off-list email.

This is a test of an idle system, with only very low periodic loads.
My test computer has no GUI and very few services running.

Notice that I have not used the word "regression" yet in this thread,
because I don't know for certain that it is one. In the end, we don't
care about CPU frequency, we care about wasting energy.
It is definitely a change, and I am able to measure small increases in
energy use, but this is all at the low end of the power curve. So far
I have not found a significant example of increased power use, but I
also have not looked very hard.

During any test, many monitoring tools might themselves shorten the
durations. For example, if I run turbostat, say:

sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 2.5

then, yes, the maximum duration would be 2.5 seconds, because turbostat
wakes up each CPU to inquire about things, causing a call to the CPU
scaling driver. (I tested this, for about 900 seconds.) For my power
tests I use a sample interval of >= 300 seconds. For duration-only
tests, turbostat is not run at the same time.

My grub line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on cpuidle.governor=teo"

A typical pstate tracer command (with the script copied to the
directory where I run this stuff):

sudo ./intel_pstate_tracer.py --interval 600 --name vnew02 --memory 800000

> > > Can you try one test that keeps all the current settings and changes
> > > the irq affinity of the disk/network-card to 0xfff, to let interrupts
> > > from them be distributed to all CPUs?
> >
> > I am willing to do the test, but I do not know how to change the
> > irq affinity.
>
> I may have said that too soon. I used to use "echo fff > /proc/irq/xxx/smp_affinity"
> (xxx is the irq number of a device) to let interrupts be distributed
> to all CPUs a long time ago, but it doesn't work on my 2 desktops at hand.
> It seems recent kernels only support one-CPU irq affinity.
>
> You can still try that command, though it may not work.

I have not tried this yet; a rough sketch of what I would try is at the
end of this email, after the link.

[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/
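
If and when I do try it, this is roughly what I expect to run. The grep
pattern and the irq numbers 125 and 126 below are only placeholders; the
real numbers have to be looked up in /proc/interrupts for whatever the
disk and network card on this system actually report:

# find the irq numbers assigned to the disk and the NIC (names are a guess)
grep -iE 'nvme|enp|eth' /proc/interrupts

# try to spread those irqs over all 12 CPUs (mask 0xfff), as suggested above
echo fff | sudo tee /proc/irq/125/smp_affinity
echo fff | sudo tee /proc/irq/126/smp_affinity

# check what affinity the kernel actually applied
cat /proc/irq/125/effective_affinity /proc/irq/126/effective_affinity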