> -----Original Message-----
> From: Doug Smythies <dsmythies@xxxxxxxxx>
> Sent: Wednesday, February 09, 2022 2:23 PM
> To: Tang, Feng <feng.tang@xxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>; paulmck@xxxxxxxxxx;
> stable@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx; linux-pm@xxxxxxxxxxxxxxx;
> srinivas pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx>;
> dsmythies <dsmythies@xxxxxxxxx>
> Subject: Re: CPU excessively long times between frequency scaling driver
> calls - bisected
>
> On Tue, Feb 8, 2022 at 1:15 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> > On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
> > > > >
> > > > > Since kernel 5.16-rc4 and commit:
> > > > > b50db7095fe002fa3e16605546cba66bf1b68a3e
> > > > > "x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"
> > > > >
> > > > > There are now occasions where times between calls to the driver
> > > > > can be hundreds of seconds, and can result in the CPU frequency
> > > > > being left unnecessarily high for extended periods.
> > > > >
> > > > > From the number of clock cycles executed between these long
> > > > > durations one can tell that the CPU has been running code, but
> > > > > the driver never got called.
> > > > >
> > > > > Attached are some graphs from some trace data acquired using
> > > > > intel_pstate_tracer.py where one can observe an idle system
> > > > > between about 42 and well over 200 seconds elapsed time, yet
> > > > > the driver is never called for CPU10, which would have reduced
> > > > > its pstate request, until an elapsed time of 167.616 seconds,
> > > > > 126 seconds since the last call. The CPU frequency never does
> > > > > go to minimum.
> > > > >
> > > > > For reference, a similar CPU frequency graph is also attached,
> > > > > with the commit reverted. The CPU frequency drops to minimum
> > > > > over about 10 or 15 seconds.
> > > >
> > > > Commit b50db7095fe0 essentially disables the clocksource watchdog,
> > > > which literally doesn't have much to do with cpufreq code.
> > > >
> > > > One thing I can think of is that, without the patch, there is a
> > > > periodic clocksource timer running every 500 ms, and it loops to
> > > > run on all CPUs in turn. Your HW has 12 CPUs (from the graph), so
> > > > each CPU will get a timer (HW timer interrupt backed) every 6
> > > > seconds. Could this affect the cpufreq governor's work flow?
> > > > (I just quickly read some cpufreq code, and it seems there is
> > > > irq_work/workqueue involved.)
> > >
> > > 6 seconds is the longest duration I had ever seen on this processor
> > > before commit b50db7095fe0.
> > >
> > > I said "the times between calls to the driver have never exceeded 10
> > > seconds" originally, but that involved other processors.
> > >
> > > I also did longer, 9000 second tests:
> > >
> > > For a reverted kernel the driver was called 131,743 times, and the
> > > duration was never longer than 6.1 seconds.
> > >
> > > For a non-reverted kernel the driver was called 110,241 times, and
> > > 1397 times the duration was longer than 6.1 seconds, and the maximum
> > > duration was 303.6 seconds.
> >
> > Thanks for the data, which shows it is related to the removal of the
> > clocksource watchdog timers. And under this specific configuration,
> > the cpufreq work flow has some dependence on those watchdog timers.
> >
> > Also, could you share your kernel config, boot messages and some system
> > settings, like the tickless mode, so that other people can try to
> > reproduce?
> >
> > thanks
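As a side note on reproducing the duration measurements: below is a minimal
sketch of watching the same tracepoint that intel_pstate_tracer.py consumes,
assuming intel_pstate is the scaling driver and tracefs is mounted at
/sys/kernel/tracing (older kernels expose it at /sys/kernel/debug/tracing):

    # Enable the intel_pstate sample tracepoint and stream it (as root).
    # The gap between consecutive timestamps for a given CPU is the
    # "duration" between driver calls discussed above.
    cd /sys/kernel/tracing
    echo 1 > events/power/pstate_sample/enable
    cat trace_pipe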
> I steal the kernel configuration file from the Ubuntu mainline PPA [1], what
> they call "lowlatency", or 1000Hz tick. I make these changes before compiling:
>
> scripts/config --disable DEBUG_INFO
> scripts/config --disable SYSTEM_TRUSTED_KEYS
> scripts/config --disable SYSTEM_REVOCATION_KEYS
>
> I also send you the config and dmesg files in an off-list email.
>
> This is an idle-system and very-low-periodic-load type of test.
> My test computer has no GUI and very few services running.
> Notice that I have not used the word "regression" yet in this thread,
> because I don't know for certain that it is one. In the end, we don't care
> about CPU frequency, we care about wasting energy.
> It is definitely a change, and I am able to measure small increases in
> energy use, but this is all at the low end of the power curve.

What do you use to measure the energy use? And what difference do you
observe?

> So far I have not found a significant example of increased power use, but I
> also have not looked very hard.
>
> During any test, many monitoring tools might shorten durations.
> For example, if I run turbostat, say:
>
> sudo turbostat --Summary --quiet --show
> Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 2.5
>
> well, yes, then the maximum duration would be 2.5 seconds, because
> turbostat wakes up each CPU to inquire about things, causing a call to the
> CPU scaling driver. (I tested this, for about 900 seconds.)
>
> For my power tests I use a sample interval of >= 300 seconds.

So you use something like "turbostat sleep 900" for the power tests, and the
RAPL energy counters show the power difference? Can you paste the turbostat
output both w/ and w/o the watchdog?

Thanks,
rui

> For duration-only tests, turbostat is not run at the same time.
>
> My grub line:
>
> GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314
> intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on
> cpuidle.governor=teo"
>
> A typical pstate tracer command (with the script copied to the directory
> where I run this stuff):
>
> sudo ./intel_pstate_tracer.py --interval 600 --name vnew02 --memory 800000
>
> > > > Can you try one test that keeps all the current settings and changes
> > > > the irq affinity of the disk/network card to 0xfff to let interrupts
> > > > from them be distributed to all CPUs?
> > >
> > > I am willing to do the test, but I do not know how to change the irq
> > > affinity.
> >
> > I may have said that too soon. I used to
> > "echo fff > /proc/irq/xxx/smp_affinity" (xxx is the irq number of a
> > device) to let interrupts be distributed to all CPUs a long time ago,
> > but it doesn't work on my 2 desktops at hand. It seems recent kernels
> > only support one-CPU irq affinity.
> >
> > You can still try that command, though it may not work.
>
> I did not try this yet.
>
> [1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/
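For anyone who wants to retry Feng's irq affinity experiment, here is a
minimal sketch that attempts it for every device irq at once rather than a
single one. It is only an illustration, not something tested in this thread:
the 0xfff mask assumes the 12-CPU machine above, and writes to managed or
per-CPU interrupts are expected to fail:

    # Run as root. Try to spread every device irq over CPUs 0-11
    # (mask 0xfff). Kernels that pin managed interrupts will reject
    # the write, so errors are silenced rather than treated as fatal.
    for irq in /proc/irq/[0-9]*; do
        echo fff > "$irq/smp_affinity" 2>/dev/null
    done
    grep . /proc/irq/*/smp_affinity   # show which masks actually changed

Watching /proc/interrupts afterwards would show whether the disk/network
interrupts really start landing on all CPUs.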