Hi Doug,

I think you use CONFIG_NO_HZ_FULL. Here we are getting callbacks from
the scheduler. Can we check whether the scheduler actually woke up on
those CPUs? We can run "trace-cmd record -e sched" and check in
KernelShark whether there are similar gaps in activity.

Thanks,
Srinivas

On Sun, 2022-02-13 at 10:54 -0800, Doug Smythies wrote:
> Hi Rui,
>
> On Wed, Feb 9, 2022 at 11:45 PM Zhang, Rui <rui.zhang@xxxxxxxxx>
> wrote:
> > On 2022.02.09 14:23 Doug wrote:
> > > On Tue, Feb 8, 2022 at 1:15 AM Feng Tang <feng.tang@xxxxxxxxx>
> > > wrote:
> > > > On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
> > > > > > >
> > > > > > > Since kernel 5.16-rc4 and commit
> > > > > > > b50db7095fe002fa3e16605546cba66bf1b68a3e
> > > > > > > "x86/tsc: Disable clocksource watchdog for TSC on
> > > > > > > qualified platorms",
> > > > > > > there are now occasions where the time between calls to
> > > > > > > the driver can be hundreds of seconds, which can result
> > > > > > > in the CPU frequency being left unnecessarily high for
> > > > > > > extended periods.
> > > > > > >
> > > > > > > From the number of clock cycles executed between these
> > > > > > > long durations one can tell that the CPU has been
> > > > > > > running code, but the driver never got called.
> > > > > > >
> > > > > > > Attached are some graphs from trace data acquired using
> > > > > > > intel_pstate_tracer.py, where one can observe an idle
> > > > > > > system between about 42 and well over 200 seconds of
> > > > > > > elapsed time, yet CPU10 never gets called, which would
> > > > > > > have resulted in reducing its pstate request, until an
> > > > > > > elapsed time of 167.616 seconds, 126 seconds since the
> > > > > > > last call. The CPU frequency never does go to minimum.
> > > > > > >
> > > > > > > For reference, a similar CPU frequency graph is also
> > > > > > > attached, with the commit reverted. The CPU frequency
> > > > > > > drops to minimum over about 10 or 15 seconds.
> > > > > >
> > > > > > Commit b50db7095fe0 essentially disables the clocksource
> > > > > > watchdog, which literally doesn't have much to do with
> > > > > > cpufreq code.
> > > > > >
> > > > > > One thing I can think of is, without the patch, there is
> > > > > > a periodic clocksource timer running every 500 ms, and it
> > > > > > loops to run on all CPUs in turn. Your HW has 12 CPUs
> > > > > > (from the graph), so each CPU will get a timer (HW timer
> > > > > > interrupt backed) every 6 seconds. Could this affect the
> > > > > > cpufreq governor's work flow (I just quickly read some
> > > > > > cpufreq code, and it seems there is irq_work/workqueue
> > > > > > involved)?
> > > > >
> > > > > 6 seconds is the longest duration I have ever seen on this
> > > > > processor before commit b50db7095fe0.
> > > > >
> > > > > I said "the times between calls to the driver have never
> > > > > exceeded 10 seconds" originally, but that involved other
> > > > > processors.
> > > > >
> > > > > I also did longer, 9000 second tests:
> > > > >
> > > > > For a reverted kernel the driver was called 131,743 times,
> > > > > and the duration was never longer than 6.1 seconds.
> > > > >
> > > > > For a non-reverted kernel the driver was called 110,241
> > > > > times, 1397 times the duration was longer than 6.1 seconds,
> > > > > and the maximum duration was 303.6 seconds.
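> > > > >
> > > > > In case it helps others check their own trace data, this is
> > > > > roughly how such durations can be counted (a minimal sketch;
> > > > > it assumes the per-CPU .csv files written by
> > > > > intel_pstate_tracer.py carry the sample duration, in
> > > > > milliseconds, in the third field, so adjust the field number
> > > > > to match the actual files):
> > > > >
> > > > > for f in cpu*.csv; do
> > > > >     awk -F, -v file="$f" '
> > > > >         NR > 1 && $3 > 6100 { n++ }  # count durations > 6.1 s
> > > > >         END { printf "%s: %d long durations\n", file, n + 0 }
> > > > >     ' "$f"
> > > > > done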
> > > > Thanks for the data, which shows it is related to the removal
> > > > of the clocksource watchdog timers. And under this specific
> > > > configuration, the cpufreq work flow has some dependence on
> > > > those watchdog timers.
> > > >
> > > > Also, could you share your kernel config, boot messages and
> > > > some system settings, like the tickless mode, so that other
> > > > people can try to reproduce? Thanks.
> > >
> > > I steal the kernel configuration file from the Ubuntu mainline
> > > PPA [1], what they call "lowlatency", or 1000 Hz tick. I make
> > > these changes before compiling:
> > >
> > > scripts/config --disable DEBUG_INFO
> > > scripts/config --disable SYSTEM_TRUSTED_KEYS
> > > scripts/config --disable SYSTEM_REVOCATION_KEYS
> > >
> > > I also sent you the config and dmesg files in an off-list email.
> > >
> > > This is a test of an idle system with very low periodic loads.
> > > My test computer has no GUI and very few services running.
> > > Notice that I have not used the word "regression" yet in this
> > > thread, because I don't know for certain that it is one. In the
> > > end, we don't care about CPU frequency, we care about wasting
> > > energy. It is definitely a change, and I am able to measure
> > > small increases in energy use, but this is all at the low end
> > > of the power curve.
>
> Please keep the above statement in mind. The differences were rather
> minor. Since Rui asks for data below, I have tried to find better
> examples.
>
> > What do you use to measure the energy use? And what difference do
> > you observe?
>
> I use both turbostat and a processor power monitoring tool of my
> own. Don't get me wrong, I love turbostat, but it has become
> somewhat heavy (lots of code per interval) over recent years. My
> own utility has minimal code per interval and only uses
> pre-allocated memory, with no disk or terminal interaction during
> the sampling phase, resulting in minimal system perturbation due to
> itself.
>
> > > So far I have not found a significant example of increased
> > > power use, but I also have not looked very hard.
>
> I looked a little harder for this reply, searching all single-CPU
> loads for a few work/sleep frequencies (73, 113, 211, 347, 401
> hertz) with a fixed work packet per work interval.
>
> > > During any test, many monitoring tools might shorten durations.
> > > For example, if I run turbostat, say:
> > >
> > > sudo turbostat --Summary --quiet --show
> > > Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt
> > > --interval 2.5
> > >
> > > well, then the maximum duration would be 2.5 seconds, because
> > > turbostat wakes up each CPU to inquire about things, causing a
> > > call to the CPU scaling driver. (I tested this, for about 900
> > > seconds.)
> > >
> > > For my power tests I use a sample interval of >= 300 seconds.
>
> > So you use something like "turbostat sleep 900" for the power
> > test,
>
> Typically I use 300 seconds for turbostat for this work. It was
> only the other day that I saw a duration of 302 seconds, so yes, I
> should try even longer, but it is becoming a tradeoff between
> experiments taking many, many hours and the time available.
>
> > and the RAPL energy counters show the power difference?
>
> Yes.
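>
> For anyone wanting to reproduce without turbostat, the same RAPL
> package energy counter can be read through the powercap sysfs
> interface. A minimal sketch (the intel-rapl:0 path is what it is on
> my system and may differ on yours; reading energy_uj may require
> root; counter wraparound, bounded by max_energy_range_uj, is
> ignored here):
>
> E1=$(cat /sys/class/powercap/intel-rapl:0/energy_uj)
> sleep 300
> E2=$(cat /sys/class/powercap/intel-rapl:0/energy_uj)
> # energy_uj counts microjoules; delta / seconds / 1e6 = watts
> echo "scale=3; ($E2 - $E1) / 300 / 1000000" | bc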
>
> > Can you paste the turbostat output both w/ and w/o the watchdog?
>
> Workflow:
>
> Task 1: Purpose: main periodic load.
>
> 411 hertz work/sleep frequency and about 26.6% load at 4.8 GHz,
> max turbo (it would limit out at a 100% duty cycle at about
> pstate 13).
>
> Note: this is a higher load than I was using when I started this
> email thread.
>
> Task 2: Purpose: to assist in promoting manifestation of the
> issue of potential concern with commit b50db7095fe0.
>
> A short burst of work (about 30 milliseconds @ max turbo, longer
> at lower frequencies), then sleep for 45 seconds. Say, almost 0
> load at 0.022 hertz. Greater than or equal to 30 milliseconds at
> full load ensures that the intel_pstate driver will be called at
> least 2 or 3 times, raising the requested pstate for that CPU.
>
> Task 3: Purpose: acquire power data with minimum system
> perturbation due to the monitoring task itself.
>
> Both turbostat and my own power monitor were used, never at the
> same time (unless just to prove they reported the same power).
> This turbostat command:
>
> turbostat --Summary --quiet --show
> Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval
> 300
>
> My program also sampled at 300 seconds per sample, typically 5
> samples, starting 1/2 hour after boot.
>
> Kernels:
> stock: 5.17-rc3 as is.
> revert: 5.17-rc3 with b50db7095fe0 reverted.
>
> Test 1: no-hwp/active/powersave:
>
> Doug's cpu_power_mon:
>
> Stock: 5 samples @ 300 seconds per sample:
> average: 4.7 watts (+9%)
> minimum: 4.3 watts
> maximum: 4.9 watts
>
> Revert: 5 samples @ 300 seconds per sample:
> average: 4.3 watts
> minimum: 4.2 watts
> maximum: 4.5 watts
>
> Test 2: no-hwp/passive/schedutil:
>
> In the beginning, active/powersave was used because it is the
> easiest (at least for me) to understand and interpret the
> intel_pstate_tracer.py results for. Long durations are common in
> passive mode, because something higher up can decide not to call
> the driver when nothing changed. Very valuable information is
> lost, in my opinion.
>
> I didn't figure it out for a while, but schedutil is bi-stable
> with this workflow, depending on whether it approaches steady
> state from a higher or lower previous load (i.e. hysteresis).
> With either kernel it might run for hours and hours at an average
> of 6.1 watts or 4.9 watts. This difference dominates, so trying to
> reveal whether there is any contribution from the commit of this
> thread is useless.
>
> Test 3: Similar workflow to test 1, but 347 hertz and a little
> less work per work period.
>
> Task 2 was changed to use more CPUs, to try to amplify
> manifestation of the effect (a sketch of the idea follows below).
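>
> Something like this hypothetical sketch of task 2 (the CPU list
> and timings are illustrative, not my exact script):
>
> # burst ~30 ms of busy work on a few CPUs, then sleep 45 seconds
> while true; do
>     for cpu in 4 6 10; do
>         taskset -c "$cpu" timeout 0.03 sh -c 'while :; do :; done' &
>     done
>     wait
>     sleep 45
> done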
>
> Doug's cpu_power_mon:
>
> Stock: 5 samples @ 300 seconds per sample:
> average: 4.2 watts (+31%)
> minimum: 3.5 watts
> maximum: 4.9 watts
>
> Revert: 5 samples @ 300 seconds per sample:
> average: 3.2 watts
> minimum: 3.1 watts
> maximum: 3.2 watts
>
> Re-do of test 1, but with the improved task 2:
>
> Turbostat:
>
> Stock:
>
> doug@s19:~$ sudo
> /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary
> --quiet --show
> Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt
> --interval 300
> Busy%  Bzy_MHz  IRQ     PkgTmp  PkgWatt  CorWatt  GFXWatt  RAMWatt
> 4.09   3335     282024  36      5.79     5.11     0.00     0.89
> 4.11   3449     283286  34      6.06     5.40     0.00     0.89
> 4.11   3504     284532  35      6.35     5.70     0.00     0.89
> 4.11   3493     283908  35      6.26     5.60     0.00     0.89
> 4.11   3499     284243  34      6.27     5.62     0.00     0.89
>
> Revert:
>
> doug@s19:~/freq-scalers/long_dur$ sudo
> /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary
> --quiet --show
> Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt
> --interval 300
> Busy%  Bzy_MHz  IRQ     PkgTmp  PkgWatt  CorWatt  GFXWatt  RAMWatt
> 4.12   3018     283114  34      4.57     3.91     0.00     0.89
> 4.12   3023     283482  34      4.59     3.94     0.00     0.89
> 4.12   3017     282609  34      4.71     4.05     0.00     0.89
> 4.12   3013     282898  34      4.64     3.99     0.00     0.89
> 4.12   3013     282956  36      4.56     3.90     0.00     0.89
> 4.12   3026     282807  34      4.61     3.95     0.00     0.89
> 4.12   3026     282738  35      4.50     3.84     0.00     0.89
>
> That is an average of 6.2 watts versus 4.6 watts, +35%.
>
> Several other tests had undetectable power differences between
> kernels, but I did not go back and re-test with the improved
> task 2.
>
> > Thanks,
> > rui
>
> > > For duration-only tests, turbostat is not run at the same time.
> > >
> > > My grub line:
> > >
> > > GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314
> > > intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on
> > > cpuidle.governor=teo"
> > >
> > > A typical pstate tracer command (with the script copied to the
> > > directory where I run this stuff):
> > >
> > > sudo ./intel_pstate_tracer.py --interval 600 --name vnew02
> > > --memory 800000
> > >
> > > > > > Can you try one test that keeps all the current settings
> > > > > > and changes the irq affinity of the disk/network card to
> > > > > > 0xfff, to let interrupts from them be distributed to all
> > > > > > CPUs?
> > > > >
> > > > > I am willing to do the test, but I do not know how to
> > > > > change the irq affinity.
> > > >
> > > > I might have said that too soon. I used to
> > > > "echo fff > /proc/irq/xxx/smp_affinity"
> > > > (xxx is the irq number of a device) to let interrupts be
> > > > distributed to all CPUs a long time ago, but it doesn't work
> > > > on my 2 desktops at hand. It seems recent kernels only
> > > > support one-CPU irq affinity.
> > > >
> > > > You can still try that command, though it may not work.
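> > > > For example (the device name eth0 and the irq number 24 are
> > > > just placeholders for illustration):
> > > >
> > > > # find the device's irq number first:
> > > > grep eth0 /proc/interrupts
> > > > # then try to spread it over all 12 CPUs (0xfff = CPUs 0-11);
> > > > # the write may fail if only one-CPU affinity is supported:
> > > > echo fff | sudo tee /proc/irq/24/smp_affinity
> > >
> > > I did not try this yet.
> > >
> > > [1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/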