Hi Rui,

On Wed, Feb 9, 2022 at 11:45 PM Zhang, Rui <rui.zhang@xxxxxxxxx> wrote:
> On 2022.02.09 14:23 Doug wrote:
> > On Tue, Feb 8, 2022 at 1:15 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> > > On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
> > > > > >
> > > > > > Since kernel 5.16-rc4 and commit:
> > > > > > b50db7095fe002fa3e16605546cba66bf1b68a3e
> > > > > > "x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"
> > > > > >
> > > > > > There are now occasions where times between calls to the driver
> > > > > > can be over 100's of seconds and can result in the CPU frequency
> > > > > > being left unnecessarily high for extended periods.
> > > > > >
> > > > > > From the number of clock cycles executed between these long
> > > > > > durations one can tell that the CPU has been running code, but
> > > > > > the driver never got called.
> > > > > >
> > > > > > Attached are some graphs from some trace data acquired using
> > > > > > intel_pstate_tracer.py where one can observe an idle system
> > > > > > between about 42 and well over 200 seconds elapsed time, yet
> > > > > > CPU10 never gets called, which would have resulted in reducing
> > > > > > its pstate request, until an elapsed time of 167.616 seconds,
> > > > > > 126 seconds since the last call. The CPU frequency never does go
> > > > > > to minimum.
> > > > > >
> > > > > > For reference, a similar CPU frequency graph is also attached,
> > > > > > with the commit reverted. The CPU frequency drops to minimum
> > > > > > over about 10 or 15 seconds.
> > > > >
> > > > > Commit b50db7095fe0 essentially disables the clocksource watchdog,
> > > > > which literally doesn't have much to do with cpufreq code.
> > > > >
> > > > > One thing I can think of is, without the patch, there is a
> > > > > periodic clocksource timer running every 500 ms, and it loops to
> > > > > run on all CPUs in turn. For your HW, it has 12 CPUs (from the
> > > > > graph), so each CPU will get a timer (HW timer interrupt backed)
> > > > > every 6 seconds. Could this affect the cpufreq governor's work
> > > > > flow (I just quickly read some cpufreq code, and it seems there is
> > > > > irq_work/workqueue involved)?
> > > >
> > > > 6 seconds is the longest duration I have ever seen on this processor
> > > > before commit b50db7095fe0.
> > > >
> > > > I said "the times between calls to the driver have never exceeded 10
> > > > seconds" originally, but that involved other processors.
> > > >
> > > > I also did longer, 9000 second tests:
> > > >
> > > > For a reverted kernel the driver was called 131,743 times, and 0 times
> > > > the duration was longer than 6.1 seconds.
> > > >
> > > > For a non-reverted kernel the driver was called 110,241 times, and
> > > > 1397 times the duration was longer than 6.1 seconds, and the maximum
> > > > duration was 303.6 seconds.
> > >
> > > Thanks for the data, which shows it is related to the removal of the
> > > clocksource watchdog timers. And under this specific configuration,
> > > the cpufreq work flow has some dependence on those watchdog timers.
> > >
> > > Also, could you share your kernel config, boot messages and some
> > > system settings, like for tickless mode, so that other people can try
> > > to reproduce? Thanks.
> >
> > I steal the kernel configuration file from the Ubuntu mainline PPA [1],
> > what they call "lowlatency", or 1000Hz tick.
> > I make these changes before compile:
> >
> > scripts/config --disable DEBUG_INFO
> > scripts/config --disable SYSTEM_TRUSTED_KEYS
> > scripts/config --disable SYSTEM_REVOCATION_KEYS
> >
> > I also send you the config and dmesg files in an off-list email.
> >
> > This is an idle, and very low periodic load, type of system test.
> > My test computer has no GUI and very few services running.
> > Notice that I have not used the word "regression" yet in this thread,
> > because I don't know for certain that it is. In the end, we don't care
> > about CPU frequency, we care about wasting energy.
> > It is definitely a change, and I am able to measure small increases in
> > energy use, but this is all at the low end of the power curve.

Please keep the above statement in mind: the differences were rather minor.
Since Rui asks for data below, I have tried to find better examples.

> What do you use to measure the energy use? And what difference do you
> observe?

I use both turbostat and a processor power monitoring tool of my own.
Don't get me wrong, I love turbostat, but it has become somewhat heavy
(lots of code per interval) over recent years. My own utility has minimal
code per interval and only uses pre-allocated memory, with no disk or
terminal interaction during the sampling phase, resulting in minimal
system perturbation from the monitoring itself.

> > So far I have not found a significant example of increased power use,
> > but I also have not looked very hard.

I looked a little harder for this reply, sweeping single-CPU loads at a
few work/sleep frequencies (73, 113, 211, 347, 401 hertz) with a fixed
work packet per work interval.

> > During any test, many monitoring tools might shorten durations.
> > For example, if I run turbostat, say:
> >
> > sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 2.5
> >
> > Well, yes, then the maximum duration would be 2.5 seconds, because
> > turbostat wakes up each CPU to inquire about things, causing a call to
> > the CPU scaling driver. (I tested this, for about 900 seconds.)
> >
> > For my power tests I use a sample interval of >= 300 seconds.
>
> So you use something like "turbostat sleep 900" for the power test,

Typically I use 300 seconds for turbostat for this work. It was only the
other day that I saw a duration of 302 seconds, so yes, I should try even
longer, but it is becoming a tradeoff between experiments taking many,
many hours and the time available.

> and the RAPL Energy counters show the power difference?

Yes.

> Can you paste the turbostat output both w/ and w/o the watchdog?

Workflow:

Task 1:
Purpose: main periodic load. 411 hertz work/sleep frequency and about
26.6% load at 4.8 GHz, max turbo (it would limit out at 100% duty cycle
at about pstate 13).
Note: this is a higher load than I was using when I started this email
thread.

Task 2:
Purpose: to help provoke the issue of potential concern with commit
b50db7095fe0. A short burst of work (about 30 milliseconds @ max turbo,
longer at lower frequencies), then sleep for 45 seconds. Say, almost 0
load at 0.022 hertz. Greater than or equal to 30 milliseconds at full
load ensures that the intel_pstate driver will be called at least 2 or 3
times, raising the requested pstate for that CPU. (A rough sketch of this
style of load generator follows the workflow description.)

Task 3:
Purpose: acquire power data with minimum system perturbation due to this
very monitoring task. Both turbostat and my own power monitor were used,
never at the same time (unless just to prove they reported the same
power). This turbostat command:

turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 300

and my program also sampled at 300 seconds per sample, typically 5
samples, and only after at least 1/2 hour since boot.
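Tasks 1 and 2 are along these lines. This is a minimal sketch, not my
actual test programs: the LOOPS value is made up and would need
calibrating to give the ~26.6% load at max turbo, and for task 2 think of
a 45 second period with a work packet that runs for about 30 milliseconds
at max turbo.

/*
 * Minimal sketch of the task 1 / task 2 style load generator, not the
 * actual test program.  A fixed work packet is executed once per
 * interval, then the task sleeps to the next absolute interval start.
 * LOOPS is made up here and would need calibrating to give ~26.6% load
 * at max turbo for the 411 hertz case; for task 2 use a 45 second
 * period and a packet that runs for ~30 milliseconds at max turbo.
 */
#define _GNU_SOURCE
#include <time.h>

#define PERIOD_NS  (1000000000LL / 411)  /* 411 hertz work/sleep frequency */
#define LOOPS      650000LL              /* the fixed work packet (calibrate) */

int main(void)
{
        struct timespec next;
        volatile double x = 0.0;
        long long i;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (;;) {
                /* the fixed work packet for this interval */
                for (i = 0; i < LOOPS; i++)
                        x += (double)i;

                /* sleep to the next absolute interval start, so the
                 * work/sleep frequency does not drift */
                next.tv_nsec += PERIOD_NS;
                while (next.tv_nsec >= 1000000000L) {
                        next.tv_nsec -= 1000000000L;
                        next.tv_sec++;
                }
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        }
        return 0;
}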
Kernels:
stock: 5.17-rc3 as is.
revert: 5.17-rc3 with b50db7095fe0 reverted.

Test 1: no-hwp/active/powersave:

Doug's cpu_power_mon:
Stock: 5 samples @ 300 seconds per sample:
  average: 4.7 watts (+9%)
  minimum: 4.3 watts
  maximum: 4.9 watts
Revert: 5 samples @ 300 seconds per sample:
  average: 4.3 watts
  minimum: 4.2 watts
  maximum: 4.5 watts

Test 2: no-hwp/passive/schedutil:

In the beginning, active/powersave was used because it is the easiest (at
least for me) to understand and to interpret the intel_pstate_tracer.py
results for. Long durations are common in passive mode, because something
higher up can decide not to call the driver when nothing has changed.
Very valuable information is lost, in my opinion.

I didn't figure it out for a while, but schedutil is bi-stable with this
workflow, depending on whether it approaches steady state from a higher
or lower previous load (i.e. hysteresis). With either kernel it might run
for hours and hours at an average of 6.1 watts or 4.9 watts. This
difference dominates, so trying to reveal any contribution from the
commit of this thread is useless.

Test 3: Similar workflow to test 1, but 347 hertz and a little less work
per work period. Task 2 was changed to use more CPUs, to try to amplify
manifestation of the effect.

Doug's cpu_power_mon:
Stock: 5 samples @ 300 seconds per sample:
  average: 4.2 watts (+31%)
  minimum: 3.5 watts
  maximum: 4.9 watts
Revert: 5 samples @ 300 seconds per sample:
  average: 3.2 watts
  minimum: 3.1 watts
  maximum: 3.2 watts

Re-do of test 1, but with the improved task 2:

Turbostat:

Stock:

doug@s19:~$ sudo /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 300
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt CorWatt GFXWatt RAMWatt
4.09    3335    282024  36      5.79    5.11    0.00    0.89
4.11    3449    283286  34      6.06    5.40    0.00    0.89
4.11    3504    284532  35      6.35    5.70    0.00    0.89
4.11    3493    283908  35      6.26    5.60    0.00    0.89
4.11    3499    284243  34      6.27    5.62    0.00    0.89

Revert:

doug@s19:~/freq-scalers/long_dur$ sudo /home/doug/kernel/linux/tools/power/x86/turbostat/turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 300
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt CorWatt GFXWatt RAMWatt
4.12    3018    283114  34      4.57    3.91    0.00    0.89
4.12    3023    283482  34      4.59    3.94    0.00    0.89
4.12    3017    282609  34      4.71    4.05    0.00    0.89
4.12    3013    282898  34      4.64    3.99    0.00    0.89
4.12    3013    282956  36      4.56    3.90    0.00    0.89
4.12    3026    282807  34      4.61    3.95    0.00    0.89
4.12    3026    282738  35      4.50    3.84    0.00    0.89

Or an average of 6.2 watts versus 4.6 watts, +35%.

Several other tests had undetectable power differences between kernels,
but I did not go back and re-test with the improved task 2.
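In case it helps interpret the cpu_power_mon numbers above, the idea
behind that utility is roughly the following. This is a minimal sketch
only, not the actual program: it assumes the package RAPL domain is
intel-rapl:0, ignores counter wrap (max_energy_range_uj), and must run as
root.

/*
 * Minimal sketch of the cpu_power_mon idea, not the actual program:
 * sample the RAPL package energy counter (microjoules) at long
 * intervals into a pre-allocated array, with no disk or terminal I/O
 * during the sampling phase, then print average watts per interval.
 * Assumes the package domain is intel-rapl:0, ignores counter wrap,
 * and needs to run as root.
 */
#include <stdio.h>
#include <unistd.h>

#define RAPL_PKG  "/sys/class/powercap/intel-rapl:0/energy_uj"
#define INTERVAL  300           /* seconds per sample */
#define SAMPLES   5

static long long read_energy_uj(void)
{
        long long uj = -1;
        FILE *f = fopen(RAPL_PKG, "r");

        if (f) {
                if (fscanf(f, "%lld", &uj) != 1)
                        uj = -1;
                fclose(f);
        }
        return uj;
}

int main(void)
{
        long long e[SAMPLES + 1];
        int i;

        e[0] = read_energy_uj();
        for (i = 1; i <= SAMPLES; i++) {
                sleep(INTERVAL);
                e[i] = read_energy_uj();        /* no other I/O in here */
        }
        for (i = 1; i <= SAMPLES; i++)
                printf("sample %d: %.2f watts\n", i,
                       (double)(e[i] - e[i - 1]) / 1e6 / INTERVAL);
        return 0;
}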
>
> Thanks,
> rui
>
> > For duration only tests, turbostat is not run at the same time.
> >
> > My grub line:
> > GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on cpuidle.governor=teo"
> >
> > A typical pstate tracer command (with the script copied to the directory
> > where I run this stuff):
> >
> > sudo ./intel_pstate_tracer.py --interval 600 --name vnew02 --memory 800000
> >
> > > > > Can you try one test that keeps all the current settings and
> > > > > changes the irq affinity of disk/network-card to 0xfff, to let
> > > > > interrupts from them be distributed to all CPUs?
> > > >
> > > > I am willing to do the test, but I do not know how to change the irq
> > > > affinity.
> > >
> > > I might have said that too soon. I used to
> > > "echo fff > /proc/irq/xxx/smp_affinity" (xxx is the irq number of a
> > > device) to let interrupts be distributed to all CPUs a long time ago,
> > > but it doesn't work on my 2 desktops at hand. Seems it only supports
> > > one-cpu irq affinity in recent kernels.
> > >
> > > You can still try that command, though it may not work.
> >
> > I did not try this yet.
> >
> > [1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/
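For when I do get to Feng's irq affinity suggestion, something along
these lines is probably what I will try. It is a sketch only, untested
here: it just attempts to write fff to every /proc/irq/N/smp_affinity and
reports which ones the kernel rejects, since, as Feng notes, recent
kernels may only accept a single CPU.

/*
 * Sketch only, untested: attempt to set every IRQ's affinity mask to
 * 0xfff (all 12 CPUs on this system) via /proc/irq/N/smp_affinity and
 * report which ones the kernel rejects.  Needs to run as root.
 */
#include <stdio.h>
#include <dirent.h>
#include <ctype.h>

int main(void)
{
        DIR *d = opendir("/proc/irq");
        struct dirent *de;
        char path[64];

        if (!d)
                return 1;
        while ((de = readdir(d)) != NULL) {
                FILE *f;
                int ok;

                if (!isdigit((unsigned char)de->d_name[0]))
                        continue;
                snprintf(path, sizeof(path), "/proc/irq/%s/smp_affinity",
                         de->d_name);
                f = fopen(path, "w");
                if (!f)
                        continue;
                ok = (fprintf(f, "fff\n") > 0);
                if (fclose(f) != 0)     /* write error often shows at close */
                        ok = 0;
                printf("irq %s: %s\n", de->d_name,
                       ok ? "set to fff" : "rejected");
        }
        closedir(d);
        return 0;
}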