On Tue, Feb 8, 2022 at 1:15 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
> > > > Since kernel 5.16-rc4 and commit: b50db7095fe002fa3e16605546cba66bf1b68a3e
> > > > "x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"
> > > >
> > > > There are now occasions where times between calls to the driver can be
> > > > over 100's of seconds and can result in the CPU frequency being left
> > > > unnecessarily high for extended periods.
> > > >
> > > > From the number of clock cycles executed between these long
> > > > durations one can tell that the CPU has been running code, but
> > > > the driver never got called.
> > > >
> > > > Attached are some graphs from some trace data acquired using
> > > > intel_pstate_tracer.py where one can observe an idle system between
> > > > about 42 and well over 200 seconds elapsed time, yet CPU10 never gets
> > > > called, which would have resulted in reducing its pstate request, until
> > > > an elapsed time of 167.616 seconds, 126 seconds since the last call. The
> > > > CPU frequency never does go to minimum.
> > > >
> > > > For reference, a similar CPU frequency graph is also attached, with
> > > > the commit reverted. The CPU frequency drops to minimum over
> > > > about 10 or 15 seconds.
> > >
> > > commit b50db7095fe0 essentially disables the clocksource watchdog,
> > > which literally doesn't have much to do with cpufreq code.
> > >
> > > One thing I can think of is, without the patch, there is a periodic
> > > clocksource timer running every 500 ms, and it loops to run on
> > > all CPUs in turn. For your HW, it has 12 CPUs (from the graph),
> > > so each CPU will get a timer (HW timer interrupt backed) every 6
> > > seconds. Could this affect the cpufreq governor's work flow? (I just
> > > quickly read some cpufreq code, and it seems there is irq_work/workqueue
> > > involved.)
> >
> > 6 seconds is the longest duration I have ever seen on this
> > processor before commit b50db7095fe0.
> >
> > I said "the times between calls to the driver have never
> > exceeded 10 seconds" originally, but that involved other processors.
> >
> > I also did longer, 9000 second tests:
> >
> > For a reverted kernel the driver was called 131,743 times,
> > and 0 times the duration was longer than 6.1 seconds.
> >
> > For a non-reverted kernel the driver was called 110,241 times,
> > and 1397 times the duration was longer than 6.1 seconds,
> > and the maximum duration was 303.6 seconds.
>
> Thanks for the data, which shows it is related to the removal of
> the clocksource watchdog timers. And under this specific configuration,
> the cpufreq work flow has some dependence on those watchdog timers.
>
> Also could you share your kernel config, boot messages and some
> system settings, like the tickless mode, so that other people can
> try to reproduce? thanks

I steal the kernel configuration file from the Ubuntu mainline PPA [1],
what they call "lowlatency", or 1000Hz tick. I make these changes
before compiling:

scripts/config --disable DEBUG_INFO
scripts/config --disable SYSTEM_TRUSTED_KEYS
scripts/config --disable SYSTEM_REVOCATION_KEYS

I will also send you the config and dmesg files in an off-list email.

This is a test of an idle system, with only very low periodic loads.
My test computer has no GUI and very few services running.

Notice that I have not used the word "regression" yet in this thread,
because I don't know for certain that it is one. In the end, we don't
care about CPU frequency, we care about wasting energy.
It is definitely a change, and I am able to measure small increases in
energy use, but this is all at the low end of the power curve. So far
I have not found a significant example of increased power use, but I
also have not looked very hard.

During any test, many monitoring tools might themselves shorten the
durations. For example, if I run turbostat, say:

sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 2.5

then, yes, the maximum duration would be 2.5 seconds, because turbostat
wakes up each CPU to inquire about things, causing a call to the CPU
scaling driver. (I tested this, for about 900 seconds.) For my power
tests I use a sample interval of >= 300 seconds. For duration-only
tests, turbostat is not run at the same time.

My grub line:

GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314 intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on cpuidle.governor=teo"

A typical pstate tracer command (with the script copied to the
directory where I run this stuff):

sudo ./intel_pstate_tracer.py --interval 600 --name vnew02 --memory 800000

> > > Can you try one test that keeps all the current settings and changes
> > > the irq affinity of the disk/network-card to 0xfff, to let interrupts
> > > from them be distributed to all CPUs?
> >
> > I am willing to do the test, but I do not know how to change the
> > irq affinity.
>
> I may have said that too soon. I used to use "echo fff > /proc/irq/xxx/smp_affinity"
> (xxx is the irq number of a device) to let interrupts be distributed
> to all CPUs a long time ago, but it doesn't work on my 2 desktops at hand.
> It seems recent kernels only support one-CPU irq affinity.
>
> You can still try that command, though it may not work.

I have not tried this yet; a rough sketch of what I would try is at the
end of this email, after the link.

[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/
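
If and when I do try it, this is roughly what I expect to run. The grep
pattern and the irq numbers 125 and 126 below are only placeholders; the
real numbers have to be looked up in /proc/interrupts for whatever the
disk and network card on this system actually report:

# find the irq numbers assigned to the disk and the NIC (names are a guess)
grep -iE 'nvme|enp|eth' /proc/interrupts

# try to spread those irqs over all 12 CPUs (mask 0xfff), as suggested above
echo fff | sudo tee /proc/irq/125/smp_affinity
echo fff | sudo tee /proc/irq/126/smp_affinity

# check what affinity the kernel actually applied
cat /proc/irq/125/effective_affinity /proc/irq/126/effective_affinity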