> -----Original Message-----
> From: Doug Smythies <dsmythies@xxxxxxxxx>
> Sent: Wednesday, February 09, 2022 2:23 PM
> To: Tang, Feng <feng.tang@xxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>; paulmck@xxxxxxxxxx;
> stable@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx; linux-pm@xxxxxxxxxxxxxxx;
> srinivas pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx>;
> dsmythies <dsmythies@xxxxxxxxx>
> Subject: Re: CPU excessively long times between frequency scaling driver
> calls - bisected
>
> On Tue, Feb 8, 2022 at 1:15 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> > On Mon, Feb 07, 2022 at 11:13:00PM -0800, Doug Smythies wrote:
> > > > >
> > > > > Since kernel 5.16-rc4 and commit:
> > > > > b50db7095fe002fa3e16605546cba66bf1b68a3e
> > > > > "x86/tsc: Disable clocksource watchdog for TSC on qualified platorms"
> > > > >
> > > > > There are now occasions where times between calls to the driver
> > > > > can be hundreds of seconds, and can result in the CPU frequency
> > > > > being left unnecessarily high for extended periods.
> > > > >
> > > > > From the number of clock cycles executed between these long
> > > > > durations one can tell that the CPU has been running code, but
> > > > > the driver never got called.
> > > > >
> > > > > Attached are some graphs from some trace data acquired using
> > > > > intel_pstate_tracer.py where one can observe an idle system
> > > > > between about 42 and well over 200 seconds elapsed time, yet
> > > > > the driver is never called for CPU10, which would have reduced
> > > > > its pstate request, until an elapsed time of 167.616 seconds,
> > > > > 126 seconds since the last call. The CPU frequency never does
> > > > > go to minimum.
> > > > >
> > > > > For reference, a similar CPU frequency graph is also attached,
> > > > > with the commit reverted. The CPU frequency drops to minimum
> > > > > over about 10 or 15 seconds.
> > > >
> > > > Commit b50db7095fe0 essentially disables the clocksource watchdog,
> > > > which literally doesn't have much to do with cpufreq code.
> > > >
> > > > One thing I can think of is that, without the patch, there is a
> > > > periodic clocksource timer running every 500 ms, and it loops to
> > > > run on all CPUs in turn. Your HW has 12 CPUs (from the graph), so
> > > > each CPU will get a timer (HW timer interrupt backed) every 6
> > > > seconds. Could this affect the cpufreq governor's work flow?
> > > > (I just quickly read some cpufreq code, and it seems there is
> > > > irq_work/workqueue involved.)
> > >
> > > 6 seconds is the longest duration I had ever seen on this processor
> > > before commit b50db7095fe0.
> > >
> > > I said "the times between calls to the driver have never exceeded 10
> > > seconds" originally, but that involved other processors.
> > >
> > > I also did longer, 9000 second tests:
> > >
> > > For a reverted kernel the driver was called 131,743 times, and the
> > > duration was never longer than 6.1 seconds.
> > >
> > > For a non-reverted kernel the driver was called 110,241 times, and
> > > 1397 times the duration was longer than 6.1 seconds, and the maximum
> > > duration was 303.6 seconds.
> >
> > Thanks for the data, which shows it is related to the removal of the
> > clocksource watchdog timers. And under this specific configuration,
> > the cpufreq work flow has some dependence on those watchdog timers.
> >
> > Also, could you share your kernel config, boot messages and some system
> > settings, like the tickless mode, so that other people can try to
> > reproduce?
> >
> > thanks
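As a side note on reproducing the duration measurements: below is a minimal
sketch of watching the same tracepoint that intel_pstate_tracer.py consumes,
assuming intel_pstate is the scaling driver and tracefs is mounted at
/sys/kernel/tracing (older kernels expose it at /sys/kernel/debug/tracing):

    # Enable the intel_pstate sample tracepoint and stream it (as root).
    # The gap between consecutive timestamps for a given CPU is the
    # "duration" between driver calls discussed above.
    cd /sys/kernel/tracing
    echo 1 > events/power/pstate_sample/enable
    cat trace_pipe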
> I steal the kernel configuration file from the Ubuntu mainline PPA [1], what
> they call "lowlatency", or 1000Hz tick. I make these changes before compiling:
>
> scripts/config --disable DEBUG_INFO
> scripts/config --disable SYSTEM_TRUSTED_KEYS
> scripts/config --disable SYSTEM_REVOCATION_KEYS
>
> I also send you the config and dmesg files in an off-list email.
>
> This is an idle-system and very-low-periodic-load type of test.
> My test computer has no GUI and very few services running.
> Notice that I have not used the word "regression" yet in this thread,
> because I don't know for certain that it is one. In the end, we don't care
> about CPU frequency, we care about wasting energy.
> It is definitely a change, and I am able to measure small increases in
> energy use, but this is all at the low end of the power curve.

What do you use to measure the energy use? And what difference do you
observe?

> So far I have not found a significant example of increased power use, but I
> also have not looked very hard.
>
> During any test, many monitoring tools might shorten durations.
> For example, if I run turbostat, say:
>
> sudo turbostat --Summary --quiet --show
> Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 2.5
>
> well, yes, then the maximum duration would be 2.5 seconds, because
> turbostat wakes up each CPU to inquire about things, causing a call to the
> CPU scaling driver. (I tested this, for about 900 seconds.)
>
> For my power tests I use a sample interval of >= 300 seconds.

So you use something like "turbostat sleep 900" for the power tests, and the
RAPL energy counters show the power difference? Can you paste the turbostat
output both w/ and w/o the watchdog?

Thanks,
rui

> For duration-only tests, turbostat is not run at the same time.
>
> My grub line:
>
> GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=314
> intel_pstate=active intel_pstate=no_hwp msr.allow_writes=on
> cpuidle.governor=teo"
>
> A typical pstate tracer command (with the script copied to the directory
> where I run this stuff):
>
> sudo ./intel_pstate_tracer.py --interval 600 --name vnew02 --memory 800000
>
> > > > Can you try one test that keeps all the current settings and changes
> > > > the irq affinity of the disk/network card to 0xfff to let interrupts
> > > > from them be distributed to all CPUs?
> > >
> > > I am willing to do the test, but I do not know how to change the irq
> > > affinity.
> >
> > I may have said that too soon. I used to
> > "echo fff > /proc/irq/xxx/smp_affinity" (xxx is the irq number of a
> > device) to let interrupts be distributed to all CPUs a long time ago,
> > but it doesn't work on my 2 desktops at hand. It seems recent kernels
> > only support one-CPU irq affinity.
> >
> > You can still try that command, though it may not work.
>
> I did not try this yet.
>
> [1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17-rc3/
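For anyone who wants to retry Feng's irq affinity experiment, here is a
minimal sketch that attempts it for every device irq at once rather than a
single one. It is only an illustration, not something tested in this thread:
the 0xfff mask assumes the 12-CPU machine above, and writes to managed or
per-CPU interrupts are expected to fail:

    # Run as root. Try to spread every device irq over CPUs 0-11
    # (mask 0xfff). Kernels that pin managed interrupts will reject
    # the write, so errors are silenced rather than treated as fatal.
    for irq in /proc/irq/[0-9]*; do
        echo fff > "$irq/smp_affinity" 2>/dev/null
    done
    grep . /proc/irq/*/smp_affinity   # show which masks actually changed

Watching /proc/interrupts afterwards would show whether the disk/network
interrupts really start landing on all CPUs.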