On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> On Tue, Mar 1, 2022 at 6:18 PM Doug Smythies <dsmythies@xxxxxxxxx> wrote:
> > On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> > > On Tue, Mar 1, 2022 at 6:53 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> > > > On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
> >
> > ...
> >
> > > > > However, it was a bit racy, so maybe it's good that it was not complete.
> > > > >
> > > > > Below is a new version.
> > > >
> > > > Thanks for the new version. I just gave it a try, and the occasional
> > > > long delay of cpufreq auto-adjusting I have seen can not be reproduced
> > > > after applying it.
> > >
> > > OK, thanks!
> > >
> > > I'll wait for feedback from Dough, though.
> >
> > Hi Rafael,
> >
> > Thank you for your version 2 patch.
> > I screwed up an overnight test and will have to re-do it.
> > However, I do have some results.
>
> Thanks for testing it!
>
> > From reading the patch code, one worry was the
> > potential to drive down the desired/required CPU
> > frequency for the main periodic workflow, causing
> > overruns, or inability of the task to complete its
> > work before the next period.
>
> It is not clear to me why you worried about that just from reading the
> patch? Can you explain, please?

It is already covered below. And a couple of further tests
directly contradict my thinking.

> > I have always had overrun
> > information, but it has never been relevant before.
> >
> > The other worry was if the threshold of
> > turbo/not turbo frequency is enough.
>
> Agreed.
Just as an easy example and test I did this on top of this patch ("rjw-3"):

doug@s19:~/kernel/linux$ git diff
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index f878a4545eee..5cbdd7e479e8 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
	 * P-states to prevent them from getting back to the high frequency
	 * right away after getting out of deep idle.
	 */
-	cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
+	cpuidle_update_retain_tick(pstate > (cpu->pstate.max_pstate >> 1));
	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
 }

> > I do not know how to test any final solution
> > thoroughly, as so far I have simply found a
> > good enough problematic example.
> > We have so many years of experience with
> > the convenient multi second NMI forcing
> > lingering high pstate clean up. I'd keep it
> > deciding within it if the TSC stuff needs to be
> > executed or not.
> >
> > Anyway...
> >
> > Base Kernel 5.17-rc3.
> > "stock" : unmodified.
> > "revert" : with commit b50db7095fe reverted
> > "rjw-2" : with this version 2 patch added.
> >
> > Test 1 (as before. There is no test 2, yet.):
> > 347 Hertz work/sleep frequency on one CPU while others do
> > virtually no load but enough to increase the requested pstate,
> > but at around 0.02 hertz aggregate.
> >
> > It is important to note the main load is approximately
> > 38.6% @ 2.422 GHz, or 100% at 0.935 GHz,
> > and almost exclusively uses idle state 2 (of
> > 4 total idle states):
> >
> > /sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
> > /sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
> > /sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
> > /sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
> >
> > Turbostat was used. ~10 samples at 300 seconds per.
> > Processor package power (Watts):
> >
> > Workflow was run for 1 hour each time or 1249201 loops.
> >
> > revert:
> > ave: 3.00
> > min: 2.89
> > max: 3.08
>
> I'm not sure what the above three numbers are.

Processor package power, Watts.

> > ave freq: 2.422 GHz.
> > overruns: 1.
> > max overrun time: 113 uSec.
> >
> > stock:
> > ave: 3.63 (+21%)
> > min: 3.28
> > max: 3.99
> > ave freq: 2.791 GHz.
> > overruns: 2.
> > max overrun time: 677 uSec.
> >
> > rjw-2:
> > ave: 3.14 (+5%)
> > min: 2.97
> > max: 3.28
> > ave freq: 2.635 GHz
>
> I guess the numbers above could be reduced still by using a P-state
> below the max non-turbo one as a limit.

Yes, and for a test I did "rjw-3".

> > overruns: 1042.
> > max overrun time: 9,769 uSec.
>
> This would probably get worse then, though.

Yes, that was my expectation, but not what happened.

rjw-3:
ave: 3.09 watts
min: 3.01 watts
max: 31.7 watts
ave freq: 2.42 GHz.
overruns: 12. (I did not expect this.)
Max overrun time: 621 uSec.

Note 1: IRQs increased by 74%, i.e. it was going in and out of
idle a lot more.
Note 2: We know that processor package power is highly temperature
dependent. I forgot to let my coolant cool adequately after the
kernel compile, and so had to throw out the first 4 power samples
(20 minutes).

I retested both rjw-2 and rjw-3, but with shorter tests, and got 0
overruns in both cases.

> ATM I'm not quite sure why this happens, but you seem to have some
> insight into it, so it would help if you shared it.

My insight seems questionable.
My thinking was that one cannot decide whether the pstate needs
to go down or not based on such a localized look, the risk
being that the higher periodic load might suffer overruns.
Since my first test did exactly that, I violated my own
"repeat all tests 3 times before reporting" rule.
Now, I am not sure what is going on.
I will need more time to acquire traces and dig into it.
I also did a 1 hour intel_pstate_tracer test, with rjw-2, on an
idle system and saw several long durations. This was expected, as
this patch set wouldn't change durations by more than a few jiffies.

755 long durations (>6.1 seconds), and 327.7 seconds longest.

... Doug