On Wed, Mar 2, 2022 at 11:01 AM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> On Wed, Mar 2, 2022 at 5:06 AM Doug Smythies <dsmythies@xxxxxxxxx> wrote:
> > On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> > > On Tue, Mar 1, 2022 at 6:18 PM Doug Smythies <dsmythies@xxxxxxxxx> wrote:
> > > > On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> > > > > On Tue, Mar 1, 2022 at 6:53 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> > > > > > On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
> > > > ...
> > > > > > > However, it was a bit racy, so maybe it's good that it was not complete.
> > > > > > >
> > > > > > > Below is a new version.
> > > > > >
> > > > > > Thanks for the new version. I just gave it a try, and the occasional long delay of cpufreq auto-adjusting I have seen can not be reproduced after applying it.
> > > > >
> > > > > OK, thanks!
> > > > >
> > > > > I'll wait for feedback from Doug, though.
> > > >
> > > > Hi Rafael,
> > > >
> > > > Thank you for your version 2 patch. I screwed up an overnight test and will have to re-do it. However, I do have some results.
> > >
> > > Thanks for testing it!
> > >
> > > > From reading the patch code, one worry was the potential to drive down the desired/required CPU frequency for the main periodic workflow, causing overruns, or the inability of the task to complete its work before the next period.
> > >
> > > It is not clear to me why you worried about that just from reading the patch? Can you explain, please?
> >
> > It is already covered below. And a couple of further tests directly contradict my thinking.
> >
> > > > I have always had overrun information, but it has never been relevant before.
> > > >
> > > > The other worry was whether the threshold of turbo/not turbo frequency is enough.
> > >
> > > Agreed.
> >
> > Just as an easy example and test I did this on top of this patch ("rjw-3"):
> >
> > doug@s19:~/kernel/linux$ git diff
> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > index f878a4545eee..5cbdd7e479e8 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
> >          * P-states to prevent them from getting back to the high frequency
> >          * right away after getting out of deep idle.
> >          */
> > -       cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
> > +       cpuidle_update_retain_tick(pstate > (cpu->pstate.max_pstate >> 1));
>
> OK, but cpu->pstate.max_pstate / 2 may be almost cpu->pstate.min_pstate, which would be better to use here IMO.

For my processor, an i5-10600K, it works out to 2.05 GHz (4.1 GHz / 2), whereas the min pstate would be 0.8 GHz.

> Or something like (cpu->pstate.max_pstate + cpu->pstate.min_pstate) / 2.
> Can you try this one in particular?

Which I agree would be a much better general case to use. For my processor that would be 2.45 GHz: (4.1 + 0.8) / 2. My code was just for testing, and I intended to go further than we might want to for the final solution. I'm holding off on trying your suggestion, for now.

> >                 wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
> >  }
> >
> > > > I do not know how to test any final solution thoroughly, as so far I have simply found a good enough problematic example.
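Coming back to the threshold numbers above, and just to make the two candidates concrete, below is a tiny stand-alone comparison of my "max_pstate >> 1" test hack and the suggested (max_pstate + min_pstate) / 2 midpoint, using my i5-10600K P-state values. It is only an illustration I put together while thinking about it; the helper names are invented and nothing in it comes from the patch itself.

/*
 * Untested, stand-alone illustration only (not from the patch): compares
 * the two retain-tick thresholds being discussed, using my i5-10600K
 * numbers (max_pstate 41 = 4.1 GHz, min_pstate 8 = 0.8 GHz, 100 MHz per
 * P-state step).  The helper names are invented for this sketch.
 */
#include <stdio.h>

static int threshold_half_max(int max_pstate)
{
        return max_pstate >> 1;                 /* what I tried in "rjw-3" */
}

static int threshold_midpoint(int max_pstate, int min_pstate)
{
        return (max_pstate + min_pstate) / 2;   /* the suggested alternative */
}

int main(void)
{
        int max_pstate = 41, min_pstate = 8;    /* i5-10600K */

        printf("max_pstate >> 1: pstate %2d (%d MHz)\n",
               threshold_half_max(max_pstate),
               threshold_half_max(max_pstate) * 100);
        printf("(max + min) / 2: pstate %2d (%d MHz)\n",
               threshold_midpoint(max_pstate, min_pstate),
               threshold_midpoint(max_pstate, min_pstate) * 100);
        return 0;
}

With the 100 MHz per P-state step on this machine it prints pstate 20 (2.0 GHz) versus pstate 24 (2.4 GHz); the integer arithmetic rounds down slightly from the 2.05 GHz and 2.45 GHz figures above.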
> > > > We have so many years of experience with the convenient multi-second NMI forcing lingering high-pstate cleanup. I'd keep it, deciding within it whether the TSC stuff needs to be executed or not.
> > > >
> > > > Anyway...
> > > >
> > > > Base Kernel 5.17-rc3.
> > > > "stock" : unmodified.
> > > > "revert" : with commit b50db7095fe reverted
> > > > "rjw-2" : with this version 2 patch added.
> > > >
> > > > Test 1 (as before; there is no test 2 yet):
> > > > 347 Hertz work/sleep frequency on one CPU while the others run virtually no load, but enough to increase the requested pstate, at around 0.02 hertz aggregate.
> > > >
> > > > It is important to note the main load is approximately 38.6% @ 2.422 GHz, or 100% at 0.935 GHz, and almost exclusively uses idle state 2 (of 4 total idle states).
> > > >
> > > > /sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
> > > > /sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
> > > > /sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
> > > > /sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
> > > >
> > > > Turbostat was used: ~10 samples at 300 seconds per sample. Processor package power (Watts):
> > > >
> > > > The workflow was run for 1 hour each time, or 1249201 loops.
> > > >
> > > > revert:
> > > > ave: 3.00
> > > > min: 2.89
> > > > max: 3.08
> > >
> > > I'm not sure what the above three numbers are.
> >
> > Processor package power, Watts.
>
> OK

> > > > ave freq: 2.422 GHz.
> > > > overruns: 1.
> > > > max overrun time: 113 uSec.
> > > >
> > > > stock:
> > > > ave: 3.63 (+21%)
> > > > min: 3.28
> > > > max: 3.99
> > > > ave freq: 2.791 GHz.
> > > > overruns: 2.
> > > > max overrun time: 677 uSec.
> > > >
> > > > rjw-2:
> > > > ave: 3.14 (+5%)
> > > > min: 2.97
> > > > max: 3.28
> > > > ave freq: 2.635 GHz
> > >
> > > I guess the numbers above could be reduced still by using a P-state below the max non-turbo one as a limit.
> >
> > Yes, and for a test I did "rjw-3".
> >
> > > > overruns: 1042.
> > > > max overrun time: 9,769 uSec.
> > >
> > > This would probably get worse then, though.
> >
> > Yes, that was my expectation, but not what happened.
> >
> > rjw-3:
> > ave: 3.09 watts
> > min: 3.01 watts
> > max: 31.7 watts
>
> Did you mean 3.17 here? It would be hard to get the average of 3.09 if the max was over 30 W.

Yes, 3.17 watts was what I meant to write. Sorry for any confusion.

> > ave freq: 2.42 GHz.
> > overruns: 12. (I did not expect this.)
> > max overrun time: 621 uSec.
> >
> > Note 1: IRQs increased by 74%, i.e. it was going in and out of idle a lot more.
>
> That's because the scheduler tick is allowed to run a lot more often in the given workload with the changed test above.

Agreed.

> > Note 2: We know that processor package power is highly temperature dependent. I forgot to let my coolant cool adequately after the kernel compile, and so had to throw out the first 4 power samples (20 minutes).
> >
> > I retested both rjw-2 and rjw-3, albeit with shorter tests, and got 0 overruns in both cases.
> >
> > > ATM I'm not quite sure why this happens, but you seem to have some insight into it, so it would help if you shared it.
> >
> > My insight seems questionable.
> >
> > My thinking was that one cannot decide whether the pstate needs to go down or not based on such a localized look, the risk being that the higher periodic load might suffer overruns.
> > Since my first test did exactly that, I violated my own "repeat all tests 3 times before reporting" rule. Now, I am not sure what is going on. I will need more time to acquire traces and dig into it.
>
> It looks like the workload's behavior depends on updating the CPU performance scaling governor sufficiently often.
>
> Previously, that happened through the watchdog workflow that is gone now. The rjw-2/3 patch is attempting to make up for that by letting the tick run more often.
>
> Admittedly, it is somewhat unclear to me why there are not so many overruns in the "stock" kernel test. Perhaps that's because the CPUs generally run fast enough in that case, so the P-state governor need not be updated so often for the CPU frequency to stay high, and that's what determines the performance (and in turn that decides whether or not there are overruns and how often they occur). The other side of the coin is that the updates don't occur often enough for the power draw to be reasonable, though.

Agreed. I am observing higher CPU frequencies than expected, and cannot determine why in all cases. It is going to take several days before I can either reply with more detail or disprove what I think I am observing.

> Presumably, when the P-state governor gets updated more often (via the scheduler tick), but not often enough, it effectively causes the frequency (and consequently the performance) to drop over relatively long time intervals (and hence the increased occurrence of overruns). If it gets updated even more often (like in rjw-3), though, it causes the CPU frequency to follow the utilization more precisely, and so the CPUs don't run "too fast" or "too slowly" for too long.

Agreed.

> > I also did a 1-hour intel_pstate_tracer test, with rjw-2, on an idle system and saw several long durations. This was expected, as this patch set wouldn't change durations by more than a few jiffies.
> > 755 long durations (>6.1 seconds), and 327.7 seconds longest.
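In case it helps with interpreting the overrun numbers, the general shape of the Test 1 load is roughly as follows. This is not my actual test program, just a minimal sketch of a fixed-period (347 Hz) work/sleep loop that counts an overrun whenever an iteration is not finished by the time its next period starts; the busy-loop count would have to be tuned to get the ~38.6% duty cycle, and the process pinned to one CPU (with taskset, say).

/*
 * Rough sketch only, not my actual test program: a 347 Hz work/sleep
 * loop that counts "overruns", i.e. iterations whose work runs past the
 * start of the next period, and tracks the worst overrun time.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define PERIOD_NS (1000000000L / 347)   /* ~2.88 ms period, 347 Hz */

static int64_t ts_ns(const struct timespec *ts)
{
        return (int64_t)ts->tv_sec * 1000000000L + ts->tv_nsec;
}

int main(void)
{
        struct timespec next, now;
        long loops, overruns = 0;
        int64_t max_over_ns = 0;
        volatile double x = 0.0;

        clock_gettime(CLOCK_MONOTONIC, &next);

        for (loops = 0; loops < 1249201; loops++) {     /* ~1 hour at 347 Hz */
                /* Burn some CPU; tune this count to hit the desired duty cycle. */
                for (long i = 0; i < 200000; i++)
                        x += i * 1e-9;

                /* Advance the deadline by one period. */
                next.tv_nsec += PERIOD_NS;
                while (next.tv_nsec >= 1000000000L) {
                        next.tv_nsec -= 1000000000L;
                        next.tv_sec++;
                }

                clock_gettime(CLOCK_MONOTONIC, &now);
                if (ts_ns(&now) > ts_ns(&next)) {
                        /* Work ran past the start of the next period: an overrun. */
                        int64_t over = ts_ns(&now) - ts_ns(&next);

                        overruns++;
                        if (over > max_over_ns)
                                max_over_ns = over;
                        next = now;     /* resynchronize after the overrun */
                } else {
                        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
                }
        }

        printf("loops %ld, overruns %ld, max overrun %lld usec\n",
               loops, overruns, (long long)(max_over_ns / 1000));
        return 0;
}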