On Wed, Mar 2, 2022 at 11:01 AM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> On Wed, Mar 2, 2022 at 5:06 AM Doug Smythies <dsmythies@xxxxxxxxx> wrote:
> > On Tue, Mar 1, 2022 at 9:34 AM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> > > On Tue, Mar 1, 2022 at 6:18 PM Doug Smythies <dsmythies@xxxxxxxxx> wrote:
> > > > On Tue, Mar 1, 2022 at 3:58 AM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> > > > > On Tue, Mar 1, 2022 at 6:53 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> > > > > > On Mon, Feb 28, 2022 at 08:36:03PM +0100, Rafael J. Wysocki wrote:
> > > > ...
> > > > > > > However, it was a bit racy, so maybe it's good that it was not complete.
> > > > > > >
> > > > > > > Below is a new version.
> > > > > >
> > > > > > Thanks for the new version. I just gave it a try, and the occasional long delay of cpufreq auto-adjusting I have seen can not be reproduced after applying it.
> > > > >
> > > > > OK, thanks!
> > > > >
> > > > > I'll wait for feedback from Doug, though.
> > > >
> > > > Hi Rafael,
> > > >
> > > > Thank you for your version 2 patch. I screwed up an overnight test and will have to re-do it. However, I do have some results.
> > >
> > > Thanks for testing it!
> > >
> > > > From reading the patch code, one worry was the potential to drive down the desired/required CPU frequency for the main periodic workflow, causing overruns, or the inability of the task to complete its work before the next period.
> > >
> > > It is not clear to me why you worried about that just from reading the patch? Can you explain, please?
> >
> > It is already covered below. And a couple of further tests directly contradict my thinking.
> >
> > > > I have always had overrun information, but it has never been relevant before.
> > > >
> > > > The other worry was whether the threshold of turbo/not turbo frequency is enough.
> > >
> > > Agreed.
> >
> > Just as an easy example and test I did this on top of this patch ("rjw-3"):
> >
> > doug@s19:~/kernel/linux$ git diff
> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > index f878a4545eee..5cbdd7e479e8 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -1980,7 +1980,7 @@ static void intel_pstate_update_perf_ctl(struct cpudata *cpu)
> >          * P-states to prevent them from getting back to the high frequency
> >          * right away after getting out of deep idle.
> >          */
> > -       cpuidle_update_retain_tick(pstate > cpu->pstate.max_pstate);
> > +       cpuidle_update_retain_tick(pstate > (cpu->pstate.max_pstate >> 1));
>
> OK, but cpu->pstate.max_pstate / 2 may be almost cpu->pstate.min_pstate, which would be better to use here IMO.

For my processor, an i5-10600K, it works out to 2.05 GHz (4.1 GHz / 2), whereas the min pstate would be 0.8 GHz.

> Or something like (cpu->pstate.max_pstate + cpu->pstate.min_pstate) / 2.
> Can you try this one in particular?

Which I agree would be a much better general case to use. For my processor that would be 2.45 GHz: (4.1 + 0.8) / 2. My code was just for testing, and I intended to go further than we might want to for the final solution. I'm holding off on trying your suggestion, for now.

> >                 wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
> >  }
> >
> > > > I do not know how to test any final solution thoroughly, as so far I have simply found a good enough problematic example.
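Coming back to the threshold numbers above, and just to make the two candidates concrete, below is a tiny stand-alone comparison of my "max_pstate >> 1" test hack and the suggested (max_pstate + min_pstate) / 2 midpoint, using my i5-10600K P-state values. It is only an illustration I put together while thinking about it; the helper names are invented and nothing in it comes from the patch itself.

/*
 * Untested, stand-alone illustration only (not from the patch): compares
 * the two retain-tick thresholds being discussed, using my i5-10600K
 * numbers (max_pstate 41 = 4.1 GHz, min_pstate 8 = 0.8 GHz, 100 MHz per
 * P-state step).  The helper names are invented for this sketch.
 */
#include <stdio.h>

static int threshold_half_max(int max_pstate)
{
        return max_pstate >> 1;                 /* what I tried in "rjw-3" */
}

static int threshold_midpoint(int max_pstate, int min_pstate)
{
        return (max_pstate + min_pstate) / 2;   /* the suggested alternative */
}

int main(void)
{
        int max_pstate = 41, min_pstate = 8;    /* i5-10600K */

        printf("max_pstate >> 1: pstate %2d (%d MHz)\n",
               threshold_half_max(max_pstate),
               threshold_half_max(max_pstate) * 100);
        printf("(max + min) / 2: pstate %2d (%d MHz)\n",
               threshold_midpoint(max_pstate, min_pstate),
               threshold_midpoint(max_pstate, min_pstate) * 100);
        return 0;
}

With the 100 MHz per P-state step on this machine it prints pstate 20 (2.0 GHz) versus pstate 24 (2.4 GHz); the integer arithmetic rounds down slightly from the 2.05 GHz and 2.45 GHz figures above.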
> > > > We have so many years of experience with the convenient multi-second NMI forcing lingering high-pstate cleanup. I'd keep it, deciding within it whether the TSC stuff needs to be executed or not.
> > > >
> > > > Anyway...
> > > >
> > > > Base Kernel 5.17-rc3.
> > > > "stock" : unmodified.
> > > > "revert" : with commit b50db7095fe reverted
> > > > "rjw-2" : with this version 2 patch added.
> > > >
> > > > Test 1 (as before; there is no test 2 yet):
> > > > 347 Hertz work/sleep frequency on one CPU while the others run virtually no load, but enough to increase the requested pstate, at around 0.02 hertz aggregate.
> > > >
> > > > It is important to note the main load is approximately 38.6% @ 2.422 GHz, or 100% at 0.935 GHz, and almost exclusively uses idle state 2 (of 4 total idle states).
> > > >
> > > > /sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
> > > > /sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
> > > > /sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
> > > > /sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
> > > >
> > > > Turbostat was used: ~10 samples at 300 seconds per sample. Processor package power (Watts):
> > > >
> > > > The workflow was run for 1 hour each time, or 1249201 loops.
> > > >
> > > > revert:
> > > > ave: 3.00
> > > > min: 2.89
> > > > max: 3.08
> > >
> > > I'm not sure what the above three numbers are.
> >
> > Processor package power, Watts.
>
> OK

> > > > ave freq: 2.422 GHz.
> > > > overruns: 1.
> > > > max overrun time: 113 uSec.
> > > >
> > > > stock:
> > > > ave: 3.63 (+21%)
> > > > min: 3.28
> > > > max: 3.99
> > > > ave freq: 2.791 GHz.
> > > > overruns: 2.
> > > > max overrun time: 677 uSec.
> > > >
> > > > rjw-2:
> > > > ave: 3.14 (+5%)
> > > > min: 2.97
> > > > max: 3.28
> > > > ave freq: 2.635 GHz
> > >
> > > I guess the numbers above could be reduced still by using a P-state below the max non-turbo one as a limit.
> >
> > Yes, and for a test I did "rjw-3".
> >
> > > > overruns: 1042.
> > > > max overrun time: 9,769 uSec.
> > >
> > > This would probably get worse then, though.
> >
> > Yes, that was my expectation, but not what happened.
> >
> > rjw-3:
> > ave: 3.09 watts
> > min: 3.01 watts
> > max: 31.7 watts
>
> Did you mean 3.17 here? It would be hard to get the average of 3.09 if the max was over 30 W.

Yes, 3.17 watts was what I meant to write. Sorry for any confusion.

> > ave freq: 2.42 GHz.
> > overruns: 12. (I did not expect this.)
> > max overrun time: 621 uSec.
> >
> > Note 1: IRQs increased by 74%, i.e. it was going in and out of idle a lot more.
>
> That's because the scheduler tick is allowed to run a lot more often in the given workload with the changed test above.

Agreed.

> > Note 2: We know that processor package power is highly temperature dependent. I forgot to let my coolant cool adequately after the kernel compile, and so had to throw out the first 4 power samples (20 minutes).
> >
> > I retested both rjw-2 and rjw-3, albeit with shorter tests, and got 0 overruns in both cases.
> >
> > > ATM I'm not quite sure why this happens, but you seem to have some insight into it, so it would help if you shared it.
> >
> > My insight seems questionable.
> >
> > My thinking was that one cannot decide whether the pstate needs to go down or not based on such a localized look, the risk being that the higher periodic load might suffer overruns.
> > Since my first test did exactly that, I violated my own "repeat all tests 3 times before reporting" rule. Now, I am not sure what is going on. I will need more time to acquire traces and dig into it.
>
> It looks like the workload's behavior depends on updating the CPU performance scaling governor sufficiently often.
>
> Previously, that happened through the watchdog workflow that is gone now. The rjw-2/3 patch is attempting to make up for that by letting the tick run more often.
>
> Admittedly, it is somewhat unclear to me why there are not so many overruns in the "stock" kernel test. Perhaps that's because the CPUs generally run fast enough in that case, so the P-state governor need not be updated so often for the CPU frequency to stay high, and that's what determines the performance (and in turn that decides whether or not there are overruns and how often they occur). The other side of the coin is that the updates don't occur often enough for the power draw to be reasonable, though.

Agreed. I am observing higher CPU frequencies than expected, and cannot determine why in all cases. It is going to take several days before I can either reply with more detail or disprove what I think I am observing.

> Presumably, when the P-state governor gets updated more often (via the scheduler tick), but not often enough, it effectively causes the frequency (and consequently the performance) to drop over relatively long time intervals (and hence the increased occurrence of overruns). If it gets updated even more often (like in rjw-3), though, it causes the CPU frequency to follow the utilization more precisely, and so the CPUs don't run "too fast" or "too slowly" for too long.

Agreed.

> > I also did a 1-hour intel_pstate_tracer test, with rjw-2, on an idle system and saw several long durations. This was expected, as this patch set wouldn't change durations by more than a few jiffies.
> > 755 long durations (>6.1 seconds), and 327.7 seconds longest.
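In case it helps with interpreting the overrun numbers, the general shape of the Test 1 load is roughly as follows. This is not my actual test program, just a minimal sketch of a fixed-period (347 Hz) work/sleep loop that counts an overrun whenever an iteration is not finished by the time its next period starts; the busy-loop count would have to be tuned to get the ~38.6% duty cycle, and the process pinned to one CPU (with taskset, say).

/*
 * Rough sketch only, not my actual test program: a 347 Hz work/sleep
 * loop that counts "overruns", i.e. iterations whose work runs past the
 * start of the next period, and tracks the worst overrun time.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define PERIOD_NS (1000000000L / 347)   /* ~2.88 ms period, 347 Hz */

static int64_t ts_ns(const struct timespec *ts)
{
        return (int64_t)ts->tv_sec * 1000000000L + ts->tv_nsec;
}

int main(void)
{
        struct timespec next, now;
        long loops, overruns = 0;
        int64_t max_over_ns = 0;
        volatile double x = 0.0;

        clock_gettime(CLOCK_MONOTONIC, &next);

        for (loops = 0; loops < 1249201; loops++) {     /* ~1 hour at 347 Hz */
                /* Burn some CPU; tune this count to hit the desired duty cycle. */
                for (long i = 0; i < 200000; i++)
                        x += i * 1e-9;

                /* Advance the deadline by one period. */
                next.tv_nsec += PERIOD_NS;
                while (next.tv_nsec >= 1000000000L) {
                        next.tv_nsec -= 1000000000L;
                        next.tv_sec++;
                }

                clock_gettime(CLOCK_MONOTONIC, &now);
                if (ts_ns(&now) > ts_ns(&next)) {
                        /* Work ran past the start of the next period: an overrun. */
                        int64_t over = ts_ns(&now) - ts_ns(&next);

                        overruns++;
                        if (over > max_over_ns)
                                max_over_ns = over;
                        next = now;     /* resynchronize after the overrun */
                } else {
                        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
                }
        }

        printf("loops %ld, overruns %ld, max overrun %lld usec\n",
               loops, overruns, (long long)(max_over_ns / 1000));
        return 0;
}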