On Wed, 2018-04-11 at 09:26 -0700, Francisco Jerez wrote:
> > "just like" here is possibly somewhat unfair to the schedutil governor; admittedly its progressive IOWAIT boosting behavior seems somewhat less wasteful than the intel_pstate non-HWP governor's IOWAIT boosting behavior, but it's still largely unhelpful under IO-bound conditions.

OK, if you think so, then improve it for the schedutil governor or other mechanisms (as Juri suggested) instead of intel_pstate. This will benefit all architectures, including x86 + non-i915. BTW, intel_pstate can be driven by the schedutil governor (passive mode), so if you prove the benefits on Broxton, this can become the default.

As before:
- No regression in idle power at all. This is more important than benchmarks.
- Not just score; performance/watt is important.

Thanks,
Srinivas

> > [...] controller does, even though the frequent IO waits may actually be an indication that the system is IO-bound (which means that the large energy usage increase may not be translated into any performance benefit in practice, not to speak of performance being impacted negatively in TDP-bound scenarios like GPU rendering).
> >
> > Regarding run-time complexity, I haven't observed this governor to be measurably more computationally intensive than the present one. It's a bunch more instructions indeed, but still within the same ballpark as the current governor. The average increase in CPU utilization on my BXT with this series is less than 0.03% (sampled via ftrace for v1; I can repeat the measurement for the v2 I have in the works, though I don't expect the result to be substantially different). If this is a problem for you, there are several optimization opportunities that would cut down the number of CPU cycles get_target_pstate_lp() takes to execute by a large percent (most of the optimization ideas I can think of right now would come at some accuracy/maintainability/debuggability cost, but may still be worth pursuing), but the computational overhead is low enough at this point that the impact on any benchmark or real workload would be orders of magnitude lower than its variance, which makes it kind of difficult to keep the discussion data-driven [as possibly any performance optimization discussion should ever be ;)].
> > >
> > > Thanks,
> > > Srinivas
> > >
> > > > > [Absolute benchmark results are unfortunately omitted from this letter due to company policies, but the percent change and Student's T p-value are included above and in the referenced benchmark results]
> > > > >
> > > > > The most obvious impact of this series will likely be the overall improvement in graphics performance on systems with an IGP integrated into the processor package (though for the moment this is only enabled on BXT+), because the TDP budget shared among CPU and GPU can frequently become a limiting factor in low-power devices. On heavily TDP-bound devices this series improves performance of virtually any non-trivial graphics rendering by a significant amount (of the order of the energy efficiency improvement for that workload, assuming the optimization didn't cause it to become non-TDP-bound).
> > > > > See [1]-[5] for detailed numbers including various graphics benchmarks and a sample of the Phoronix daily-system-tracker. Some popular graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve between 5% and 11% on our systems. The exact improvement can vary substantially between systems (compare the benchmark results from the two different J3455 systems [1] and [3]) due to a number of factors, including the ratio between CPU and GPU processing power, the behavior of the userspace graphics driver, the windowing system and resolution, the BIOS (which has an influence on the package TDP), the thermal characteristics of the system, etc.
> > > > >
> > > > > Unigine Valley and Heaven improve by a similar factor on some systems (see the J3455 results [1]), but on others the improvement is lower because the benchmark fails to fully utilize the GPU, which causes the heuristic to remain in the low-latency state for longer, which leaves a reduced TDP budget available to the GPU, which prevents performance from increasing further. This can be avoided by using the alternative heuristic parameters suggested in the commit message of PATCH 8, which provide a lower IO utilization threshold and hysteresis for the controller to attempt to save energy. I'm not proposing those for upstream (yet) because they would also increase the risk of latency-sensitive IO-heavy workloads regressing (like SynMark2 OglTerrainFly* and some arguably poorly designed IPC-bound X11 benchmarks).
> > > > >
> > > > > Discrete graphics aren't likely to experience that much of a visible improvement from this, even though many non-IGP workloads *could* benefit from reducing the system's energy usage while the discrete GPU (or really, any other IO device) becomes a bottleneck, but this is not attempted in this series, since that would involve making an energy efficiency/latency trade-off that only the maintainers of the respective drivers are in a position to make. The cpufreq interface introduced in PATCH 1 to achieve this is left as an opt-in for that reason; only the i915 DRM driver is hooked up, since it will get the most direct pay-off due to the increased energy budget available to the GPU, but other power-hungry third-party gadgets built into the same package (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may be able to benefit from this interface eventually by instrumenting the driver in a similar way.
> > > > >
> > > > > The cpufreq interface is not exclusively tied to the intel_pstate driver, because other governors can make use of the statistic calculated as a result to avoid over-optimizing for latency in scenarios where a lower frequency would be able to achieve similar throughput while using less energy.
> > > > > The interpretation of this statistic relies on the observation that, for as long as the system is CPU-bound, any IO load occurring as a result of the execution of a program will scale roughly linearly with the clock frequency the program is run at, so (assuming that the CPU has enough processing power) a point will be reached at which the program won't be able to execute faster with increasing CPU frequency because the throughput limits of some device will have been attained. Increasing frequencies past that point only pessimizes energy usage for no real benefit -- the optimal behavior is for the CPU to lock to the minimum frequency that is able to keep the IO devices involved fully utilized (assuming we are past the maximum-efficiency inflection point of the CPU's power-to-frequency curve), which is roughly the goal of this series.
> > > > >
> > > > > PELT could be a useful extension for this model, since its largely heuristic assumptions would become more accurate if the IO and CPU load could be tracked separately for each scheduling entity, but this is not attempted in this series because the additional complexity and computational cost of such an approach are hard to justify at this stage, particularly since the current governor has similar limitations.
> > > > >
> > > > > Various frequency and step-function response graphs are available in [6]-[9] for comparison (obtained empirically on a BXT J3455 system). The response curves for the low-latency and low-power states of the heuristic are shown separately -- as you can see, they roughly bracket the frequency response curve of the current governor. The step response of the aggressive heuristic is within a single update period (even though it's not quite obvious from the graph with the levels of zoom provided). I'll attach benchmark results from a slower but non-TDP-limited machine (which means there will be no TDP budget increase that could possibly mask a performance regression of another kind) as soon as they come out.
> > > > >
> > > > > Thanks to Eero and Valtteri for testing a number of intermediate revisions of this series (and there were quite a few of them) on more than half a dozen systems; they helped spot quite a few issues in earlier versions of this heuristic.
> > > > >
> > > > > [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time.
> > > > > [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs"
> > > > > [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names"
> > > > > [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()"
> > > > > [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
> > > > > [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core.
> > > > > [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile.
> > > > > [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs.
> > > > > [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
> > > > >
> > > > > [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
> > > > > [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
> > > > > [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
> > > > > [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
> > > > > [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
> > > > > [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
> > > > > [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
> > > > > [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
> > > > > [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg
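
For readers who want a concrete feel for the frequency-selection principle described in the quoted cover letter above (lock the CPU to the lowest frequency that keeps the bottlenecking IO device fully utilized), here is a minimal, self-contained C sketch of that idea. It is only an illustration under simplified assumptions: the struct layout, helper name, ~90% saturation threshold and fixed-point scaling below are hypothetical and are not the controller implemented by PATCH 6, which is a variably low-pass filtering controller driven by the aggregated IO active time tracked by the PATCH 1 infrastructure (with i915 reporting GPU rendering as IO activity in PATCH 9).

/*
 * Illustrative sketch only -- not the code from this series.  While the
 * bottleneck is an IO device, pick the lowest CPU frequency that still
 * keeps that device fully utilized instead of boosting to the maximum.
 */
#include <stdint.h>
#include <stdio.h>

struct sample {
	uint64_t io_active_ns;  /* time some IO device had work in flight */
	uint64_t cpu_active_ns; /* non-idle CPU time over the same window */
	uint64_t window_ns;     /* length of the sampling window */
};

static uint64_t pick_target_freq(const struct sample *s, uint64_t freq_cur,
				 uint64_t freq_min, uint64_t freq_max)
{
	if (!s->window_ns)
		return freq_cur;

	/* Fixed-point utilizations in the range [0, 1024]. */
	uint64_t io_util  = (s->io_active_ns  << 10) / s->window_ns;
	uint64_t cpu_util = (s->cpu_active_ns << 10) / s->window_ns;

	/* IO device not saturated: behave like a latency-minimizing governor. */
	if (io_util < 922 /* ~90%, hypothetical threshold */)
		return freq_max;

	/*
	 * IO-bound: the CPU only needs enough speed to keep the device fed.
	 * Scale the current frequency by CPU utilization so the CPU ends up
	 * close to fully busy at the device-limited level of activity, then
	 * clamp to the supported range.
	 */
	uint64_t freq = (freq_cur * cpu_util) >> 10;
	if (freq < freq_min)
		freq = freq_min;
	if (freq > freq_max)
		freq = freq_max;
	return freq;
}

int main(void)
{
	/* GPU ~100% busy, CPU ~50% busy at 2.0 GHz over a 10 ms window. */
	struct sample s = {
		.io_active_ns  = 10000000,
		.cpu_active_ns = 5000000,
		.window_ns     = 10000000,
	};

	printf("target: %llu kHz\n",
	       (unsigned long long)pick_target_freq(&s, 2000000, 800000, 2400000));
	return 0;
}

Compiling and running this prints a target of 1000000 kHz for the example sample, i.e. the CPU would be asked to run at roughly half speed for as long as the GPU remains the bottleneck, which is the steady-state intent the cover letter describes; the actual series adds filtering and hysteresis on top of this basic idea.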