On Wednesday, July 22, 2020 1:14:42 AM CEST Francisco Jerez wrote:
> Srinivas Pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx> writes:
>
> > On Mon, 2020-07-20 at 16:20 -0700, Francisco Jerez wrote:
> >> "Rafael J. Wysocki" <rafael@xxxxxxxxxx> writes:
> >>
> >> > On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez
> >> > <currojerez@xxxxxxxxxx> wrote:
> >> > > "Rafael J. Wysocki" <rafael@xxxxxxxxxx> writes:
> >> > >
>
> > [...]
> >
> >> > Overall, so far, I'm seeing a claim that the CPU subsystem can be
> >> > made to use less energy and do as much work as before (which is
> >> > what improving the energy-efficiency means in general) if the
> >> > maximum frequency of CPUs is limited in a clever way.
> >> >
> >> > I'm failing to see what that clever way is, though.
> >>
> >> Hopefully the clarifications above help some.
> >
> > To simplify:
> >
> > Suppose I call numpy.multiply() to multiply two big arrays and the
> > thread is pinned to a CPU.  Let's say the CPU finishes the job in
> > 10 ms while running at a P-state of 0x20, but the same job could
> > have been done in 10 ms even at a P-state of 0x16.  So we are not
> > energy-efficient.  To really know where the bottleneck is, there
> > are a number of perf counters; maybe the cache was the issue and we
> > could rather raise the uncore frequency a little.  The APERF and
> > MPERF counters alone are not enough.
>
> Yes, that's right.  APERF and MPERF aren't sufficient to identify
> every kind of possible bottleneck; some visibility into the
> utilization of other subsystems is necessary in addition -- like
> e.g. the instrumentation introduced in my series to detect a GPU
> bottleneck.  A bottleneck condition in an IO device can be
> communicated to CPUFREQ

It generally is not sufficient to communicate it to cpufreq.  It needs
to be communicated to the CPU scheduler.

> by adjusting a PM QoS latency request (link [2] in my previous reply)
> that effectively gives the governor permission to rearrange CPU work
> arbitrarily within the specified time frame (which should be of the
> order of the natural latency of the IO device -- e.g. at least the
> rendering time of a frame for a GPU) in order to minimize energy
> usage.

OK, we need to talk more about this.

> > Or we characterize the workload at different P-states and set
> > limits.  I think this is not what you mean by energy efficiency
> > with your changes.
> >
> > The way you are trying to improve "performance" is by having the
> > caller (the device driver) say how important the job at hand is.
> > Here, suppose the device driver offloads this calculation to some
> > GPU and can wait up to 10 ms; you want to tell the CPU to be slow.
> > But the moment the P-state driver observes that there is a chance
> > of overshooting the latency, it will immediately ask for a higher
> > P-state.  So you want P-state limits based on the latency
> > requirements of the caller.  Since the caller has more knowledge of
> > the latency requirement, this allows other devices sharing the
> > power budget to get more or less power, and improves overall energy
> > efficiency as the combined performance of the system is improved.
> > Is this correct?
>
> Yes, pretty much.

OK
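
For reference, below is a minimal sketch of how a device driver can add
and adjust a CPU PM QoS request at runtime, using the mainline CPU
latency QoS helpers from include/linux/pm_qos.h.  Note the caveat: the
request discussed above (link [2] in Francisco's earlier reply) is a new
request type introduced by his series with roughly the opposite sense --
it grants the governor slack to rearrange work within a time window
rather than demanding responsiveness -- so this only illustrates the
shape of such an interface, and the gpu_* wrapper names are made up for
the example.

/*
 * Illustration only: mainline CPU latency QoS.  The semantics differ
 * from the request type proposed in the series, and the gpu_* names
 * are hypothetical.
 */
#include <linux/pm_qos.h>

static struct pm_qos_request gpu_cpu_lat_req;

/* Driver init: register a request with no constraint yet. */
static void gpu_cpu_qos_init(void)
{
	cpu_latency_qos_add_request(&gpu_cpu_lat_req, PM_QOS_DEFAULT_VALUE);
}

/*
 * While a frame is in flight, require that the CPU be able to respond
 * within roughly one frame time (value in microseconds).
 */
static void gpu_cpu_qos_frame_start(unsigned int frame_time_us)
{
	cpu_latency_qos_update_request(&gpu_cpu_lat_req, frame_time_us);
}

/* Frame completed: drop back to "no constraint". */
static void gpu_cpu_qos_frame_end(void)
{
	cpu_latency_qos_update_request(&gpu_cpu_lat_req, PM_QOS_DEFAULT_VALUE);
}

/* Driver teardown: remove the request entirely. */
static void gpu_cpu_qos_fini(void)
{
	cpu_latency_qos_remove_request(&gpu_cpu_lat_req);
}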