On Mon, Sep 14, 2015 at 09:00:51PM +0100, Steve Muckle wrote:
> Hi Patrick,
>
> On 09/11/2015 04:09 AM, Patrick Bellasi wrote:
> >> It's also worth noting that mobile vendors typically add all sorts of
> >> hacks on top of the existing cpufreq governors which further complicate
> >> policy.
> >
> > Could it be that many of the hacks introduced by vendors are just
> > there to implement a kind of "scenario based" tuning of governors?
> > I mean, depending on the specific use-case they try to refine the
> > value of exposed tunables to improve either performance,
> > responsiveness or power consumption?
>
> From what I've seen I think it's both scenario based tuning (add
> functionality to detect and improve power/perf for say web browsing or
> mp3 playback usecases specifically), as well as tailoring general case
> behavior. Some of these are actually new features in the governor though
> as opposed to just tweaks of existing tunables.
>
> > If this is the case, it means that the currently available governors
> > are missing an important bit of information: what are the best
> > tunables values for a specific (set of) tasks?
>
> Agreed, though I also think those tunable values might also change for a
> given set of tasks in different circumstances.

Could you provide an example?

In my view the per-task support should be exploited only for quite
specialized tasks, which are usually not subject to many different
phases during their execution.

For example, in a graphics rendering pipeline we usually have a host
"controller" task and a set of "worker" tasks running on the processing
elements of the GPU. Since the controller task is usually low intensity,
it does not generate a CPU load big enough to trigger the selection of a
higher OPP. The main issue in this case is that running this task at a
lower OPP can have a noticeable effect on latency, affecting the
performance of the whole graphics pipeline.

For example, on Intel machines I was able to verify that running two
OpenCL workloads concurrently on the same GPU gives better FPS than
running a single workload, mainly because a higher OPP is selected on
the CPU side when two instances are running instead of just one.

In these scenarios, boosting the CPU OPP when a specific task is
runnable can help to get better performance.

> >> The current proposal:
> >>
> >> * sched-dvfs/schedtune: Event driven, CPU usage calculated using
> >> exponential moving average. AFAICS tries to maintain some % of idle
> >> headroom, but if that headroom doesn't exist at task_tick_fair(), goes
> >> to max frequency. Schedtune provides a way to boost/inflate the demand
> >> of individual tasks or overall system demand.
> >
> > That's quite of a good description. One small correction is that, at
> > least in the implementation presented by this RFC, SchedTune is not
> > boosting individual tasks but just the CPU usage.
> > The link with tasks is just that SchedTune knows how much to boost a
> > CPU usage by keeping track of which tasks are runnable on that CPU.
> > However, the utilization signal of each task is not actually modified
> > from the scheduler standpoint.
>
> Ah yes I see what you mean. I was thinking of the cgroup stuff but I see
> that max per-task boost is tracked per-CPU and that CPU's aggregate
> usage is boosted accordingly.

Right, the idea is to have a sort of "boosting inheritance" mechanism:
while two tasks with two different boost values are concurrently
runnable on a CPU, that CPU is boosted according to the maximum boost
value of the two tasks.
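Just to make that aggregation more concrete, here is a minimal sketch in
C of the idea. All the identifiers (boost_cpu, boost_enqueue(),
boosted_cpu_usage(), the 0..100 boost range and the
SCHED_CAPACITY_SCALE based margin) are made up for the example; this is
not the code from the RFC:

#define SCHED_CAPACITY_SCALE    1024
#define NR_CPUS                 8       /* example value */

struct boost_cpu {
        int max_boost;          /* max boost % among runnable tasks */
        int boost_count[101];   /* runnable tasks per boost value   */
};

static struct boost_cpu boost_cpu[NR_CPUS];

/* A task with boost value @task_boost becomes runnable on @cpu. */
static void boost_enqueue(int cpu, int task_boost)
{
        boost_cpu[cpu].boost_count[task_boost]++;
        if (task_boost > boost_cpu[cpu].max_boost)
                boost_cpu[cpu].max_boost = task_boost;
}

/* A task with boost value @task_boost is dequeued from @cpu. */
static void boost_dequeue(int cpu, int task_boost)
{
        int b;

        boost_cpu[cpu].boost_count[task_boost]--;

        /* Recompute the max boost among the tasks still runnable. */
        for (b = 100; b > 0; b--)
                if (boost_cpu[cpu].boost_count[b])
                        break;
        boost_cpu[cpu].max_boost = b;
}

/*
 * CPU usage as seen by the sched-DVFS governor: the original usage plus
 * a margin proportional to the per-CPU max boost and to the spare
 * capacity of the CPU.
 */
static unsigned long boosted_cpu_usage(int cpu, unsigned long usage)
{
        unsigned long margin;

        if (usage >= SCHED_CAPACITY_SCALE)
                return usage;

        margin = (SCHED_CAPACITY_SCALE - usage) * boost_cpu[cpu].max_boost / 100;
        return usage + margin;
}

The important bit is the last function: only the CPU-level signal
consumed by sched-DVFS is inflated, while the per-task utilization
tracked by the scheduler is left untouched, as discussed above.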
> >> This looks a bit like ondemand to me but without the
> >> sampling_down_factor functionality and using per-entity load tracking
> >> instead of a simple window-based aggregate CPU usage.
> >
> > I agree in principle.
> > An important difference worth to notice is that we use an "event
> > based" approach. This means that an enqueue/dequeue can trigger
> > an immediate OPP change.
> > If you consider that commonly ondemand uses a 20ms sample rate while
> > an OPP switch never requires (quite likely) more than 1 or 2 ms, this
> > means that sched-DVFS can be much more reactive on adapting to
> > variable loads.
>
> "Can be" are the important words to me there... it'd be nice to be able
> to control that. Aggressive frequency changes may not be desirable for
> power or performance, even if the transition can be quickly completed.
> The configuration values of min_sample_time and above_hispeed_delay in
> the interactive governor on some recent devices may give clues as to
> whether latency is being intentionally increased on various platforms.

IMO these knobs are more like fixes for a too "coarse grained" solution.
The main limitations of the current CPUFreq governors are that they:
1. use a single set of knobs to track many different tasks
2. use a system-wide view to control all tasks
The solution we get works but, of course, it is an "average" solution
which satisfies only on "average" the requirements of different tasks.

With SchedTune we would like to get a result similar to the one you
describe using min_sample_time and above_hispeed_delay, by somehow
linking the "interpretation" of the PELT signal to the boost value.

Right now sched-DVFS has an idle % headroom which is hardcoded to be
~20% of the current OPP capacity. When the CPU usage crosses that
threshold, we switch straight to the max OPP.

If we could figure out a proper mechanism to link the boost signal to
both the idle % headroom and the target OPP, I think we could achieve
results quite similar to what you can get with the knobs offered by the
interactive governor. The more you boost a task, the bigger the idle %
headroom and the higher the OPP you will jump to.

> The latency/reactiveness of CPU frequency changes are also IMO a product
> of two things - the CPUfreq/sched-dvfs policy, and the task load
> tracking algorithm. I don't have enough experience with the mainline
> task load tracking algorithm yet to know how it will compare with the
> window-based aggregate CPU usage metric used by mainline cpufreq
> governors. But I would imagine it will smooth out some of the aggressive
> nature of sched-dvfs' event-driven approach.

That's right, the PELT signal has a dynamic which is well defined by the
time constants it uses. Task enqueue/dequeue events can happen with a
higher frequency, however these are only "check points" where the most
up-to-date value of a PELT signal can be used to take a decision.

> The hardcoded values in the
> task load tracking algorithm seem concerning though from a tuning
> standpoint.

I agree, that's why we are thinking about the solution described before.
Exploiting the boost value to replace the hardcoded thresholds should
give more flexibility while being per-task defined. Hopefully, tuning
per task can be easier and more effective than selecting a single value
that fits all needs.
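To sketch what I mean by replacing the hardcoded thresholds with
something derived from the boost value (again, every name and number
below, e.g. pick_opp(), the 20% baseline headroom and the linear
scaling, is an assumption made up for the example, not the actual
sched-DVFS code):

/*
 * @cur_idx:  index of the current OPP, frequencies ordered low to high
 * @nr_opps:  number of available OPPs
 * @capacity: compute capacity of the current OPP
 * @usage:    PELT-based CPU usage, same scale as @capacity
 * @boost:    0..100, max boost among the tasks runnable on this CPU
 */
static int pick_opp(int cur_idx, int nr_opps,
                    unsigned long capacity, unsigned long usage, int boost)
{
        /* Idle headroom grows with the boost: ~20% unboosted, ~50% at 100. */
        unsigned long idle_pct = 20 + (30 * boost) / 100;
        unsigned long threshold = capacity - (capacity * idle_pct) / 100;

        if (cur_idx >= nr_opps - 1 || usage <= threshold)
                return cur_idx;         /* enough headroom, or already at max */

        /*
         * Jump higher the more the CPU is boosted: boost=0 just steps to
         * the next OPP, boost=100 goes straight to the maximum one.
         */
        return cur_idx + 1 + ((nr_opps - 2 - cur_idx) * boost) / 100;
}

Whether such a simple linear mapping can match the behavior you get
today from min_sample_time and above_hispeed_delay is exactly the kind
of thing only testing and profiling on real workloads can tell.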
> >> The interactive functionality would require additional knobs. I
> ...
> > However, regarding specifically the latency on OPP changes, there are
> > a couple of extension we was thinking about:
> > 1. link the SchedTune boost value with the % of idle headroom which
> > triggers an OPP increase
> > 2. use the SchedTune boost value to defined the high frequency to jump
> > at when a CPU crosses the % of idle headroom
>
> Hmmm... This may be useful (only testing/profiling would tell) though it
> may be nice to be able to tune these values.

Again, in my view the tuning should be per task, with a single knob. The
value of the knob should then be properly mapped onto other internal
values to obtain a well defined behavior, driven by information shared
with the scheduler, i.e. a PELT signal.

> > These are tunables which allows to parameterize the way the PELT
> > signal for CPU usage is interpreted by the sched-DVFS governor.
> >
> > How such tunables should be exposed and tuned is to be discussed.
> > Indeed, one of the main goals of the sched-DVFS and SchedTune
> > specifically, is to simplify the tuning of a platform by exposing to
> > userspace a reduced number of tunables, preferably just one.
>
> This last point (the desire for a single tunable) is perhaps at the root
> of my main concern. There are users/vendors for whom the current
> tunables are insufficient, resulting in their hacking the governors to
> add more tunables or features in the policy.

We should also consider that we are proposing not only a single tunable
but also a completely different standpoint: no longer a "blind"
system-wide view of the average system behavior, but instead a more
detailed view of task behavior. A single tunable used to "tag" tasks is
maybe not such a limited solution in this design.

> Consolidating CPU frequency and idle management in the scheduler will
> clean things up and probably make things more effective, but I don't
> think it will remove the need for a highly configurable policy.

This can be verified only by starting to use sched-DVFS + SchedTune on
real/synthetic setups, to see which features are eventually missing or
which specific use-cases are not properly managed. If we are able to set
up these experiments, perhaps we will be able to identify a better
design for a scheduler-driven solution.

> I'm curious about the drive for one tunable. Is that something there's
> specifically been a broad call for? Don't get me wrong, I'm all for
> simplification and cleanup, if the flexibility and used features can be
> retained.

All this thread [1] was somehow calling out for a solution which goes in
the direction of a single tunable. The main idea is to exploit the
current effort around EAS. While we are redesigning some parts of the
scheduler to be energy-aware, it is convenient to also include in that
design a knob which allows configuring how much we want to optimize for
reduced power consumption or increased performance.

> >> A separate but related concern - in the (IMO likely, given the above)
> >> case that folks want to tinker with that policy, it now means they're
> >> hacking the scheduler as opposed to a self-contained frequency policy
> >> plugin.
> >
> > I do not agree on that point. SchedTune, as well as sched-DVFS, are
> > framework quit well separated from the scheduler.
> > They are "consumers" of signals usually used by the scheduler, but
> > they are not directly affecting scheduler decisions (at least in the
> > implementation proposed by this RFC).
> Agreed it's not affecting scheduler decision making (not directly). It's
> more just the mixing of the policy into the same code, as margin is
> added in enqueue_task_fair()/task_tick_fair() etc. That one in
> particular would probably be easy to solve. A more difficult one is if
> someone wants to make adjustments to the load tracking algorithm because
> it is driving CPU frequency. That's not so straightforward.

We have plenty of experience on CPUFreq governors and customer-specific
mods, collected over the past years. Don't you think we can exploit that
experience to reason about a fresh new design that satisfies all
requirements while possibly providing a simpler interface?

I agree with you that all the current scenarios must be supported by the
new proposal. We should probably start by listing them and come up with
a set of test cases that allow us to verify where we are wrt the state
of the art.

Tools and benchmarks to verify the proposals and measure regressions and
progress should become more and more used. This is an even more
important requirement to set up a common language aimed at objective
evaluations. Moreover, it has already been requested by scheduler
maintainers in the past.

> > Side effects are possible, of course. For example the selection of an
> ...
> > However, one of the main goals of this proposal is to respond to a
> > couple of long lasting demands (e.g. [1,2]) for:
> > 1. a better integration of CPUFreq with the scheduler, which has all
> > the required knowledge about workloads demands to target both
> > performances and energy efficiency
> > 2. a simple approach to configure a system to care more about
> > performance or energy-efficiency
> >
> > SchedTune addresses mainly the second point. Once SchedTune is
> > integrated with EAS it will provide a support to decide, in an
> > energy-efficient way, how much we want to reduce power or boost
> > performances.
>
> The provided links definitely establish the need for (1) but I am still
> wondering about the motivation for (2), because I don't think it's going
> to be possible to boil everything down to a single slider tunable
> without losing flexibility/functionality.

I see and understand your concerns; still, I am of the idea that we
should try to evaluate a different solution which could simplify the
user-space interface as well as reduce the tuning effort, all without
sacrificing the (measurable) efficiency of the final result.

> cheers,
> Steve

Thanks for this interesting discussion.

Patrick

[1] http://thread.gmane.org/gmane.linux.kernel/1236846/focus=1237796

--
#include <best/regards.h>
Patrick Bellasi