The topic of a single simple power-performance tunable, that is wholly scheduler centric, and has well defined and predictable properties has come up on several occasions in the past. With techniques such as a scheduler driven DVFS, which is now provided mainline via the schedutil governor, we now have a good framework for implementing such a tunable. This patch provides a detailed description of the motivations and design decisions behind the implementation of SchedTune. Cc: Jonathan Corbet <corbet@xxxxxxx> Cc: linux-doc@xxxxxxxxxxxxxxx Signed-off-by: Patrick Bellasi <patrick.bellasi@xxxxxxx> --- Documentation/scheduler/sched-tune.txt | 392 +++++++++++++++++++++++++++++++++ 1 file changed, 392 insertions(+) create mode 100644 Documentation/scheduler/sched-tune.txt diff --git a/Documentation/scheduler/sched-tune.txt b/Documentation/scheduler/sched-tune.txt new file mode 100644 index 0000000..da7b3eb --- /dev/null +++ b/Documentation/scheduler/sched-tune.txt @@ -0,0 +1,392 @@ + SchedTune Support for CFS Tasks + central, scheduler-driven, power-performance control + (EXPERIMENTAL) + +Abstract +======== + +The topic of a single simple power-performance tunable, that is wholly +scheduler centric and has well defined and predictable properties, has come up +on several occasions in the past [1,2]. With techniques such as a scheduler +driven DVFS [3], we now have a good framework for implementing such a tunable. + +Scheduler driven DVFS provides the foundation mechanism on top of which it's +possible to differentiate the performance levels based on the specific +requirements of different use-cases. For example, on mobile systems it's likely +to have demanding use-cases which may benefit from running on an higher OPP +than the one currently chosen by schedutil. + +In the past this behavior was provided by changing the governor or by tuning +its the many different parameters of non-mainline governors, now we can achieve +similar behaviors by tuning schedutil. This approach allows also for a more +fine grained control which can be extended to consider the requirement of +specific tasks. + +This document introduces SchedTune and describes the overall ideas behind its +design and implementation. + +Table of Contents +================= + +1. Motivation +2. Introduction +3. Signal Boosting Strategy +4. OPP selection using boosted CPU utilization +5. Per task group boosting +6. Question and Answers + - What about "auto" mode? + - What about boosting on a congested system? + - How CPUs are boosted when we have tasks with multiple boost values? +7. References + + +1. Motivation +============= + +schedutil [3] is a new event-driven cpufreq governor which allows the scheduler +to select the optimal DVFS Operating Performance Point (OPP) for running a task +allocated to a CPU. The introduction of schedutil enables running workloads at +the most efficient OPPs. + +However, sometimes it may be desired to intentionally boost the performance of +a workload even if that could imply a reasonable increase in energy +consumption. For example, in order to reduce the response time of a task, we +may want to run the task at a higher OPP than the one actually required by its +CPU bandwidth demand. + +This last requirement is especially important if we consider that schedutil can +potentially replace all currently available CPUFreq policies. Since schedutil +is event based, as opposed to the sampling driven governors, it is already more +responsive at selecting the optimal OPP to run tasks allocated to a CPU. +However, just tracking the actual task utilization demand may not be enough +from a performance standpoint. For example, it is not possible to get +behaviors similar to those provided by the "performance" and "powersave" +CPUFreq governors. + +This document describes an implementation of a tunable, stacked on top of +schedutil, which extends its functionality to support task performance +boosting. + +By "performance boosting" we mean the reduction of the time required to +complete a task activation, i.e. the time elapsed from a task wakeup to its +next deactivation (e.g. because it goes back to sleep or it terminates). +For example, if we consider a simple periodic task which executes the same +workload for 5[ms] every 20[ms] while running at a certain OPP, a boosted +execution of that task should be able to complete each of its activations in +less than 5[ms]. + +Previous attempts to introduce such a boosting feature has not been successful, +mainly because of the complexity of the proposed solution. The approach +described in this document exposes a single simple interface to user-space. +This single knob allows the tuning of system wide scheduler behaviours ranging +from energy efficiency at one end through to incremental performance boosting +at the other end. This tunable affects all tasks. A more advanced extension of +this concept is also provided, which uses CGroups to boost the performance of +selected tasks while using the default schedutil behaviors for all others. + +The rest of this document introduces in more details the proposed solution +which has been named SchedTune. + + +2. Introduction +=============== + +SchedTune exposes a simple user-space interface with a single power-performance +tunable: + + /proc/sys/kernel/sched_cfs_boost + +This permits expressing a boost value as an integer in the range [0..100]. + +A value of 0 (default) for a CFS task means that schedutil will attempt +to match compute capacity of the CPU where the task is scheduled to +match its current utilization with a few spare cycles left. A value of +100 means that schedutil will select the highest available OPP. + +The range between 0 and 100 can be set to satisfy other scenarios suitably. +For example to satisfy interactive response or depending on other system events +(battery level, thermal status, etc). + +A CGroup based extension is also provided, which permits further user-space +defined task classification to tune the scheduler for different goals depending +on the specific nature of the task, e.g. background vs interactive vs +low-priority. + +The overall design of the SchedTune module is built on top of "Per-Entity Load +Tracking" (PELT) signals and schedutil by introducing a bias on the Operating +Performance Point (OPP) selection. Each time a task is allocated on a CPU, +schedutil has the opportunity to tune the operating frequency of that CPU to +better match the workload demand. The selection of the actual OPP being +activated is influenced by the global boost value, or the boost value for the +task CGroup when in use. + +This simple biasing approach leverages existing frameworks, which means minimal +modifications to the scheduler, and yet it allows to achieve a range of +different behaviours all from a single simple tunable knob. The only new +concept introduced is that of signal boosting. + + +3. Signal Boosting Strategy +=========================== + +The whole PELT machinery works based on the value of a few utilization tracking +signals which basically track the CPU bandwidth requirements for tasks and the +capacity of CPUs. The basic idea behind the SchedTune knob is to artificially +inflate some of these utilization tracking signals to make a task or RQ appears +more demanding than it actually is. + +Which signals have to be inflated depends on the specific "consumer". However, +independently from the specific (signal, consumer) pair, it is important to +define a simple and possibly consistent strategy for the concept of boosting a +signal. + +A boosting strategy defines how the "abstract" user-space defined +sched_cfs_boost value is translated into an internal "margin" value to be added +to a signal to get its inflated value: + + margin := boosting_strategy(sched_cfs_boost, signal) + boosted_signal := signal + margin + +Different boosting strategies were identified and analyzed before selecting the +one found to be most effective. The next section describes the details of this +boosting strategy. + +Signal Proportional Compensation (SPC) +-------------------------------------- + +In this boosting strategy the sched_cfs_boost value is used to compute a margin +which is proportional to the complement of the original signal. This +complement is defined as the delta from the actual value of a signal and its +possible maximum value. + +Since the tunable implementation uses signals which have SCHED_CAPACITY_SCALE +as the maximum possible value, the margin becomes: + + margin := sched_cfs_boost * (SCHED_CAPACITY_SCALE - signal) + +Using this boosting strategy: +- a 100% sched_cfs_boost means that the signal is scaled to the maximum value +- each value in the range of sched_cfs_boost effectively inflates the signal in + question by a quantity which is proportional to the maximum value. + +For example, by applying the SPC boosting strategy to the selection of the OPP +to run a task it is possible to achieve these behaviors: + +- 0% boosting: run the task at the minimum OPP required by its workload +- 100% boosting: run the task at the maximum OPP available for the CPU +- 50% boosting: run at the half-way OPP between minimum and maximum + +Which means that, at 50% boosting, a task will be scheduled to run at half of +the maximum theoretically achievable performance on the specific target +platform. + +For example, assuming a 50% task which runs for 8ms every 16ms, +ignoring any margins built into schedutil we should get: + + Boost OPP Expected Completion Time + 0% 50% 16.00 ms + 50% 75% 16*50/75 = 10.67 ms + 100% 100% 8.00 ms + +The reduction of the completion time of boosting 100% instead of 50% is +only 25%. + +A graphical representation of an SPC boosted signal is represented in the +following figure where: + a) "-" represents the original signal + b) "b" represents a 50% boosted signal + c) "p" represents a 100% boosted signal + + + | SCHED_CAPACITY_SCALE + +-----------------------------------------------------------------+ + |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp + | + | boosted_signal + | bbbbbbbbbbbbbbbbbbbbbbbb + | + | original signal + | bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+ + | | + |bbbbbbbbbbbbbbbbbb | + | | + | | + | | + | +-----------------------+ + | | + | | + | | + |------------------+ + | + | + +-----------------------------------------------------------------------> + +The plot above shows a ramped utilization signal (titled 'original_signal') and +its boosted equivalent. For each step of the original signal the boosted signal +corresponding to a 50% boost is midway from the original signal and the upper +bound. Boosting by 100% generates a boosted signal which is always saturated to +the upper bound. + + +4. OPP selection using boosted CPU utilization +============================================== + +It is worth calling out that the implementation does not introduce any new +utilization signals. Instead, it provides an API to tune existing signals. This +tuning is done on demand and only in scheduler code paths where it is sensible +to do so. The new API calls are defined to return either the default signal or +a boosted one, depending on the value of sched_cfs_boost. This is a clean and +non invasive modification of the existing existing code paths. + +The signal representing a CPU's utilization is boosted according to the +previously described SPC boosting strategy. To schedutil, this allows a CPU +(ie CFS run-queue) to appear more used then it actually is. + +Thus, with the sched_cfs_boost enabled we have the following main functions to +get the current utilization of a CPU: + + cpu_util() + boosted_cpu_util() + +The new boosted_cpu_util() is similar to the first but returns a boosted +utilization signal which is a function of the sched_cfs_boost value. + +This function is used in the CFS scheduler code paths where schedutil needs to +decide the OPP to run a CPU at. +For example, this allows selecting the highest OPP for a CPU which has +the boost value set to 100%. + + +5. Per task group boosting +========================== + +The availability of a single knob which is used to boost all tasks in the +system is certainly a simple solution but it quite likely doesn't fit many +use-case, especially in the mobile device space. + +For example, on battery powered devices there usually are many background +services which are long running and need energy efficient scheduling. On the +other hand, some applications are more performance sensitive and require an +interactive response and/or maximum performance, regardless of the energy cost. +To better service such scenarios, the SchedTune implementation has an extension +that provides a more fine grained boosting interface. + +A new CGroup controller, namely "schedtune", could be enabled which allows to +defined and configure task groups with different boosting values. Tasks that +require special performance can be put into separate CGroups. The value of the +boost associated with the tasks in this group can be specified using a single +knob exposed by the CGroup controller: + + schedtune.boost + +This knob allows the definition of a boost value that is to be used for +SPC boosting of all tasks attached to this group. + +The current schedtune controller implementation is really simple and has these +main characteristics: + + 1) It is only possible to create 1 level depth hierarchies + + The root control groups define the system-wide boost value to be applied + by default to all tasks. Its direct subgroups are named "boost groups" and + they define the boost value for specific set of tasks. + Further nested subgroups are not allowed since they do not have a sensible + meaning from a user-space standpoint. + + 2) It is possible to define only a limited number of "boost groups" + + This number is defined at compile time and by default configured to 16. + This is a design decision motivated by two main reasons: + a) In a real system we do not expect use-cases with more then few + boost groups. For example, a reasonable collection of groups could be + just "background", "interactive" and "performance". + b) It simplifies the implementation considerably, especially for the code + which has to compute the per CPU boosting once there are multiple + RUNNABLE tasks with different boost values. + +Such a simple design should allow servicing the main utilization scenarios +identified so far. It provides a simple interface which can be used to manage +the power-performance of all tasks or only selected tasks. Moreover, this +interface can be easily integrated by user-space run-times (e.g. Android, +ChromeOS) to implement a QoS solution for task boosting based on tasks +classification, which has been a long standing requirement. + +Setup and usage +--------------- + +0. Use a kernel with CGROUP_SCHED_TUNE support enabled + +1. Check that the "schedtune" CGroup controller is available: + + root@derkdell:~# cat /proc/cgroups + #subsys_name hierarchy num_cgroups enabled + cpuset 0 1 1 + cpu 0 1 1 + schedtune 0 1 1 + +2. Mount a tmpfs to create the CGroups mount point (Optional) + + root@derkdell:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup + +3. Mount the "schedtune" controller + + root@derkdell:~# mkdir /sys/fs/cgroup/stune + root@derkdell:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune + +4. Setup the system-wide boost value (Optional) + + If not configured the root control group has a 0% boost value, which + basically disables boosting for all tasks in the system thus running in + an energy-efficient mode. Let assume SYSBOOST defines the default boost + value to be used for all tasks: + + root@derkdell:~# echo $SYSBOOST > /sys/fs/cgroup/stune/schedtune.boost + +5. Create task groups and configure their specific boost value (Optional) + + For example here we create a "performance" boost group configure to boost + all its tasks to 100% + + root@derkdell:~# mkdir /sys/fs/cgroup/stune/performance + root@derkdell:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost + +6. Move tasks into the boost group + + For example, the following moves the tasks with PID $TASKPID (and all its + tasks) into the "performance" boost group. + + root@derkdell:~# echo "TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs + +This simple configuration allows only the tasks of the $TASKPID task to run, +when needed, at the highest OPP in the most capable CPU of the system. + + +6. Question and Answers +======================= + +What about "auto" mode? +----------------------- + +The 'auto' mode as described in previous propose approaches can be implemented +by interfacing SchedTune with some suitable user-space element. This element +could use the exposed system-wide or cgroup based interface. + +How are multiple groups of tasks with different boost values managed? +--------------------------------------------------------------------- + +The current SchedTune implementation keeps track of the boosted RUNNABLE tasks +on a CPU. Once schedutil selects the OPP for a CPU, the utilization of that CPU +is boosted with a value which is the maximum of the boost values of all the +currently RUNNABLE tasks in that CPU. + +This allows schedutil to boost a CPU only while there are boosted tasks ready +to run and switch back to the energy efficient mode as soon as the last boosted +task is dequeued. + + +7. References +============= +[1] http://lwn.net/Articles/552889 +[2] http://lkml.org/lkml/2012/5/18/91 +[3] http://lkml.org/lkml/2016/3/16/559 + -- 2.10.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html