On Fri, May 8, 2020 at 1:19 PM Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>
> On 07/05/2020 20:05, John Mathew wrote:
>
> [...]
>
> > diff --git a/Documentation/scheduler/cfs-overview.rst b/Documentation/scheduler/cfs-overview.rst
> > new file mode 100644
> > index 000000000000..b717f2d3e340
> > --- /dev/null
> > +++ b/Documentation/scheduler/cfs-overview.rst
> > @@ -0,0 +1,113 @@
> > +.. SPDX-License-Identifier: GPL-2.0+
> > +
> > +=============
> > +CFS Overview
> > +=============
> > +
> > +Linux 2.6.23 introduced a modular scheduler core and a Completely Fair
> > +Scheduler (CFS) implemented as a scheduling module. A brief overview of the
> > +CFS design is provided in :doc:`sched-design-CFS`
> > +
> > +In addition there have been many improvements to the CFS, a few of which are
> > +
> > +**Thermal Pressure**:
> > +cpu_capacity initially reflects the maximum possible capacity of a CPU.
> > +Thermal pressure on a CPU means this maximum possible capacity is
> > +unavailable due to thermal events. Average thermal pressure for a CPU
> > +is now subtracted from its maximum possible capacity so that cpu_capacity
> > +reflects the remaining maximum capacity.
> > +
>
> I agree with what Valentin mentioned already. Instead of describing
> recent patch-sets, the functionality which was added (or enhanced) by
> them should be depicted instead.
>
> E.g. in case of 'Thermal Pressure' this would be the "scale CPU
> capacity" mechanism for CFS so it knows how much CPU capacity is left
> for its use after higher priority sched classes (RT, DL), IRQs and
> 'Thermal Pressure' have reduced the 'original' CPU capacity.

Changed in v4 patch version.

> [...]
>
> > +**Load balancing algorithm Reworked**:
> > +The load balancing algorithm contained some heuristics which became
> > +meaningless since the rework of the scheduler's metrics like the
> > +introduction of PELT.
> > +The new load balancing algorithm fixes several
> > +pending wrong tasks placement
> > +
> > + * the 1 task per CPU case with asymmetric system
> > + * the case of CFS task preempted by other class
> > + * the case of tasks not evenly spread on groups with spare capacity
> > +
> > +Also the load balance decisions have been consolidated in the 3 separate
> > +functions.
>
> What are those 3 separate functions? I guess you refer to the 3
> (actually 4) migration types (migrate_task, migrate_util, migrate_load,
> (migrate_misfit)).

The three functions are:

 * update_sd_pick_busiest() selects the busiest sched_group.
 * find_busiest_group() checks if there is an imbalance between the local
   and the busiest group.
 * calculate_imbalance() decides what has to be moved.

Added them in v4 of patchset.

>
> [...]
>
> > diff --git a/Documentation/scheduler/overview.rst b/Documentation/scheduler/overview.rst
> > index aee16feefc61..f2cb0c901208 100644
> > --- a/Documentation/scheduler/overview.rst
> > +++ b/Documentation/scheduler/overview.rst
> > @@ -3,3 +3,269 @@
> >  ====================
> >   Scheduler overview
> >  ====================
> > +
> > +Linux kernel implements priority-based scheduling. More than one process are
> > +allowed to run at any given time and each process is allowed to run as if it
> > +were the only process on the system. The process scheduler coordinates which
> > +process runs when. In that context, it has the following tasks:
> > +
> > +* share CPU cores equally among all currently running processes.
> > +* pick appropriate process to run next if required, considering scheduling
> > +  class/policy and process priorities.
> > +* balance processes between multiple cores in SMP systems.
> > +
> > +The scheduler attempts to be responsive for I/O bound processes and efficient
> > +for CPU bound processes. The scheduler also applies different scheduling
> > +policies for real time and normal processes based on their respective
> > +priorities.
> > +Higher priorities in the kernel have a numerical smaller
> > +value. Real time priorities range from 1 (highest) – 99 whereas normal
> > +priorities range from 100 – 139 (lowest). SCHED_DEADLINE tasks have negative
> > +priorities, reflecting the fact that any of them has higher priority than
> > +RT and NORMAL/BATCH tasks.
>
> s/RT/SCHED_FIFO, SCHED_RR
> s/NORMAL/SCHED_NORMAL
> s/BATCH/SCHED_BATCH

Changed in patchset v4.

> SCHED_IDLE tasks can be set in the 100 – 139 range too but IMHO are
> treated as 139 (nice 19). Their priority doesn't matter since they get
> minimal weight WEIGHT_IDLEPRIO=3 anyway.
>
> And then there are the maintenance sched classes, idle sched class and
> its idle tasks 'swapper/X' with priority 120 (was MAX_PRIO) as well as
> the stop sched class and its stopper tasks 'migration/X' which disguise
> as SCHED_FIFO with priority 139.
> It might be that people find this too detailed, but it helps when you
> try to understand how it all works.

Added in patchset v4.

> [...]
>
> > diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
> > index ede1a30a6894..f311abe5b711 100644
> > --- a/Documentation/scheduler/index.rst
> > +++ b/Documentation/scheduler/index.rst
> > @@ -17,10 +17,13 @@ specific implementation differences.
> >     :maxdepth: 2
> >
> >     overview
> > +   sched-data-structs
> > +   cfs-overview
> >     sched-design-CFS
> >     sched-features
> > -   arch-specific.rst
> > -   sched-debugging.rst
> > +   arch-specific
> > +   sched-debugging
> > +   scheduler-api
> >
> >  .. only:: subproject and html
> >
> > +                   +------------------------------------+
> > +                   |            TASK_RUNNING            |
> > +  +--------------> |           (Ready to run)           | <--+
> > +  |                +------------------------------------+    |
> > +  |                  |                                       |
> > +  |                  | schedule() calls context_switch()     | task is preempted
> > +  |                  v                                       |
> > +  |                +------------------------------------+    |
> > +  |                |            TASK_RUNNING            |    |
> > +  |                |             (Running)              | ---+
> > +  | event occurred +------------------------------------+
> > +  |                  |
> > +  |                  | task needs to wait for event
> > +  |                  v
> > +  |                +------------------------------------+
> > +  |                |         TASK_INTERRUPTIBLE         |
> > +  |                |        TASK_UNINTERRUPTIBLE        |
> > +  +----------------|           TASK_WAKEKILL            |
> > +                   +------------------------------------+
> > +                     |
> > +                     | task exits via do_exit()
> > +                     v
> > +                   +------------------------------+
> > +                   |          TASK_DEAD           |
> > +                   |         EXIT_ZOMBIE          |
> > +                   +------------------------------+
> > +
> > +
> > +Scheduler provides tracepoints tracing all major events of the scheduler.
> > +The tracepoints are defined in ::
> > +
> > +  include/trace/events/sched.h
> > +
> > +Using these tracepoints it is possible to model the scheduler state transition
>
> I would refer to them as trace events.

Changed in patchset v4.

> The scheduler started to export (bare) trace points for PELT and
> overutilization (e.g. pelt_cfs_tp) (commit ba19f51fcb54 "sched/debug:
> Add new tracepoints to track PELT at rq level"). They are not bound to a
> trace event and so they don't expose any internal data structures.
>
> [...]
>
> > +Virtual Runtime
> > +~~~~~~~~~~~~~~~~~
> > +Virtual Run Time or vruntime is the amount of time a task has spent running
> > +on the CPU. It is updated periodically by scheduler_tick(). Tasks are stored
> > +in the CFS scheduling class rbtree sorted by vruntime. scheduler_tick() calls
> > +corresponding hook of CFS which first updates the runtime statistics of the
> > +currently running task and checks if the current task needs to be preempted.
> > +It also updates the vruntime of the task based on the formula ::
> > +
> > +  vruntime += delta_exec * (NICE_0_LOAD/curr->load.weight);
> > +
> > +where:
> > +
> > +* delta_exec is the time in nanoseconds spent by the task since the last time
> > +  vruntime was updated.
> > +* NICE_0_LOAD is the load of a task with normal priority.
> > +* curr is the sched_entity instance of the cfs_rq struct of the currently
> > +  running task.
> > +* load.weight: sched_entity load_weight. load_weight is the encoding of
> > +  the tasks priority and vruntime. The load of a task is the metric
> > +  indicating the number of CPUs needed to make satisfactory progress on its
> > +  job. Load of a task influences the time a task spends on the CPU and also
> > +  helps to estimate the overall CPU load which is needed for load balancing.
>
> load.weight is replaced by PELT in load balancing.

Removed this section as it is already described in sched-design-CFS

>
> > +  Priority of the task is not enough for the scheduler to estimate the
> > +  vruntime of a process. So priority value must be mapped to the capacity of
> > +  the standard CPU which is done in the array :c:type:`sched_prio_to_weight[]`.
> > +  The array contains mappings for the nice values from -20 to 19. Nice value
> > +  0 is mapped to 1024. Each entry advances by approximately 1.25 which means
> > +  for every increment in nice value the task gets 10% less CPU and vice versa.
>
> SCHED_IDLE gets minimal weight (WEIGHT_IDLEPRIO=3)

Thanks for your review.

> [...]