On Fri, May 8, 2020 at 1:19 PM Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>
> On 07/05/2020 20:05, John Mathew wrote:
>
> [...]
>
> > diff --git a/Documentation/scheduler/cfs-overview.rst b/Documentation/scheduler/cfs-overview.rst
> > new file mode 100644
> > index 000000000000..b717f2d3e340
> > --- /dev/null
> > +++ b/Documentation/scheduler/cfs-overview.rst
> > @@ -0,0 +1,113 @@
> > +.. SPDX-License-Identifier: GPL-2.0+
> > +
> > +=============
> > +CFS Overview
> > +=============
> > +
> > +Linux 2.6.23 introduced a modular scheduler core and a Completely Fair
> > +Scheduler (CFS) implemented as a scheduling module. A brief overview of the
> > +CFS design is provided in :doc:`sched-design-CFS`
> > +
> > +In addition there have been many improvements to the CFS, a few of which are
> > +
> > +**Thermal Pressure**:
> > +cpu_capacity initially reflects the maximum possible capacity of a CPU.
> > +Thermal pressure on a CPU means this maximum possible capacity is
> > +unavailable due to thermal events. Average thermal pressure for a CPU
> > +is now subtracted from its maximum possible capacity so that cpu_capacity
> > +reflects the remaining maximum capacity.
> > +
>
> I agree with what Valentin mentioned already. Instead of describing
> recent patch-sets, the functionality which was added (or enhanced) by
> them should be depicted instead.
>
> E.g. in case of 'Thermal Pressure' this would be the "scale CPU
> capacity" mechanism for CFS so it knows how much CPU capacity is left
> for its use after higher priority sched classes (RT, DL), IRQs and
> 'Thermal Pressure' have reduced the 'original' CPU capacity.

Changed in v4 patch version.

> [...]
>
> > +**Load balancing algorithm Reworked**:
> > +The load balancing algorithm contained some heuristics which became
> > +meaningless since the rework of the scheduler's metrics like the
> > +introduction of PELT.
> > +The new load balancing algorithm fixes several
> > +pending wrong tasks placement
> > +
> > + * the 1 task per CPU case with asymmetric system
> > + * the case of CFS task preempted by other class
> > + * the case of tasks not evenly spread on groups with spare capacity
> > +
> > +Also the load balance decisions have been consolidated in the 3 separate
> > +functions.
>
> What are those 3 separate functions? I guess you refer to the 3
> (actually 4) migration types (migrate_task, migrate_util, migrate_load,
> (migrate_misfit)).

The three functions are:

 * update_sd_pick_busiest() selects the busiest sched_group.
 * find_busiest_group() checks if there is an imbalance between the local
   and the busiest group.
 * calculate_imbalance() decides what has to be moved.

Added them in v4 of patchset.

>
> [...]
>
> > diff --git a/Documentation/scheduler/overview.rst b/Documentation/scheduler/overview.rst
> > index aee16feefc61..f2cb0c901208 100644
> > --- a/Documentation/scheduler/overview.rst
> > +++ b/Documentation/scheduler/overview.rst
> > @@ -3,3 +3,269 @@
> >  ====================
> >   Scheduler overview
> >  ====================
> > +
> > +Linux kernel implements priority-based scheduling. More than one process are
> > +allowed to run at any given time and each process is allowed to run as if it
> > +were the only process on the system. The process scheduler coordinates which
> > +process runs when. In that context, it has the following tasks:
> > +
> > +* share CPU cores equally among all currently running processes.
> > +* pick appropriate process to run next if required, considering scheduling
> > +  class/policy and process priorities.
> > +* balance processes between multiple cores in SMP systems.
> > +
> > +The scheduler attempts to be responsive for I/O bound processes and efficient
> > +for CPU bound processes. The scheduler also applies different scheduling
> > +policies for real time and normal processes based on their respective
> > +priorities.
> > +Higher priorities in the kernel have a numerical smaller
> > +value. Real time priorities range from 1 (highest) – 99 whereas normal
> > +priorities range from 100 – 139 (lowest). SCHED_DEADLINE tasks have negative
> > +priorities, reflecting the fact that any of them has higher priority than
> > +RT and NORMAL/BATCH tasks.
>
> s/RT/SCHED_FIFO, SCHED_RR
> s/NORMAL/SCHED_NORMAL
> s/BATCH/SCHED_BATCH

Changed in patchset v4.

> SCHED_IDLE tasks can be set in the 100 – 139 range too but IMHO are
> treated as 139 (nice 19). Their priority doesn't matter since they get
> minimal weight WEIGHT_IDLEPRIO=3 anyway.
>
> And then there are the maintenance sched classes, idle sched class and
> its idle tasks 'swapper/X' with priority 120 (was MAX_PRIO) as well as
> the stop sched class and its stopper tasks 'migration/X' which disguise
> as SCHED_FIFO with priority 139.
> It might be that people find this too detailed, but it helps when you
> try to understand how it all works.

Added in patchset v4.

> [...]
>
> > diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
> > index ede1a30a6894..f311abe5b711 100644
> > --- a/Documentation/scheduler/index.rst
> > +++ b/Documentation/scheduler/index.rst
> > @@ -17,10 +17,13 @@ specific implementation differences.
> >     :maxdepth: 2
> >
> >     overview
> > +   sched-data-structs
> > +   cfs-overview
> >     sched-design-CFS
> >     sched-features
> > -   arch-specific.rst
> > -   sched-debugging.rst
> > +   arch-specific
> > +   sched-debugging
> > +   scheduler-api
> >
> >  .. only:: subproject and html
> >
> > +                   +------------------------------------+
> > +                   |            TASK_RUNNING            |
> > +  +--------------> |           (Ready to run)           | <--+
> > +  |                +------------------------------------+    |
> > +  |                  |                                       |
> > +  |                  | schedule() calls context_switch()     | task is preempted
> > +  |                  v                                       |
> > +  |                +------------------------------------+    |
> > +  |                |            TASK_RUNNING            |    |
> > +  |                |             (Running)              | ---+
> > +  | event occurred +------------------------------------+
> > +  |                  |
> > +  |                  | task needs to wait for event
> > +  |                  v
> > +  |                +------------------------------------+
> > +  |                |         TASK_INTERRUPTIBLE         |
> > +  |                |        TASK_UNINTERRUPTIBLE        |
> > +  +----------------|           TASK_WAKEKILL            |
> > +                   +------------------------------------+
> > +                     |
> > +                     | task exits via do_exit()
> > +                     v
> > +                   +------------------------------+
> > +                   |          TASK_DEAD           |
> > +                   |         EXIT_ZOMBIE          |
> > +                   +------------------------------+
> > +
> > +
> > +Scheduler provides tracepoints tracing all major events of the scheduler.
> > +The tracepoints are defined in ::
> > +
> > +  include/trace/events/sched.h
> > +
> > +Using these tracepoints it is possible to model the scheduler state transition
>
> I would refer to them as trace events.

Changed in patchset v4.

> The scheduler started to export (bare) trace points for PELT and
> overutilization (e.g. pelt_cfs_tp) (commit ba19f51fcb54 "sched/debug:
> Add new tracepoints to track PELT at rq level"). They are not bound to a
> trace event and so they don't expose any internal data structures.
>
> [...]
>
> > +Virtual Runtime
> > +~~~~~~~~~~~~~~~~~
> > +Virtual Run Time or vruntime is the amount of time a task has spent running
> > +on the CPU. It is updated periodically by scheduler_tick(). Tasks are stored
> > +in the CFS scheduling class rbtree sorted by vruntime. scheduler_tick() calls
> > +corresponding hook of CFS which first updates the runtime statistics of the
> > +currently running task and checks if the current task needs to be preempted.
> > +It also updates the vruntime of the task based on the formula ::
> > +
> > +  vruntime += delta_exec * (NICE_0_LOAD/curr->load.weight);
> > +
> > +where:
> > +
> > +* delta_exec is the time in nanoseconds spent by the task since the last time
> > +  vruntime was updated.
> > +* NICE_0_LOAD is the load of a task with normal priority.
> > +* curr is the sched_entity instance of the cfs_rq struct of the currently
> > +  running task.
> > +* load.weight: sched_entity load_weight. load_weight is the encoding of
> > +  the tasks priority and vruntime. The load of a task is the metric
> > +  indicating the number of CPUs needed to make satisfactory progress on its
> > +  job. Load of a task influences the time a task spends on the CPU and also
> > +  helps to estimate the overall CPU load which is needed for load balancing.
>
> load.weight is replaced by PELT in load balancing.

Removed this section as it is already described in sched-design-CFS

>
> > +  Priority of the task is not enough for the scheduler to estimate the
> > +  vruntime of a process. So priority value must be mapped to the capacity of
> > +  the standard CPU which is done in the array :c:type:`sched_prio_to_weight[]`.
> > +  The array contains mappings for the nice values from -20 to 19. Nice value
> > +  0 is mapped to 1024. Each entry advances by approximately 1.25 which means
> > +  for every increment in nice value the task gets 10% less CPU and vice versa.
>
> SCHED_IDLE gets minimal weight (WEIGHT_IDLEPRIO=3)

Thanks for your review.

> [...]