On Fri, May 8, 2020 at 12:15 AM Valentin Schneider <valentin.schneider@xxxxxxx> wrote: > > > On 07/05/20 19:05, John Mathew wrote: > > Add documentation for > > -scheduler overview > > -scheduler state transtion > > -CFS overview > > -scheduler data structs > > > > Add rst for scheduler APIs and modify sched/core.c > > to add kernel-doc comments. > > > > Suggested-by: Lukas Bulwahn <lukas.bulwahn@xxxxxxxxx> > > Co-developed-by: Mostafa Chamanara <mostafa.chamanara@xxxxxxxxxxxx> > > Signed-off-by: Mostafa Chamanara <mostafa.chamanara@xxxxxxxxxxxx> > > Co-developed-by: Oleg Tsymbal <oleg.tsymbal@xxxxxxxxxx> > > Signed-off-by: Oleg Tsymbal <oleg.tsymbal@xxxxxxxxxx> > > Signed-off-by: John Mathew <john.mathew@xxxxxxxxxx> > > --- > > Documentation/scheduler/cfs-overview.rst | 113 ++++++++ > > Documentation/scheduler/index.rst | 7 +- > > Documentation/scheduler/overview.rst | 266 ++++++++++++++++++ > > .../scheduler/sched-data-structs.rst | 253 +++++++++++++++++ > > Documentation/scheduler/scheduler-api.rst | 31 ++ > > kernel/sched/core.c | 28 +- > > kernel/sched/sched.h | 169 ++++++++++- > > 7 files changed, 858 insertions(+), 9 deletions(-) > > create mode 100644 Documentation/scheduler/cfs-overview.rst > > create mode 100644 Documentation/scheduler/sched-data-structs.rst > > create mode 100644 Documentation/scheduler/scheduler-api.rst > > > > diff --git a/Documentation/scheduler/cfs-overview.rst b/Documentation/scheduler/cfs-overview.rst > > new file mode 100644 > > index 000000000000..b717f2d3e340 > > --- /dev/null > > +++ b/Documentation/scheduler/cfs-overview.rst > > @@ -0,0 +1,113 @@ > > +.. SPDX-License-Identifier: GPL-2.0+ > > + > > +============= > > +CFS Overview > > +============= > > + > > +Linux 2.6.23 introduced a modular scheduler core and a Completely Fair > > +Scheduler (CFS) implemented as a scheduling module. A brief overview of the > > +CFS design is provided in :doc:`sched-design-CFS` > > + > > +In addition there have been many improvements to the CFS, a few of which are > > <snip> > > > +**Consider misfit tasks when load-balancing**: > > +On asymmetric CPU capacity systems load intensive tasks can end up on > > +CPUs that don't suit their compute demand. In this scenario 'misfit' > > +tasks are migrated to CPUs with higher compute capacity to ensure better > > +throughput. A new group_type: group_misfit_task is added and indicates this > > +scenario. Tweaks to the load-balance code are done to make the migrations > > +happen. Misfit balancing is done between a source group of lower per-CPU > > +capacity and destination group of higher compute capacity. Otherwise, misfit > > +balancing is ignored. > > + > > I'm not sure how useful this cherry-picked git history is, but that's > making me think we may want to have an asymmetric CPU capacity scheduling > section somewhere - I've had to explain what we do here a few times > already, and it's true that unless you look at the git history or at the > code it isn't laid out for you (with the exception of EAS that was blessed > with Quentin's writings). Added a new rst for Capacity Aware scheduling. It contains a brief description according to Dietmar. That will be a good place for you to contribute. > > It would also be an opportunity to have one place to (at least briefly) > describe what the different sched classes do wrt capacity asymmetry - CFS > does one thing, RT now does one thing (see Qais' work), and DL will > hopefully soon follow (see Dietmar's work). > > I'd be happy to contribute (some of) that, if it can be deemed useful (I > personally think it might). > > > +========================= > > +Scheduler Data Structures > > +========================= > > + > > +The main parts of the Linux scheduler are: > > + > > +Runqueue > > +~~~~~~~~ > > + > > +:c:type:`struct rq <rq>` is the central data structure of process > > +scheduling. It keeps track of tasks that are in a runnable state assigned > > +for a particular processor. Each CPU has its own run queue and stored in a > > +per CPU array:: > > + > > + DEFINE_PER_CPU(struct rq, runqueues); > > + > > +Access to the queue requires locking and lock acquire operations must be > > +ordered by ascending runqueue. Macros for accessing and locking the runqueue > > +are provided in:: > > + > > + kernel/sched/sched.h > > + > > +The runqueue contains scheduling class specific queues and several scheduling > > +statistics. > > + > > +Scheduling entity > > +~~~~~~~~~~~~~~~~~ > > +Scheduler uses scheduling entities which contain sufficient information to > > +actually accomplish the scheduling job of a task or a task-group. The > > +scheduling entity may be a group of tasks or a single task. Every task is > > +associated with a sched_entity structure. CFS adds support for nesting of > > +tasks and task groups. Each scheduling entity may be run from its parents > > +runqueue. The scheduler traverses the sched_entity hierarchy to pick the > > +next task to run on the CPU. The entity gets picked up from the cfs_rq on > > +which it is queued and its time slice is divided among all the tasks on its my_q. > > + > > +Virtual Runtime > > +~~~~~~~~~~~~~~~~~ > > That probably should be under CFS specific stuff. Agree, infact this is already described in sched-design-CFS.rst. > > > +Virtual Run Time or vruntime is the amount of time a task has spent running > > +on the CPU. It is updated periodically by scheduler_tick(). Tasks are stored > > +in the CFS scheduling class rbtree sorted by vruntime. scheduler_tick() calls > > +corresponding hook of CFS which first updates the runtime statistics of the > > +currently running task and checks if the current task needs to be preempted. > > +vruntime of the task based on the formula :: > > + > > + vruntime += delta_exec * (NICE_0_LOAD/curr->load.weight); > > + > > +where: > > + > > +* delta_exec is the time in nanoseconds spent by the task since the last time > > + vruntime was updated. > > There's the whole task_clock() shenanigans there, i.e. depending on your > config knobs that delta may not include IRQ or paravirt time; though that > doesn't necessarily help in understanding vruntime, so it might be best to > leave it at that. > > > +* NICE_0_LOAD is the load of a task with normal priority. > > s/normal/default/ > > > +* curr is the shed_entity instance of the cfs_rq struct of the currently > > + running task. > > +* load.weight: sched_entity load_weight. load_weight is the encoding of > > + the tasks priority and vruntime. The load of a task is the metric > > + indicating the number of CPUs needed to make satisfactory progress on its > > + job. Load of a task influences the time a task spends on the CPU and also > > + helps to estimate the overall CPU load which is needed for load balancing. > > Careful not to mix up se.load and se.avg.load_avg. Load balancing decisions > are based (amongst other things) on (sums of) se.avg.load_avg, i.e. the > 'load' signal generated via PELT. It is indeed weighted by priority, but > se.load is *not* that load signal. > > It's also the first time I see that bit about load being ~how many CPUs > that task needs. That's peculiar; you can't really split a task up to > schedule it on other CPUs, so how would that help? And, again, the load > signal being weighted by priority makes this even weirder. This section was removed as sched-design-CFS.rst describes it. > > > + Priority of the task is not enough for the scheduler to estimate the > > + vruntime of a process. So priority value must be mapped to the capacity of > > + the standard CPU which is done in the array :c:type:`sched_prio_to_weight[]`. > > + The array contains mappings for the nice values from -20 to 19. Nice value > > + 0 is mapped to 1024. Each entry advances by approximately 1.25 which means > > + for every increment in nice value the task gets 10% less CPU and vice versa. > > + > > +Scheduler classes > > +~~~~~~~~~~~~~~~~~ > > +It is an extensible hierarchy of scheduler modules. The modules encapsulate > > +scheduling policy details. They are called from the core code which is > > +independent. Scheduling classes are implemented through the sched_class > > +structure. dl_sched_class for deadline scheduler, fair_sched_class for CFS > > +and rt_sched_class for RT are implementations of this class. > > There's stop (at the top) and idle (at the bottom), but admittedly they > are somewhat special. Added in v4 patch version. > > > + > > +The important methods of scheduler class are: > > + > > +enqueue_task and dequeue_task > > + These functions are used to put and remove tasks from the runqueue > > + respectively. The function takes the runqueue, the task which needs to > > + be enqueued/dequeued and a bit mask of flags. The main purpose of the > > + flags is to describe why the enqueue or dequeue is being called. > > + The different flags used are described in :: > > + > > + kernel/sched/sched.h > > + > > + enqueue_task and dequeue_task are called for following purposes. > > + > > + - When waking a newly created task for the first time. Called with > > + ENQUEUE_NOCLOCK > > + - When migrating a task from one CPU's runqueue to another. Task will be > > + first dequeued from its old runqueue, new CPU will be added to the > > + task struct, runqueue of the new CPU will be retrieved and task is > > + then enqueued on this new runqueue. > > > > + - When do_set_cpus_allowed() is called to change a tasks CPU affinity. If > > + the task is queued on a runqueue, it is first dequeued with the > > + DEQUEUE_SAVE and DEQUEUE_NOCLOCK flags set. The set_cpus_allowed() > > + function of the corresponding scheduling class will be called. > > + enqueue_task() is then called with ENQUEUE_RESTORE and ENQUEUE_NOCLOCK > > + flags set. > > + - When changing the priority of a task using rt_mutex_setprio(). This > > + function implements the priority inheritance logic of the RT mutex > > + code. This function changes the effective priority of a task which may > > + in turn change the scheduling class of the task. If so enqueue_task is > > + called with flags corresponding to each class. > > + - When user changes the nice value of the task. If the task is queued on > > + a runqueue, it first needs to be dequeued, then its load weight and > > + effective priority needs to be set. Following which the task is > > + enqueued with ENQUEUE_RESTORE and ENQUEUE_NOCLOCK flags set. > > + - When __sched_setscheduler() is called. This function enables changing > > + the scheduling policy and/or RT priority of a thread. If the task is > > + on a runqueue, it will be first dequeued, changes will be made and > > + then enqueued. > > + - When moving tasks between scheduling groups. The runqueue of the tasks > > + is changed when moving between groups. For this purpose if the task > > + is running on a queue, it is first dequeued with DEQUEUE_SAVE, DEQUEUE_MOVE > > + and DEQUEUE_NOCLOCK flags set, followed by which scheduler function to > > + change the tsk->se.cfs_rq and tsk->se.parent and then task is enqueued > > + on the runqueue with the same flags used in dequeue. > > + > > Those are all what Peter dubs the "change pattern", you may want to just > explain the principle (we need to dequeue tasks to do some operations; and > ofc we need to put them back as they were once that's done) & pick just one > rather than exhaustively list them all. Removed describing change pattern in multiple places. > > This still applies to a lot of other places IMO - I think you need less > exhaustive listings and more "what's the gist of that". > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > > index db3a57675ccf..21f2953b72c7 100644 > > --- a/kernel/sched/sched.h > > +++ b/kernel/sched/sched.h > > @@ -865,12 +865,175 @@ struct uclamp_rq { > > }; > > #endif /* CONFIG_UCLAMP_TASK */ > > > > -/* > > - * This is the main, per-CPU runqueue data structure. > > +/** > > + * struct rq - This is the main, per-CPU runqueue data structure. > > * > > * Locking rule: those places that want to lock multiple runqueues > > * (such as the load balancing or the thread migration code), lock > > * acquire operations must be ordered by ascending &runqueue. > > + * > > + * @lock: > > + * lock to be acquired while modifying the runqueue > > + * @nr_running: > > + * number of runnable tasks on this queue > > + * @nr_numa_running: > > + * number of tasks running that care about their placement > > + * @nr_preferred_running: > > + * number of tasks that are optimally NUMA placed > > + * @numa_migrate_on: > > + * per run-queue variable to check if NUMA-balance is > > + * active on the run-queue > > + * @last_blocked_load_update_tick: > > + * tick stamp for decay of blocked load > > + * @has_blocked_load: > > + * idle CPU has blocked load > > + * @nohz_tick_stopped: > > + * CPU is going idle with tick stopped > > + * @nohz_flags: > > + * flags indicating NOHZ idle balancer actions > > + * @nr_load_updates: > > + * unused > > + * @nr_switches: > > + * number of context switches > > + * @uclamp: > > + * utilization clamp values based on CPU's RUNNABLE tasks > > + * @uclamp_flags: > > + * flags for uclamp actions, currently one flag for idle. > > + * @cfs: > > + * fair scheduling class runqueue > > + * @rt: > > + * rt scheduling class runqueue > > + * @dl: > > + * dl scheduing class runqueue > > + * @leaf_cfs_rq_list: > > + * list of leaf cfs_rq on this CPU > > + * @tmp_alone_branch: > > + * reference to add child before its parent in leaf_cfs_rq_list > > + * @nr_uninterruptible: > > + * global counter where the total sum over all CPUs matters. A task > > + * can increase this counter on one CPU and if it got migrated > > + * afterwards it may decrease it on another CPU. Always updated under > > + * the runqueue lock > > + * @curr: > > + * points to the currently running task of this rq. > > + * @idle: > > + * points to the idle task of this rq > > + * @stop: > > + * points to the stop task of this rq > > + * @next_balance: > > + * shortest next balance before updating nohz.next_balance > > + * @prev_mm: > > + * real address space of the previous task > > + * @clock_update_flags: > > + * RQCF clock_update_flags bits > > + * @clock: > > + * sched_clock() value for the queue > > + * @clock_task: > > + * clock value minus irq handling time > > + * @clock_pelt: > > + * clock which scales with current capacity when something is > > + * running on rq and synchronizes with clock_task when rq is idle > > + * @lost_idle_time: > > + * idle time lost when utilization of a rq has reached the > > + * maximum value > > + * @nr_iowait: > > + * account the idle time that we could have spend running if it > > + * were not for IO > > + * @membarrier_state: > > + * copy of membarrier_state from the mm_struct > > + * @rd: > > + * root domain, each exclusive cpuset essentially defines an island > > + * domain by fully partitioning the member CPUs from any other cpuset > > + * @sd: > > + * a domain heirarchy of CPU groups to balance process load among them > > + * @cpu_capacity: > > + * information about CPUs heterogeneity used for CPU performance > > + * scaling > > + * @cpu_capacity_orig: > > + * original capacity of a CPU before being altered by > > + * rt tasks and/or IRQ > > + * @balance_callback: > > + * queue to hold load balancing push and pull operations > > + * @idle_balance: > > + * flag to do the nohz idle load balance > > + * @misfit_task_load: > > + * set whenever the current running task has a utilization > > + * greater than 80% of rq->cpu_capacity. A non-zero value > > + * in this field enables misfit load balancing > > + * @active_balance: > > + * synchronizes accesses to ->active_balance_work > > + * @push_cpu: > > + * idle cpu to push the running task on to during active load > > + * balancing. > > + * @active_balance_work: > > + * callback scheduled to run on one or multiple cpus > > + * with maximum priority monopolozing those cpus. > > + * @cpu: > > + * CPU of this runqueue > > + * @online: > > + * Used by scheduling classes to support CPU hotplug > > + * @cfs_tasks: > > + * an MRU list used for load balancing, sorted (except > > + * woken tasks) starting from recently given CPU time tasks > > + * toward tasks with max wait time in a run-queue > > + * @avg_rt: > > + * track the utilization of RT tasks for a more accurate > > + * view of the utilization of the CPU when overloaded by CFS and > > + * RT tasks > > + * @avg_dl: > > + * track the utilization of DL tasks as CFS tasks can be preempted > > + * by DL tasks and the CFS's utilization might no longer describe > > + * the real utilization level > > + * @avg_irq: > > + * track the the utilization of interrupt to give a more accurate > > + * level of utilization of CPU taking into account the time spent > > + * under interrupt context when rqs' clock is updated > > + * @avg_thermal: > > + * tracks thermal pressure which is the reduction in maximum > > + * possible capacity due to thermal events > > + * @idle_stamp: > > + * time stamp at which idle load balance started for this rq. > > + * Used to find the idlest CPU, when multiple idle CPUs are in > > + * the same state > > + * @avg_idle: > > + * average idle time for this rq > > + * @max_idle_balance_cost: > > + * used to determine avg_idle's max value > > + * @prev_irq_time: > > + * updated to account time consumed when a previous > > + * update_rq_clock() happened inside a {soft,}irq region > > + * @prev_steal_time: > > + * to account how much elapsed time was spent in steal > > + * @prev_steal_time_rq: > > + * for fine granularity task steal time accounting by > > + * making update_rq_clock() aware of steal time > > + * @calc_load_update: > > + * sample window for global load-average calculations > > + * @calc_load_active: > > + * fold any nr_active delta into a global accumulate > > + * @hrtick_csd: > > + * call_single_data used to set hrtick timer state on a specific CPU > > + * @hrtick_timer: > > + * HR-timer to deliver an accurate preemption tick > > + * @rq_sched_info: > > + * runqueue specific latency stats > > + * @rq_cpu_time: > > + * runqueue specific accumulated per-task cpu runtime > > + * @yld_count: > > + * runqueue specific sys_sched_yield() stats > > + * @sched_count: > > + * runqueue specific __schedule() stats > > + * @sched_goidle: > > + * runqueue specific idle scheduling class stats > > + * @ttwu_count: > > + * runqueue specific idle ttwu stats , both remote and local > > + * @ttwu_local: > > + * ttwu count for the CPU of the rq > > + * @wake_list: > > + * list which stores tasks being woken up remotely by ttwu > > + * @idle_state: > > + * cpuidle state pointer of the CPU of this rq used to make a > > + * better decision when balancing tasks > > Holy moly! I appreciate the effort, but I'm thinking the targeted audience > might prefer this under the form of comments within the struct definition... This was done based on review comments to move describing the rq struct from documentation for kernel-doc comments so that documentation will be updated when parameters are changed/removed. > > > */ > > struct rq { > > /* runqueue lock: */ > > @@ -1136,7 +1299,7 @@ static inline u64 rq_clock_task(struct rq *rq) > > return rq->clock_task; > > } > > > > -/** > > +/* > > * By default the decay is the default pelt decay period. > > * The decay shift can change the decay period in > > * multiples of 32. Thanks for your review