Hi John, On Thu, May 07, 2020 at 09:05:52PM +0300, John Mathew wrote: > Add documentation for > -scheduler overview > -scheduler state transtion > -CFS overview > -scheduler data structs > > Add rst for scheduler APIs and modify sched/core.c > to add kernel-doc comments. > > Suggested-by: Lukas Bulwahn <lukas.bulwahn@xxxxxxxxx> > Co-developed-by: Mostafa Chamanara <mostafa.chamanara@xxxxxxxxxxxx> > Signed-off-by: Mostafa Chamanara <mostafa.chamanara@xxxxxxxxxxxx> > Co-developed-by: Oleg Tsymbal <oleg.tsymbal@xxxxxxxxxx> > Signed-off-by: Oleg Tsymbal <oleg.tsymbal@xxxxxxxxxx> > Signed-off-by: John Mathew <john.mathew@xxxxxxxxxx> > --- > Documentation/scheduler/cfs-overview.rst | 113 ++++++++ > Documentation/scheduler/index.rst | 7 +- > Documentation/scheduler/overview.rst | 266 ++++++++++++++++++ > .../scheduler/sched-data-structs.rst | 253 +++++++++++++++++ > Documentation/scheduler/scheduler-api.rst | 31 ++ > kernel/sched/core.c | 28 +- > kernel/sched/sched.h | 169 ++++++++++- > 7 files changed, 858 insertions(+), 9 deletions(-) > create mode 100644 Documentation/scheduler/cfs-overview.rst > create mode 100644 Documentation/scheduler/sched-data-structs.rst > create mode 100644 Documentation/scheduler/scheduler-api.rst > > diff --git a/Documentation/scheduler/cfs-overview.rst b/Documentation/scheduler/cfs-overview.rst > new file mode 100644 > index 000000000000..b717f2d3e340 > --- /dev/null > +++ b/Documentation/scheduler/cfs-overview.rst > @@ -0,0 +1,113 @@ > +.. SPDX-License-Identifier: GPL-2.0+ > + > +============= > +CFS Overview > +============= > + > +Linux 2.6.23 introduced a modular scheduler core and a Completely Fair > +Scheduler (CFS) implemented as a scheduling module. A brief overview of the > +CFS design is provided in :doc:`sched-design-CFS` > + > +In addition there have been many improvements to the CFS, a few of which are > + > +**Thermal Pressure**: > +cpu_capacity initially reflects the maximum possible capacity of a CPU. > +Thermal pressure on a CPU means this maximum possible capacity is > +unavailable due to thermal events. Average thermal pressure for a CPU > +is now subtracted from its maximum possible capacity so that cpu_capacity > +reflects the remaining maximum capacity. > + > +**Use Idle CPU for NUMA balancing**: > +Idle CPU is used as a migration target instead of comparing tasks. > +Information on an idle core is cached while gathering statistics > +and this is used to avoid a second scan of the node runqueues if load is > +not imbalanced. Preference is given to an idle core rather than an > +idle SMT sibling to avoid packing HT siblings due to linearly scanning > +the node cpumask. Multiple tasks can attempt to select an idle CPU but > +fail because a NUMA balance is active on that CPU, in this case instead of > +failing, an alternative idle CPU scanned. > + > +**Asymmetric CPU capacity wakeup scan**: > +Previous assumption that CPU capacities within an SD_SHARE_PKG_RESOURCES > +domain (sd_llc) are homogeneous didn't hold for newer generations of big.LITTLE > +systems (DynamIQ) which can accommodate CPUs of different compute capacity > +within a single LLC domain. A new idle sibling helper function was added > +which took CPU capacity into account. The policy is to pick the first idle > +CPU which is big enough for the task (task_util * margin < cpu_capacity). > +If no idle CPU is big enough, the idle CPU with the highest capacity is > +picked. > + > +**Optimized idle core selection**: > +Previously all threads of a core were looped through to evaluate if the > +core is idle or not. This was unnecessary. If a thread of a core is not > +idle, skip evaluating other threads of a core. Also while clearing the > +cpumask, bits of all CPUs of a core can be cleared in one shot. > + > +**Load balance aggressively for SCHED_IDLE CPUs**: > +The fair scheduler performs periodic load balance on every CPU to check > +if it can pull some tasks from other busy CPUs. The duration of this > +periodic load balance is set to scheduler domain's balance_interval and > +multiplied by a busy_factor (set to 32 by default) for the busy CPUs. This > +multiplication is done for busy CPUs to avoid doing load balance too > +often and rather spend more time executing actual task. While that is > +the right thing to do for the CPUs busy with SCHED_OTHER or SCHED_BATCH > +tasks, it may not be the optimal thing for CPUs running only SCHED_IDLE > +tasks. With the recent enhancements in the fair scheduler around SCHED_IDLE > +CPUs, it is now preferred to enqueue a newly-woken task to a SCHED_IDLE > +CPU instead of other busy or idle CPUs. The same reasoning is applied > +to the load balancer as well to make it migrate tasks more aggressively > +to a SCHED_IDLE CPU, as that will reduce the scheduling latency of the > +migrated (SCHED_OTHER) tasks. Fair scheduler now does the next > +load balance soon after the last non-SCHED_IDLE task is dequeued from a > +runqueue, i.e. making the CPU SCHED_IDLE. > + > +**Load balancing algorithm Reworked**: > +The load balancing algorithm contained some heuristics which became > +meaningless since the rework of the scheduler's metrics like the > +introduction of PELT. The new load balancing algorithm fixes several > +pending wrong tasks placement > + > + * the 1 task per CPU case with asymmetric system > + * the case of CFS task preempted by other class > + * the case of tasks not evenly spread on groups with spare capacity > + > +Also the load balance decisions have been consolidated in the 3 separate > +functions. > + > +**Energy-aware wake-ups speeded up**: > +EAS computes the energy impact of migrating a waking task when deciding > +on which CPU it should run. However, the previous approach had high algorithmic > +complexity, which resulted in prohibitively high wake-up latencies on > +systems with complex energy models, such as systems with per-CPU DVFS. On > +such systems, the algorithm complexity was O(n^2). To address this, > +the EAS wake-up path was re-factored to compute the energy 'delta' on a > +per-performance domain basis, rather than system-wide, which brings the > +complexity down to O(n). > + > +**Selection of an energy-efficient CPU on task wake-up**: > +If an Energy Model (EM) is available and if the system isn't overutilized, > +waking tasks are re-routed into an energy-aware placement algorithm. > +The selection of an energy-efficient CPU for a task is achieved by estimating > +the impact on system-level active energy resulting from the placement of the > +task on the CPU with the highest spare capacity in each performance domain. > +This strategy spreads tasks in a performance domain and avoids overly > +aggressive task packing. The best CPU energy-wise is then selected if it > +saves a large enough amount of energy with respect to prev_cpu. > + > +**Consider misfit tasks when load-balancing**: > +On asymmetric CPU capacity systems load intensive tasks can end up on > +CPUs that don't suit their compute demand. In this scenario 'misfit' > +tasks are migrated to CPUs with higher compute capacity to ensure better > +throughput. A new group_type: group_misfit_task is added and indicates this > +scenario. Tweaks to the load-balance code are done to make the migrations > +happen. Misfit balancing is done between a source group of lower per-CPU > +capacity and destination group of higher compute capacity. Otherwise, misfit > +balancing is ignored. > + > +**Make schedstats a runtime tunable that is disabled by default**: > +schedstats is very useful during debugging and performance tuning but it > +incurred overhead to calculate the stats. A kernel command-line and sysctl > +tunable was added to enable or disable schedstats on demand (when it's built in). > +It is disabled by default. The benefits are dependent on how > +scheduler-intensive the workload is. > + > diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst > index ede1a30a6894..f311abe5b711 100644 > --- a/Documentation/scheduler/index.rst > +++ b/Documentation/scheduler/index.rst > @@ -17,10 +17,13 @@ specific implementation differences. > :maxdepth: 2 > > overview > + sched-data-structs > + cfs-overview > sched-design-CFS > sched-features > - arch-specific.rst > - sched-debugging.rst > + arch-specific > + sched-debugging > + scheduler-api > > .. only:: subproject and html > > diff --git a/Documentation/scheduler/overview.rst b/Documentation/scheduler/overview.rst > index aee16feefc61..f2cb0c901208 100644 > --- a/Documentation/scheduler/overview.rst > +++ b/Documentation/scheduler/overview.rst > @@ -3,3 +3,269 @@ > ==================== > Scheduler overview > ==================== > + > +Linux kernel implements priority-based scheduling. More than one process are > +allowed to run at any given time and each process is allowed to run as if it > +were the only process on the system. The process scheduler coordinates which > +process runs when. In that context, it has the following tasks: > + > +* share CPU cores equally among all currently running processes. > +* pick appropriate process to run next if required, considering scheduling > + class/policy and process priorities. > +* balance processes between multiple cores in SMP systems. > + > +The scheduler attempts to be responsive for I/O bound processes and efficient > +for CPU bound processes. The scheduler also applies different scheduling > +policies for real time and normal processes based on their respective > +priorities. Higher priorities in the kernel have a numerical smaller > +value. Real time priorities range from 1 (highest) – 99 whereas normal > +priorities range from 100 – 139 (lowest). SCHED_DEADLINE tasks have negative > +priorities, reflecting the fact that any of them has higher priority than > +RT and NORMAL/BATCH tasks. > + > +Process Management > +================== > + > +Each process in the system is represented by struct task_struct. When a > +process/thread is created, the kernel allocates a new task_struct for it. > +The kernel then stores this task_struct in an RCU list. Macro next_task() > +allows a process to obtain its next task and for_each_process() macro enables > +traversal of the list. > + > +Frequently used fields of the task struct are: > + > +*state:* The running state of the task. The possible states are: > + > +* TASK_RUNNING: The task is currently running or in a run queue waiting > + to run. > +* TASK_INTERRUPTIBLE: The task is sleeping waiting for some event to occur. > + This task can be interrupted by signals. On waking up the task transitions > + to TASK_RUNNING. > +* TASK_UNINTERRUPTIBLE: Similar to TASK_INTERRUPTIBLE but does not wake > + up on signals. Needs an explicit wake-up call to be woken up. Contributes > + to loadavg. > +* __TASK_TRACED: Task is being traced by another task like a debugger. > +* __TASK_STOPPED: Task execution has stopped and not eligible to run. > + SIGSTOP, SIGTSTP etc causes this state. The task can be continued by > + the signal SIGCONT. > +* TASK_PARKED: State to support kthread parking/unparking. > +* TASK_DEAD: If a task dies, then it sets TASK_DEAD in tsk->state and calls > + schedule one last time. The schedule call will never return. > +* TASK_WAKEKILL: It works like TASK_UNINTERRUPTIBLE with the bonus that it > + can respond to fatal signals. > +* TASK_WAKING: To handle concurrent waking of the same task for SMP. > + Indicates that someone is already waking the task. > +* TASK_NOLOAD: To be used along with TASK_UNINTERRUPTIBLE to indicate > + an idle task which does not contribute to loadavg. > +* TASK_NEW: Set during fork(), to guarantee that no one will run the task, > + a signal or any other wake event cannot wake it up and insert it on > + the runqueue. > + > +*exit_state* : The exiting state of the task. The possible states are: > + > +* EXIT_ZOMBIE: The task is terminated and waiting for parent to collect > + the exit information of the task. > +* EXIT_DEAD: After collecting the exit information the task is put to > + this state and removed from the system. > + > +*static_prio:* Nice value of a task. The value of this field does > + not change. Value ranges from -20 to 19. This value is mapped to nice > + value and used in the scheduler. > + > +*prio:* Dynamic priority of a task. Previously a function of static > + priority and tasks interactivity. Value not used by CFS scheduler but used > + by the RT scheduler. Might be boosted by interactivity modifiers. Changes > + upon fork, setprio syscalls, and whenever the interactivity estimator > + recalculates. > + > +*normal_prio:* Expected priority of a task. The value of static_prio > + and normal_prio are the same for non-real-time processes. For real time > + processes value of prio is used. > + > +*rt_priority:* Field used by real time tasks. Real time tasks are > + prioritized based on this value. > + > +*sched_class:* Pointer to sched_class CFS structure. > + > +*sched_entity:* Pointer to sched_entity CFS structure. > + > +*policy:* Value for scheduling policy. The possible values are: > + > +* SCHED_NORMAL: Regular tasks use this policy. > +* SCHED_BATCH: Tasks which need to run longer without preemption > + use this policy. Suitable for batch jobs. > +* SCHED_IDLE: Policy used by background tasks. > +* SCHED_FIFO & SCHED_RR: These policies for real time tasks. Handled by > + real time scheduler. > +* SCHED_DEADLINE: Tasks which are activated on a periodic or sporadic fashion > + use this policy. This policy implements the Earliest Deadline First (EDF) > + scheduling algorithm. This policy is explained in detail in the > + :doc:`sched-deadline` documentation. > + > +*nr_cpus_allowed:* Bit field containing tasks affinity towards a set of > + CPU cores. Set using sched_setaffinity() system call. > + > +New processes are created using the fork() system call which is described > +at manpage :manpage:`FORK(2)` or the clone system call described at > +:manpage:`CLONE(2)`. > +Users can create threads within a process to achieve parallelism. Threads > +share address space, open files and other resources of the process. Threads > +are created like normal tasks with their unique task_struct, but clone() > +is provided with flags that enable the sharing of resources such as address > +space :: > + > + clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0); > + > +The scheduler schedules task_structs so from scheduler perspective there is > +no difference between threads and processes. Threads are created using > +the system call pthread_create described at :manpage:`PTHREAD_CREATE(3)` > +POSIX threads creation is described at :manpage:`PTHREADS(7)` > + > +The Scheduler Entry Point > +========================= > + > +The main scheduler entry point is an architecture independent schedule() > +function defined in kernel/sched/core.c. Its objective is to find a process in > +the runqueue list and then assign the CPU to it. It is invoked, directly > +or in a lazy (deferred) way from many different places in the kernel. A lazy > +invocation does not call the function by its name, but gives the kernel a > +hint by setting a flag TIF_NEED_RESCHED. The flag is a message to the kernel > +that the scheduler should be invoked as soon as possible because another > +process deserves to run. > + > +Following are some places that notify the kernel to schedule: > + > +* scheduler_tick() > + > +* Running task goes to sleep state : Right before a task goes to sleep, > + schedule() will be called to pick the next task to run and the change > + its state to either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. For > + instance, prepare_to_wait() is one of the functions that makes the > + task go to the sleep state. > + > +* try_to_wake_up() > + > +* yield() > + > +* wait_event() > + > +* cond_resched() : It gives the scheduler a chance to run a higher-priority > + process. > + > +* cond_resched_lock() : If a reschedule is pending, drop the given lock, > + call schedule, and on return reacquire the lock. > + > +* do_task_dead() > + > +* preempt_schedule() : The function checks whether local interrupts are > + enabled and the preempt_count field of current is zero; if both > + conditions are true, it invokes schedule() to select another process > + to run. > + > +* preempt_schedule_irq() > + > +Calling functions mentioned above leads to a call to __schedule(). Note > +that preemption must be disabled before it is called and enabled after > +the call using preempt_disable and preempt_enable functions family. > + > + > +The steps during invocation are: > +-------------------------------- > +1. Disable preemption to avoid another task preempting the scheduling > + thread itself. > +2. Retrieve the runqueue of current processor and its lock is obtained to > + allow only one thread to modify the runqueue at a time. > +3. The state of the previously executed task when the schedule() > + was called is examined. If it is not runnable and has not been > + preempted in kernel mode, it is removed from the runqueue. If the > + previous task has non-blocked pending signals, its state is set to > + TASK_RUNNING and left in the runqueue. > +4. Scheduler classes are iterated and the corresponding class hook to > + pick the next suitable task to be scheduled on the CPU is called. > + Since most tasks are handled by the sched_fair class, a shortcut to this > + class is implemented in the beginning of the function. > +5. TIF_NEED_RESCHED and architecture specific need_resched flags are cleared. > +6. If the scheduler class picks a different task from what was running > + before, a context switch is performed by calling context_switch(). > + Internally, context_switch() switches to the new task's memory map and > + swaps the register state and stack. If scheduler class picked the same > + task as the previous task, no task switch is performed and the current > + task keeps running. > +7. Balance callback list is processed. Each scheduling class can migrate tasks > + between CPUs to balance load. These load balancing operations are queued > + on a Balance callback list which get executed when balance_callback() is > + called. > +8. The runqueue is unlocked and preemption is re-enabled. In case > + preemption was requested during the time in which it was disabled, > + schedule() is run again right away. > + > +Scheduler State Transition > +========================== > + > +A very high level scheduler state transition flow with a few states can > +be depicted as follows. :: > + > + * > + | > + | task > + | forks > + v > + +------------------------------+ > + | TASK_NEW | > + | (Ready to run) | > + +------------------------------+ > + | > + | > + v > + +------------------------------------+ > + | TASK_RUNNING | > + +---------------> | (Ready to run) | <--+ > + | +------------------------------------+ | > + | | | > + | | schedule() calls context_switch() | task is preempted > + | v | > + | +------------------------------------+ | > + | | TASK_RUNNING | | > + | | (Running) | ---+ > + | event occurred +------------------------------------+ > + | | > + | | task needs to wait for event > + | v > + | +------------------------------------+ > + | | TASK_INTERRUPTIBLE | > + | | TASK_UNINTERRUPTIBLE | > + +-----------------| TASK_WAKEKILL | > + +------------------------------------+ > + | > + | task exits via do_exit() > + v > + +------------------------------+ > + | TASK_DEAD | > + | EXIT_ZOMBIE | > + +------------------------------+ > + > + > +Scheduler provides tracepoints tracing all major events of the scheduler. > +The tracepoints are defined in :: > + > + include/trace/events/sched.h > + > +Using these tracepoints it is possible to model the scheduler state transition > +in an automata model. The following journal paper discusses such modeling: > + > +Daniel B. de Oliveira, Rômulo S. de Oliveira, Tommaso Cucinotta, **A thread > +synchronization model for the PREEMPT_RT Linux kernel**, *Journal of Systems > +Architecture*, Volume 107, 2020, 101729, ISSN 1383-7621, > +https://doi.org/10.1016/j.sysarc.2020.101729. > + > +To model the scheduler efficiently the system was divided in to generators > +and specifications. Some of the generators used were "need_resched", > +"sleepable" and "runnable", "thread_context" and "scheduling context". > +The specifications are the necessary and sufficient conditions to call > +the scheduler. New trace events were added to specify the generators > +and specifications. In case a kernel event referred to more than one > +event, extra fields of the kernel event was used to distinguish between > +automation events. The final model was generated from parallel composition > +of all generators and specifications which composed of 34 events, > +12 generators and 33 specifications. This resulted in 9017 states, and > +20103 transitions. > diff --git a/Documentation/scheduler/sched-data-structs.rst b/Documentation/scheduler/sched-data-structs.rst > new file mode 100644 > index 000000000000..ec900367cc0c > --- /dev/null > +++ b/Documentation/scheduler/sched-data-structs.rst > @@ -0,0 +1,253 @@ > +.. SPDX-License-Identifier: GPL-2.0+ > + > +========================= > +Scheduler Data Structures > +========================= > + > +The main parts of the Linux scheduler are: > + > +Runqueue > +~~~~~~~~ > + > +:c:type:`struct rq <rq>` is the central data structure of process > +scheduling. It keeps track of tasks that are in a runnable state assigned > +for a particular processor. Each CPU has its own run queue and stored in a > +per CPU array:: > + > + DEFINE_PER_CPU(struct rq, runqueues); > + > +Access to the queue requires locking and lock acquire operations must be > +ordered by ascending runqueue. Macros for accessing and locking the runqueue > +are provided in:: > + > + kernel/sched/sched.h > + > +The runqueue contains scheduling class specific queues and several scheduling > +statistics. > + > +Scheduling entity > +~~~~~~~~~~~~~~~~~ > +Scheduler uses scheduling entities which contain sufficient information to > +actually accomplish the scheduling job of a task or a task-group. The > +scheduling entity may be a group of tasks or a single task. Every task is > +associated with a sched_entity structure. CFS adds support for nesting of > +tasks and task groups. Each scheduling entity may be run from its parents > +runqueue. The scheduler traverses the sched_entity hierarchy to pick the > +next task to run on the CPU. The entity gets picked up from the cfs_rq on > +which it is queued and its time slice is divided among all the tasks on its my_q. > + > +Virtual Runtime > +~~~~~~~~~~~~~~~~~ > +Virtual Run Time or vruntime is the amount of time a task has spent running > +on the CPU. It is updated periodically by scheduler_tick(). Tasks are stored > +in the CFS scheduling class rbtree sorted by vruntime. scheduler_tick() calls > +corresponding hook of CFS which first updates the runtime statistics of the > +currently running task and checks if the current task needs to be preempted. > +vruntime of the task based on the formula :: > + > + vruntime += delta_exec * (NICE_0_LOAD/curr->load.weight); > + > +where: > + > +* delta_exec is the time in nanoseconds spent by the task since the last time > + vruntime was updated. > +* NICE_0_LOAD is the load of a task with normal priority. > +* curr is the shed_entity instance of the cfs_rq struct of the currently s/shed_entity/sched_entity/ > + running task. > +* load.weight: sched_entity load_weight. load_weight is the encoding of > + the tasks priority and vruntime. The load of a task is the metric > + indicating the number of CPUs needed to make satisfactory progress on its > + job. Load of a task influences the time a task spends on the CPU and also > + helps to estimate the overall CPU load which is needed for load balancing. > + Priority of the task is not enough for the scheduler to estimate the > + vruntime of a process. So priority value must be mapped to the capacity of > + the standard CPU which is done in the array :c:type:`sched_prio_to_weight[]`. > + The array contains mappings for the nice values from -20 to 19. Nice value > + 0 is mapped to 1024. Each entry advances by approximately 1.25 which means > + for every increment in nice value the task gets 10% less CPU and vice versa. > + > +Scheduler classes > +~~~~~~~~~~~~~~~~~ > +It is an extensible hierarchy of scheduler modules. The modules encapsulate > +scheduling policy details. They are called from the core code which is > +independent. Scheduling classes are implemented through the sched_class > +structure. dl_sched_class for deadline scheduler, fair_sched_class for CFS > +and rt_sched_class for RT are implementations of this class. > + > +The important methods of scheduler class are: > + > +enqueue_task and dequeue_task > + These functions are used to put and remove tasks from the runqueue > + respectively. The function takes the runqueue, the task which needs to > + be enqueued/dequeued and a bit mask of flags. The main purpose of the > + flags is to describe why the enqueue or dequeue is being called. > + The different flags used are described in :: > + > + kernel/sched/sched.h > + > + enqueue_task and dequeue_task are called for following purposes. > + > + - When waking a newly created task for the first time. Called with > + ENQUEUE_NOCLOCK > + - When migrating a task from one CPU's runqueue to another. Task will be > + first dequeued from its old runqueue, new CPU will be added to the > + task struct, runqueue of the new CPU will be retrieved and task is > + then enqueued on this new runqueue. > + - When do_set_cpus_allowed() is called to change a tasks CPU affinity. If > + the task is queued on a runqueue, it is first dequeued with the > + DEQUEUE_SAVE and DEQUEUE_NOCLOCK flags set. The set_cpus_allowed() > + function of the corresponding scheduling class will be called. > + enqueue_task() is then called with ENQUEUE_RESTORE and ENQUEUE_NOCLOCK > + flags set. > + - When changing the priority of a task using rt_mutex_setprio(). This > + function implements the priority inheritance logic of the RT mutex > + code. This function changes the effective priority of a task which may > + in turn change the scheduling class of the task. If so enqueue_task is > + called with flags corresponding to each class. > + - When user changes the nice value of the task. If the task is queued on > + a runqueue, it first needs to be dequeued, then its load weight and > + effective priority needs to be set. Following which the task is > + enqueued with ENQUEUE_RESTORE and ENQUEUE_NOCLOCK flags set. > + - When __sched_setscheduler() is called. This function enables changing > + the scheduling policy and/or RT priority of a thread. If the task is > + on a runqueue, it will be first dequeued, changes will be made and > + then enqueued. > + - When moving tasks between scheduling groups. The runqueue of the tasks > + is changed when moving between groups. For this purpose if the task > + is running on a queue, it is first dequeued with DEQUEUE_SAVE, DEQUEUE_MOVE > + and DEQUEUE_NOCLOCK flags set, followed by which scheduler function to > + change the tsk->se.cfs_rq and tsk->se.parent and then task is enqueued > + on the runqueue with the same flags used in dequeue. > + > +pick_next_task > + Called by __schedule() to pick the next best task to run. > + Scheduling class structure has a pointer pointing to the next scheduling > + class type and each scheduling class is linked using a singly linked list. > + The __schedule() function iterates through the corresponding > + functions of the scheduler classes in priority order to pick up the next > + best task to run. Since tasks belonging to the idle class and fair class > + are frequent, the scheduler optimizes the picking of next task to call > + the pick_next_task_fair() if the previous task was of the similar > + scheduling class. > + > +put_prev_task > + Called by the scheduler when a running task is being taken off a CPU. > + The behavior of this function depends on individual scheduling classes > + and called in the following cases. > + > + - When do_set_cpus_allowed() is called and if the task is currently running. > + - When scheduler pick_next_task() is called, the put_prev_task() is > + called with the previous task as function argument. > + - When rt_mutex_setprio() is called and if the task is currently running. > + - When user changes the nice value of the task and if the task is > + currently running. > + - When __sched_setscheduler() is called and if the task is currently > + running. > + - When moving tasks between scheduling groups through the sched_move_task() > + and if the task is ćurrently running. s/ćurrently/currently Thanks, Tau > + > + In CFS class this function is used to put the currently running task back > + into the CFS RB tree. When a task is running it is dequeued from the tree. > + This is to prevent redundant enqueue's and dequeue's for updating its > + vruntime. vruntime of tasks on the tree needs to be updated by update_curr() > + to keep the tree in sync. In DL(deadline) and RT classes additional tree is > + maintained for facilitating task migration between CPUs through push > + operation between runqueues for load balancing. Task will be added to > + this queue if it is present on the scheduling class rq and the task has > + affinity to more than one CPU. > + > +set_next_task > + Pairs with the put_prev_task(), this function is called when the next > + task is set to run on the CPU. This function is called in all the places > + where put_prev_task is called to complete the 'change'. Change is defined > + as the following sequence of calls:: > + > + - dequeue task > + - put task > + - change the property > + - enqueue task > + - set task as current task > + > + It resets the run time statistics for the entity with > + the runqueue clock. > + In case of CFS scheduling class, it will set the pointer to the current > + scheduling entity to the picked task and accounts bandwidth usage on > + the cfs_rq. In addition it will also remove the current entity from the > + CFS runqueue for vruntime update optimization opposite to what was done > + in put_prev_task. > + For the SCHED_DEADLINE and RT classes it will > + > + - dequeue the picked task from the tree of pushable tasks. > + - update the load average in case the previous task belonged to another > + class. > + - queues the function to push tasks from current runqueue to other CPUs > + which can preempt and start execution. Balance callback list is used. > + > +task_tick > + Called from scheduler_tick(), hrtick() and sched_tick_remote() to update > + the current task statistics and load averages. Also restarting the high > + resolution tick timer is done if high resolution timers are enabled. > + scheduler_tick() runs at 1/HZ and is called from the timer interrupt > + handler of the Kernel internal timers. > + hrtick() is called from high resolution timers to deliver an accurate > + preemption tick as the regular scheduler tick that runs at 1/HZ can be > + too coarse when nice levels are used. > + sched_tick_remote() Gets called by the offloaded residual 1Hz scheduler > + tick. In order to reduce interruptions to bare metal tasks, it is possible > + to outsource these scheduler ticks to the global workqueue so that a > + housekeeping CPU handles those remotely. > + > +select_task_rq > + Called by scheduler to get the CPU to assign a task to and migrating > + tasks between CPUs. Flags describe the reason the function was called. > + Called by try_to_wake_up() with SD_BALANCE_WAKE flag which wakes up a > + sleeping task. > + Called by wake_up_new_task() with SD_BALANCE_FORK flag which wakes up a > + newly forked task. > + Called by sched_exec() with SD_BALANCE_EXEC which is called from execv > + syscall. > + SCHED_DEADLINE class decides the CPU on which the task should be woken > + up based on the deadline. RT class decides based on the RT priority. Fair > + scheduling class balances load by selecting the idlest CPU in the > + idlest group, or under certain conditions an idle sibling CPU if the > + domain has SD_WAKE_AFFINE set. > + > +balance > + Called by pick_next_task() from scheduler to enable scheduling classes > + to pull tasks from runqueues of other CPUs for balancing task execution > + between the CPUs. > + > +task_fork > + Called from sched_fork() of scheduler which assigns a task to a CPU. > + Fair scheduling class updates runqueue clock, runtime statistics and > + vruntime for the scheduling entity. > + > +yield_task > + Called from SYSCALL sched_yield to yield the CPU to other tasks. > + SCHED_DEADLINE class forces the runtime of the task to zero using a special > + flag and dequeues the task from its trees. RT class requeues the task > + entities to the end of the run list. Fair scheduling class implements > + the buddy mechanism. This allows skipping onto the next highest priority > + scheduling entity at every level in the CFS tree, unless doing so would > + introduce gross unfairness in CPU time distribution. > + > +check_preempt_curr > + Check whether the task that woke up should preempt the currently > + running task. Called by scheduler, > + > + * when moving queued task to new runqueue > + * ttwu() > + * when waking up newly created task for the first time. > + > + SCHED_DEADLINE class compares the deadlines of the tasks and calls > + scheduler function resched_curr() if the preemption is needed. In case > + the deadlines are equal, migratability of the tasks is used a criteria > + for preemption. > + RT class behaves the same except it uses RT priority for comparison. > + Fair class sets the buddy hints before calling resched_curr() to preempt. > + > +Scheduler sets the scheduler class for each task based on its priority. > +Tasks assigned with SCHED_NORMAL, SCHED_IDLE and SCHED_BATCH call > +fair_sched_class hooks and tasks assigned with SCHED_RR and > +SCHED_FIFO call rt_sched_class hooks. Tasks assigned with SCHED_DEADLINE > +policy calls dl_sched_class hooks. > diff --git a/Documentation/scheduler/scheduler-api.rst b/Documentation/scheduler/scheduler-api.rst > new file mode 100644 > index 000000000000..1fc6bd4c2908 > --- /dev/null > +++ b/Documentation/scheduler/scheduler-api.rst > @@ -0,0 +1,31 @@ > +.. SPDX-License-Identifier: GPL-2.0+ > + > +============================= > +Scheduler related functions > +============================= > + > + > +.. kernel-doc:: kernel/sched/core.c > + :functions: __schedule > + > +.. kernel-doc:: kernel/sched/core.c > + :functions: scheduler_tick > + > +.. kernel-doc:: kernel/sched/core.c > + :functions: try_to_wake_up > + > +.. kernel-doc:: kernel/sched/core.c > + :functions: do_task_dead > + > +.. kernel-doc:: kernel/sched/core.c > + :functions: preempt_schedule_irq > + > +.. kernel-doc:: kernel/sched/core.c > + :functions: prepare_task_switch > + > +.. kernel-doc:: kernel/sched/core.c > + :functions: finish_task_switch > + > +.. kernel-doc:: kernel/sched/sched.h > + :functions: rq > + > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 9a2fbf98fd6f..b349ed9b4d92 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3576,9 +3576,13 @@ void arch_set_thermal_pressure(struct cpumask *cpus, > WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure); > } > > -/* > +/** > + * scheduler_tick - sched tick timer handler > + * > * This function gets called by the timer code, with HZ frequency. > * We call it with interrupts disabled. > + * > + * Return: 0. > */ > void scheduler_tick(void) > { > @@ -3959,8 +3963,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > BUG(); > } > > -/* > - * __schedule() is the main scheduler function. > +/** > + * __schedule() - the main scheduler function. > + * > + * @preempt: preemption enabled/disabled > * > * The main means of driving the scheduler and thus entering this function are: > * > @@ -4089,6 +4095,12 @@ static void __sched notrace __schedule(bool preempt) > balance_callback(rq); > } > > +/** > + * do_task_dead - handle task exit > + * > + * Changes the the task state to TASK_DEAD and calls > + * schedule to pick next task to run. > + */ > void __noreturn do_task_dead(void) > { > /* Causes final put_task_struct in finish_task_switch(): */ > @@ -4320,7 +4332,8 @@ EXPORT_SYMBOL_GPL(preempt_schedule_notrace); > > #endif /* CONFIG_PREEMPTION */ > > -/* > +/** > + * preempt_schedule_irq - schedule from irq context > * This is the entry point to schedule() from kernel preemption > * off of irq context. > * Note, that this is called and return with irqs disabled. This will > @@ -5618,6 +5631,13 @@ SYSCALL_DEFINE0(sched_yield) > } > > #ifndef CONFIG_PREEMPTION > +/** > + * _cond_resched - explicit rescheduling > + * > + * gives the scheduler a chance to run a higher-priority process > + * > + * Return: 1 if reschedule was done, 0 if reschedule not done. > + */ > int __sched _cond_resched(void) > { > if (should_resched(0)) { > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index db3a57675ccf..21f2953b72c7 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -865,12 +865,175 @@ struct uclamp_rq { > }; > #endif /* CONFIG_UCLAMP_TASK */ > > -/* > - * This is the main, per-CPU runqueue data structure. > +/** > + * struct rq - This is the main, per-CPU runqueue data structure. > * > * Locking rule: those places that want to lock multiple runqueues > * (such as the load balancing or the thread migration code), lock > * acquire operations must be ordered by ascending &runqueue. > + * > + * @lock: > + * lock to be acquired while modifying the runqueue > + * @nr_running: > + * number of runnable tasks on this queue > + * @nr_numa_running: > + * number of tasks running that care about their placement > + * @nr_preferred_running: > + * number of tasks that are optimally NUMA placed > + * @numa_migrate_on: > + * per run-queue variable to check if NUMA-balance is > + * active on the run-queue > + * @last_blocked_load_update_tick: > + * tick stamp for decay of blocked load > + * @has_blocked_load: > + * idle CPU has blocked load > + * @nohz_tick_stopped: > + * CPU is going idle with tick stopped > + * @nohz_flags: > + * flags indicating NOHZ idle balancer actions > + * @nr_load_updates: > + * unused > + * @nr_switches: > + * number of context switches > + * @uclamp: > + * utilization clamp values based on CPU's RUNNABLE tasks > + * @uclamp_flags: > + * flags for uclamp actions, currently one flag for idle. > + * @cfs: > + * fair scheduling class runqueue > + * @rt: > + * rt scheduling class runqueue > + * @dl: > + * dl scheduing class runqueue > + * @leaf_cfs_rq_list: > + * list of leaf cfs_rq on this CPU > + * @tmp_alone_branch: > + * reference to add child before its parent in leaf_cfs_rq_list > + * @nr_uninterruptible: > + * global counter where the total sum over all CPUs matters. A task > + * can increase this counter on one CPU and if it got migrated > + * afterwards it may decrease it on another CPU. Always updated under > + * the runqueue lock > + * @curr: > + * points to the currently running task of this rq. > + * @idle: > + * points to the idle task of this rq > + * @stop: > + * points to the stop task of this rq > + * @next_balance: > + * shortest next balance before updating nohz.next_balance > + * @prev_mm: > + * real address space of the previous task > + * @clock_update_flags: > + * RQCF clock_update_flags bits > + * @clock: > + * sched_clock() value for the queue > + * @clock_task: > + * clock value minus irq handling time > + * @clock_pelt: > + * clock which scales with current capacity when something is > + * running on rq and synchronizes with clock_task when rq is idle > + * @lost_idle_time: > + * idle time lost when utilization of a rq has reached the > + * maximum value > + * @nr_iowait: > + * account the idle time that we could have spend running if it > + * were not for IO > + * @membarrier_state: > + * copy of membarrier_state from the mm_struct > + * @rd: > + * root domain, each exclusive cpuset essentially defines an island > + * domain by fully partitioning the member CPUs from any other cpuset > + * @sd: > + * a domain heirarchy of CPU groups to balance process load among them > + * @cpu_capacity: > + * information about CPUs heterogeneity used for CPU performance > + * scaling > + * @cpu_capacity_orig: > + * original capacity of a CPU before being altered by > + * rt tasks and/or IRQ > + * @balance_callback: > + * queue to hold load balancing push and pull operations > + * @idle_balance: > + * flag to do the nohz idle load balance > + * @misfit_task_load: > + * set whenever the current running task has a utilization > + * greater than 80% of rq->cpu_capacity. A non-zero value > + * in this field enables misfit load balancing > + * @active_balance: > + * synchronizes accesses to ->active_balance_work > + * @push_cpu: > + * idle cpu to push the running task on to during active load > + * balancing. > + * @active_balance_work: > + * callback scheduled to run on one or multiple cpus > + * with maximum priority monopolozing those cpus. > + * @cpu: > + * CPU of this runqueue > + * @online: > + * Used by scheduling classes to support CPU hotplug > + * @cfs_tasks: > + * an MRU list used for load balancing, sorted (except > + * woken tasks) starting from recently given CPU time tasks > + * toward tasks with max wait time in a run-queue > + * @avg_rt: > + * track the utilization of RT tasks for a more accurate > + * view of the utilization of the CPU when overloaded by CFS and > + * RT tasks > + * @avg_dl: > + * track the utilization of DL tasks as CFS tasks can be preempted > + * by DL tasks and the CFS's utilization might no longer describe > + * the real utilization level > + * @avg_irq: > + * track the the utilization of interrupt to give a more accurate > + * level of utilization of CPU taking into account the time spent > + * under interrupt context when rqs' clock is updated > + * @avg_thermal: > + * tracks thermal pressure which is the reduction in maximum > + * possible capacity due to thermal events > + * @idle_stamp: > + * time stamp at which idle load balance started for this rq. > + * Used to find the idlest CPU, when multiple idle CPUs are in > + * the same state > + * @avg_idle: > + * average idle time for this rq > + * @max_idle_balance_cost: > + * used to determine avg_idle's max value > + * @prev_irq_time: > + * updated to account time consumed when a previous > + * update_rq_clock() happened inside a {soft,}irq region > + * @prev_steal_time: > + * to account how much elapsed time was spent in steal > + * @prev_steal_time_rq: > + * for fine granularity task steal time accounting by > + * making update_rq_clock() aware of steal time > + * @calc_load_update: > + * sample window for global load-average calculations > + * @calc_load_active: > + * fold any nr_active delta into a global accumulate > + * @hrtick_csd: > + * call_single_data used to set hrtick timer state on a specific CPU > + * @hrtick_timer: > + * HR-timer to deliver an accurate preemption tick > + * @rq_sched_info: > + * runqueue specific latency stats > + * @rq_cpu_time: > + * runqueue specific accumulated per-task cpu runtime > + * @yld_count: > + * runqueue specific sys_sched_yield() stats > + * @sched_count: > + * runqueue specific __schedule() stats > + * @sched_goidle: > + * runqueue specific idle scheduling class stats > + * @ttwu_count: > + * runqueue specific idle ttwu stats , both remote and local > + * @ttwu_local: > + * ttwu count for the CPU of the rq > + * @wake_list: > + * list which stores tasks being woken up remotely by ttwu > + * @idle_state: > + * cpuidle state pointer of the CPU of this rq used to make a > + * better decision when balancing tasks > */ > struct rq { > /* runqueue lock: */ > @@ -1136,7 +1299,7 @@ static inline u64 rq_clock_task(struct rq *rq) > return rq->clock_task; > } > > -/** > +/* > * By default the decay is the default pelt decay period. > * The decay shift can change the decay period in > * multiples of 32. > -- > 2.17.1 >