Re: [RFC PATCH 2/3] docs: scheduler: Add scheduler overview documentation

Peter Zijlstra <peterz@xxxxxxxxxxxxx> · Wed, 1 Apr 2020 12:35:20 +0200

On Wed, Apr 01, 2020 at 01:00:28PM +0300, John Mathew wrote:

I dispise RST, it's an unreadable mess, but I did skim the document and
felt I should comment on this:

> +* _cond_resched() : It gives the scheduler a chance to run a
> +  higher-priority process.
> +
> +* __cond_resched_lock() :  if a reschedule is pending, drop the given
> +  lock, call schedule, and on return reacquire the lock.

Those are not functions anybody should be using; the normal entry points
are: cond_resched() and cond_resched_lock().

> +Scheduler State Transition
> +==========================
> +
> +A very high level scheduler state transition flow with a few states can be
> +depicted as follows.
> +
> +.. kernel-render:: DOT
> +   :alt: DOT digraph of Scheduler state transition
> +   :caption: Scheduler state transition
> +
> +   digraph sched_transition {
> +      node [shape = point,  label="exisiting task\n calls fork()"]; fork
> +      node [shape = box, label="TASK_NEW\n(Ready to run)"] tsk_new;
> +      node [shape = box, label="TASK_RUNNING\n(Ready to run)"] tsk_ready_run;
> +      node [shape = box, label="TASK_RUNNING\n(Running)"] tsk_running;
> +      node [shape = box, label="TASK_DEAD\nEXIT_ZOMBIE"] exit_zombie;
> +      node [shape = box, label="TASK_INTERRUPTIBLE\nTASK_UNINTERRUPTIBLE\nTASK_WAKEKILL"] tsk_int;
> +      fork -> tsk_new [ label = "task\nforks" ];
> +      tsk_new -> tsk_ready_run;
> +      tsk_ready_run -> tsk_running [ label = "schedule() calls context_switch()" ];
> +      tsk_running -> tsk_ready_run [ label = "task is pre-empted" ];
> +      subgraph int {
> +         tsk_running -> tsk_int [ label = "task needs to wait for event" ];
> +         tsk_int ->  tsk_ready_run [ label = "event occurred" ];
> +      }
> +      tsk_int ->  exit_zombie [ label = "task exits via do_exit()" ];
> +   }

And that is a prime example of why I hates RST, it pretty much mandates
you view this with something other than a text editor.

Also, Daniel, you modeled all this, is the above anywhere close?

> +Scheduler provides trace points tracing all major events of the scheduler.
> +The tracepoints are defined in ::
> +
> +  include/trace/events/sched.h
> +
> +Using these treacepoints it is possible to model the scheduler state transition
> +in an automata model. The following conference paper discusses such modeling.
> +
> +https://www.researchgate.net/publication/332440267_Modeling_the_Behavior_of_Threads_in_the_PREEMPT_RT_Linux_Kernel_Using_Automata

Ah, you've found Daniel ;-)

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..ccefc820557f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3578,8 +3578,12 @@ unsigned long long task_sched_runtime(struct task_struct *p)
>  	return ns;
>  }
>  
> -/*
> - * This function gets called by the timer code, with HZ frequency.
> +/**
> + * scheduler_tick -
> + *
> + * This function is called on every timer interrupt with HZ frequency and
> + * calls scheduler on any task that has used up its quantum of CPU time.
> + *
>   * We call it with interrupts disabled.
>   */
>  void scheduler_tick(void)
> @@ -3958,8 +3962,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	BUG();
>  }
>  
> -/*
> - * __schedule() is the main scheduler function.
> +/**
> + * __schedule() - The main scheduler function.
>   *
>   * The main means of driving the scheduler and thus entering this function are:
>   *
> @@ -4086,6 +4090,12 @@ static void __sched notrace __schedule(bool preempt)
>  	balance_callback(rq);
>  }
>  
> +/**
> + * do_task_dead - Final step of task exit
> + *
> + *  Changes the the task state to TASK_DEAD and calls schedule to pick next
> + *  task to run.
> + */

That has whitespace damage.

>  void __noreturn do_task_dead(void)
>  {
>  	/* Causes final put_task_struct in finish_task_switch(): */
> @@ -4244,7 +4254,9 @@ static void __sched notrace preempt_schedule_common(void)
>  }
>  
>  #ifdef CONFIG_PREEMPTION
> -/*
> +/**
> + * preempt_schedule -
> + *
>   * This is the entry point to schedule() from in-kernel preemption
>   * off of preempt_enable.
>   */
> @@ -4316,7 +4328,9 @@ EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
>  
>  #endif /* CONFIG_PREEMPTION */
>  
> -/*
> +/**
> + * preempt_schedule_irq -
> + *
>   * This is the entry point to schedule() from kernel preemption
>   * off of irq context.
>   * Note, that this is called and return with irqs disabled. This will
> @@ -5614,6 +5628,11 @@ SYSCALL_DEFINE0(sched_yield)
>  }
>  
>  #ifndef CONFIG_PREEMPTION
> +/**
> + * _cond_resched -
> + *
> + * gives the scheduler a chance to run a higher-priority process
> + */
>  int __sched _cond_resched(void)
>  {
>  	if (should_resched(0)) {
> @@ -5626,9 +5645,10 @@ int __sched _cond_resched(void)
>  EXPORT_SYMBOL(_cond_resched);
>  #endif
>  
> -/*
> - * __cond_resched_lock() - if a reschedule is pending, drop the given lock,
> +/**
> + * __cond_resched_lock - if a reschedule is pending, drop the given lock,
>   * call schedule, and on return reacquire the lock.
> + * @lock: target lock
>   *
>   * This works OK both with and without CONFIG_PREEMPTION. We do strange low-level
>   * operations here to prevent schedule() from being called twice (once via

Just know that the first time someone comes and whines about how a
scheduler comment doesn't build or generates bad output, I remove the
/** kerneldoc thing.

Also, like I said above, _cond_resched() and __cond_resched_lock()
really should not be exposed like this, they're not API.