Re: question on priority imbalance in Linux SMP scheduler!

On Friday 28 November 2008 16:08:31 Thirupathiah Annapureddy wrote:
> Hi All,

Hello

> I am new to the Linux kernel. I was referring to an interesting article titled
> "Inside the Linux scheduler", published at
> http://www.ibm.com/developerworks/linux/library/l-scheduler/.

Note that this article is rather outdated: it describes the old O(1) scheduler, not the current CFS (introduced in 2.6.23).

One of the major differences is that normal tasks, i.e. tasks with a priority between 100 and 139, are handled by CFS (SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE). Priorities 99 down to 1 (or 0, if you count the top-priority real-time kernel priority not accessible from user-space) are handled by SCHED_RR/SCHED_FIFO.
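To see the two classes from user-space, here is a minimal sketch of my own (plain POSIX calls, nothing scheduler-internal): SCHED_FIFO/SCHED_RR take a static priority of 1-99, while SCHED_OTHER (SCHED_NORMAL inside the kernel) must use priority 0 and is weighted by its nice value instead, -20..19 mapping onto static priority 100..139:

/* prio_classes.c - RT vs. normal scheduling class, from user-space.
 * The SCHED_FIFO part needs root (or CAP_SYS_NICE).
 * Build: gcc -o prio_classes prio_classes.c
 */
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
    struct sched_param sp;

    /* Real-time class: user-visible priority 1..99, mapped onto
     * the top of the kernel's 140-slot priority range. */
    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = 50;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)
        printf("SCHED_FIFO failed: %s (needs root)\n", strerror(errno));
    else
        printf("now SCHED_FIFO, rt priority %d\n", sp.sched_priority);

    /* Normal class: sched_priority must be 0; CFS weights the task
     * by its nice value instead. */
    memset(&sp, 0, sizeof(sp));
    if (sched_setscheduler(0, SCHED_OTHER, &sp) == -1)
        printf("SCHED_OTHER failed: %s\n", strerror(errno));
    else
        printf("back to SCHED_OTHER, nice %d\n",
               getpriority(PRIO_PROCESS, 0));

    return 0;
}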

Each CPU has its own runqueue (struct rq), which contains both the real-time runqueue (rt_rq) and cfs_rq, which holds the red-black tree of all runnable normal tasks.
The runqueue for real-time tasks is very similar to the old O(1) scheduler's, except that it has 100 slots instead of 140, and consists of only one priority array, not two (active and expired) as in O(1).
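Very roughly, and heavily simplified (a compilable sketch of my own, not the real definitions -- see struct rq, struct cfs_rq and struct rt_rq in kernel/sched.c):

/* Toy model of the per-CPU runqueue layout described above. */
#define MAX_RT_PRIO   100
#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct list_head { struct list_head *next, *prev; };
struct rb_root { void *rb_node; };
struct task_struct;

/* RT side: O(1)-style priority array, but only one (no expired array). */
struct rt_rq {
    unsigned long bitmap[(MAX_RT_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG];
    struct list_head queue[MAX_RT_PRIO];  /* one FIFO list per priority */
};

/* CFS side: red-black tree ordered by virtual runtime. */
struct cfs_rq {
    struct rb_root tasks_timeline;
};

struct rq {
    unsigned long nr_running;
    struct cfs_rq cfs;           /* SCHED_NORMAL/BATCH/IDLE tasks */
    struct rt_rq rt;             /* SCHED_FIFO/RR tasks */
    struct task_struct *curr;    /* what is running right now */
};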

<offtopic>
does anyone know how the old O(1) scheduler handled real-time tasks with respect to the two queues and the active/expired swapping?
</offtopic>

> I have a question related to "priority imbalance". The article mentions
> that each CPU has a runqueue made up of 140 priority lists i.e. per CPU
> ready queue. Hence it is possible that a runnable high priority task waits
> for a CPU in ready queue while a lower priority task is running on a
> different CPU. This can lead to priority imbalance/priority inversion. How
> does linux SMP scheduler handle this priority imbalance?

Well, for starters, balancing tasks perfectly on n processors is NP-hard; it is essentially the bin-packing problem. The scheduler would have to take all tasks, calculate their load and try to distribute them across all processors, with the capacity of each CPU set to total_load / nr_cpus, so in practice it settles for heuristics.
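As a toy illustration (nothing to do with the actual kernel code): even on two CPUs, the obvious greedy heuristic -- always hand the next task to the least-loaded CPU -- can miss a split that an exhaustive (exponential) search would find:

/* greedy_balance.c - why perfect balancing is settled with heuristics.
 * Build: gcc -o greedy_balance greedy_balance.c
 */
#include <stdio.h>

#define NR_CPUS 2

int main(void)
{
    /* A perfect 6/6 split exists (3+3 and 2+2+2), but greedy
     * assignment to the least-loaded CPU ends up with 7/5. */
    int load[] = { 3, 3, 2, 2, 2 };
    int n = sizeof(load) / sizeof(load[0]);
    int cpu_load[NR_CPUS] = { 0 };
    int total = 0, i, c;

    for (i = 0; i < n; i++) {
        int best = 0;
        for (c = 1; c < NR_CPUS; c++)
            if (cpu_load[c] < cpu_load[best])
                best = c;
        cpu_load[best] += load[i];
        total += load[i];
    }

    printf("target per CPU: %d\n", total / NR_CPUS);
    for (c = 0; c < NR_CPUS; c++)
        printf("cpu%d: %d\n", c, cpu_load[c]);
    return 0;
}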

Every so often, a scheduler tick will happen (even with CFS -- CONFIG_NO_HZ only stops the tick on idle CPUs), and when it does, the currently running task may be reinserted into the runqueue and a new task selected, if that is appropriate.

And this is basically the crux of the load balancing: the scheduler will not try to balance the load while it has tasks to run, only when it is idle! load_balance() *does not* push a task to another CPU; it is used to retrieve a task from another CPU. So when CPU A is idle, it will check whether CPU B has a task it can take in order to balance the load.

Well, to confuse you a bit, tasks *can* be pushed, by the migration threads (see active_load_balance() in sched.c), but the more common path is that scheduler_tick() triggers the load-balancing machinery on an idle CPU.
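Conceptually, the pull model looks like this (a toy sketch of my own, not kernel code; the real path goes scheduler_tick() -> trigger_load_balance() -> the SCHED_SOFTIRQ handler -> load_balance()):

/* idle_pull.c - toy model of pull-based balancing: the idle CPU
 * does the work and takes a task; nobody pushes onto it.
 * Build: gcc -o idle_pull idle_pull.c
 */
#include <stdio.h>

#define NR_CPUS 4

static int rq_len[NR_CPUS] = { 0, 3, 1, 2 };  /* fake runqueue lengths */

/* Runs on an idle CPU: find the busiest runqueue and pull one task. */
static void idle_balance(int this_cpu)
{
    int busiest = -1, c;

    for (c = 0; c < NR_CPUS; c++)
        if (c != this_cpu && (busiest < 0 || rq_len[c] > rq_len[busiest]))
            busiest = c;

    if (busiest >= 0 && rq_len[busiest] > 1) {
        rq_len[busiest]--;    /* detach from the busy CPU */
        rq_len[this_cpu]++;   /* attach locally */
        printf("cpu%d pulled a task from cpu%d\n", this_cpu, busiest);
    }
}

int main(void)
{
    idle_balance(0);          /* cpu0 went idle, tries to pull */
    return 0;
}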

From kernel/sched.c:
/*
 * Trigger the SCHED_SOFTIRQ if it is time to do periodic load balancing.
 *
 * In case of CONFIG_NO_HZ, this is the place where we nominate a new
 * idle load balancing owner or decide to stop the periodic load balancing,
 * if the whole system is idle.
 */
static inline void trigger_load_balance(struct rq *rq, int cpu)
{
*snip* ...

On top of this, the balancing routine will first look at the scheduling group, then at the enclosing scheduling domain, and so on recursively until the global/topmost domain is reached.
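As a compilable toy model of that walk (my own sketch; compare with for_each_domain() in kernel/sched.c, which follows sd->parent in the same way):

/* domain_walk.c - balancing walks from the smallest domain (e.g. SMT
 * siblings) outwards to the topmost, system-wide domain.
 * Build: gcc -o domain_walk domain_walk.c
 */
#include <stdio.h>

struct sched_domain {
    const char *name;              /* e.g. "SIBLING", "CPU", "NODE" */
    struct sched_domain *parent;   /* enclosing, larger domain */
};

int main(void)
{
    struct sched_domain node = { "NODE", NULL };     /* topmost */
    struct sched_domain phys = { "CPU", &node };
    struct sched_domain smt  = { "SIBLING", &phys };
    struct sched_domain *sd;

    for (sd = &smt; sd; sd = sd->parent)
        printf("try to balance within domain %s\n", sd->name);

    return 0;
}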

> 
> Thanks in advance for your answer.
> Regards

Normal rules apply - I'm a newbie myself, so others may want to correct me :-)

-- 
Med vennlig hilsen - Yours Sincerely
Henrik Austad
