The patch titled
     CFS scheduler: v14
has been added to the -mm tree.  Its filename is
     cfs-scheduler-v14-rc2-mm1.patch

*** Remember to use Documentation/SubmitChecklist when testing your code ***
See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
out what to do about this

------------------------------------------------------
Subject: CFS scheduler: v14
From: Ingo Molnar <mingo@xxxxxxx>

CFS -v14:

 - increase sleeper-fairness (Mike Galbraith, me)

 - CFS documentation fixes (Pranith Kumar D)

 - increased the default rescheduling granularity to 3 msecs on UP,
   6 msecs on 2-way systems

 - small update_curr() precision fix

 - added an overview section to Documentation/sched-design-CFS.txt

Signed-off-by: Ingo Molnar <mingo@xxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 Documentation/sched-design-CFS.txt |   98 +++++++++++++++------------
 kernel/sched_fair.c                |   73 +++++++++++++++++---
 2 files changed, 117 insertions(+), 54 deletions(-)

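[Editorial aside, not part of the patch: the documentation hunk below describes
the p->wait_runtime bookkeeping in prose. A minimal user-space sketch of that
bookkeeping follows, under simplifying assumptions - equal nice levels, a fixed
tick, and a linear scan standing in for the time-ordered rbtree; the names
NTASKS, TICK_NS and pick_next() are made up for illustration.]

/*
 * Toy model of the wait_runtime accounting described in the new
 * sched-design-CFS.txt overview below.  NOT kernel code.
 */
#include <stdio.h>

#define NTASKS	3
#define TICK_NS	1000000LL		/* 1 msec per simulated tick */

static long long wait_runtime[NTASKS];	/* per-task p->wait_runtime, nanoseconds */

/* CFS runs the task with the 'gravest need' for CPU time. */
static int pick_next(void)
{
	int i, best = 0;

	for (i = 1; i < NTASKS; i++)
		if (wait_runtime[i] > wait_runtime[best])
			best = i;
	return best;
}

int main(void)
{
	int tick, curr = 0;

	for (tick = 0; tick < 12; tick++) {
		int i;

		/*
		 * The running task used TICK_NS of CPU but was only
		 * entitled to a 1/NTASKS share of it; the waiting
		 * tasks are owed their share of the same interval.
		 */
		for (i = 0; i < NTASKS; i++) {
			if (i == curr)
				wait_runtime[i] -= TICK_NS - TICK_NS / NTASKS;
			else
				wait_runtime[i] += TICK_NS / NTASKS;
		}
		curr = pick_next();
		printf("tick %2d: next=task%d  wait_runtime = %lld %lld %lld\n",
		       tick, curr, wait_runtime[0], wait_runtime[1],
		       wait_runtime[2]);
	}
	return 0;
}

[Compiled and run, the three tasks rotate through the CPU while their
wait_runtime values stay within roughly one tick of zero - the behaviour the
text below describes as splitting CPU time as close to 'ideal multitasking
hardware' as possible.]
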
diff -puN Documentation/sched-design-CFS.txt~cfs-scheduler-v14-rc2-mm1 Documentation/sched-design-CFS.txt
--- a/Documentation/sched-design-CFS.txt~cfs-scheduler-v14-rc2-mm1
+++ a/Documentation/sched-design-CFS.txt
@@ -1,20 +1,59 @@
-[announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
-i'm pleased to announce the first release of the "Modular Scheduler Core
-and Completely Fair Scheduler [CFS]" patchset:
+This is the CFS scheduler.
 
-   http://redhat.com/~mingo/cfs-scheduler/
+80% of CFS's design can be summed up in a single sentence: CFS basically
+models an "ideal, precise multi-tasking CPU" on real hardware.
 
-This project is a complete rewrite of the Linux task scheduler. My goal
-is to address various feature requests and to fix deficiencies in the
-vanilla scheduler that were suggested/found in the past few years, both
-for desktop scheduling and for server scheduling workloads.
+"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100%
+physical power and which can run each task at precise equal speed, in
+parallel, each at 1/nr_running speed. For example: if there are 2 tasks
+running then it runs each at 50% physical power - totally in parallel.
+
+On real hardware, we can run only a single task at once, so while that
+one task runs, the other tasks that are waiting for the CPU are at a
+disadvantage - the current task gets an unfair amount of CPU time. In
+CFS this fairness imbalance is expressed and tracked via the per-task
+p->wait_runtime (nanosec-unit) value. "wait_runtime" is the amount of
+time the task should now run on the CPU for it to become completely fair
+and balanced.
+
+( small detail: on 'ideal' hardware, the p->wait_runtime value would
+  always be zero - no task would ever get 'out of balance' from the
+  'ideal' share of CPU time. )
+
+CFS's task picking logic is based on this p->wait_runtime value and it
+is thus very simple: it always tries to run the task with the largest
+p->wait_runtime value. In other words, CFS tries to run the task with
+the 'gravest need' for more CPU time. So CFS always tries to split up
+CPU time between runnable tasks as close to 'ideal multitasking
+hardware' as possible.
+
+Most of the rest of CFS's design just falls out of this really simple
+concept, with a few add-on embellishments like nice levels,
+multiprocessing and various algorithm variants to recognize sleepers.
+
+In practice it works like this: the system runs a task a bit, and when
+the task schedules (or a scheduler tick happens) the task's CPU usage is
+'accounted for': the (small) time it just spent using the physical CPU
+is deducted from p->wait_runtime. [minus the 'fair share' it would have
+gotten anyway]. Once p->wait_runtime gets low enough so that another
+task becomes the 'leftmost task' of the time-ordered rbtree it maintains
+(plus a small amount of 'granularity' distance relative to the leftmost
+task so that we do not over-schedule tasks and trash the cache) then the
+new leftmost task is picked and the current task is preempted.
+
+The rq->fair_clock value tracks the 'CPU time a runnable task would have
+fairly gotten, had it been runnable during that time'. So by using
+rq->fair_clock values we can accurately timestamp and measure the
+'expected CPU time' a task should have gotten. All runnable tasks are
+sorted in the rbtree by the "rq->fair_clock - p->wait_runtime" key, and
+CFS picks the 'leftmost' task and sticks to it. As the system progresses
+forwards, newly woken tasks are put into the tree more and more to the
+right - slowly but surely giving a chance for every task to become the
+'leftmost task' and thus get on the CPU within a deterministic amount of
+time.
 
-[ QuickStart: apply the patch, recompile, reboot. The new scheduler
-  will be active by default and all tasks will default to the
-  SCHED_NORMAL interactive scheduling class. ]
-
-Highlights are:
+Some implementation details:
 
  - the introduction of Scheduling Classes: an extensible hierarchy of
    scheduler modules. These modules encapsulate scheduling policy
@@ -25,7 +64,7 @@ Highlights are:
    replacement for the vanilla scheduler's SCHED_OTHER interactivity
    code.
 
-   i'd like to give credit to Con Kolivas for the general approach here:
+   I'd like to give credit to Con Kolivas for the general approach here:
    he has proven via RSDL/SD that 'fair scheduling' is possible and that
    it results in better desktop scheduling. Kudos Con!
 
@@ -53,7 +92,7 @@ Highlights are:
    setting suitable for desktop workloads. SCHED_BATCH is handled by the
    CFS scheduler module too.
 
-   due to its design, the CFS scheduler is not prone to any of the
+   Due to its design, the CFS scheduler is not prone to any of the
    'attacks' that exist today against the heuristics of the stock
    scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
    work fine and do not impact interactivity and produce the expected
@@ -63,7 +102,7 @@ Highlights are:
    SCHED_BATCH: both types of workloads should be isolated much more
    agressively than under the vanilla scheduler.
 
-   ( another rdetail: due to nanosec accounting and timeline sorting,
+   ( another detail: due to nanosec accounting and timeline sorting,
     sched_yield() support is very simple under CFS, and in fact under
     CFS sched_yield() behaves much better than under any other
     scheduler i have tested so far. )
@@ -78,30 +117,3 @@ Highlights are:
    iterators of the scheduling modules are used. The balancing code got
    quite a bit simpler as a result.
 
-the core scheduler got smaller by more than 700 lines:
-
-   kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------
-   1 file changed, 372 insertions(+), 1082 deletions(-)
-
-and even adding all the scheduling modules, the total size impact is
-relatively small:
-
-   18 files changed, 1454 insertions(+), 1133 deletions(-)
-
-most of the increase is due to extensive comments. The kernel size
-impact is in fact a small negative:
-
-   text    data     bss     dec     hex filename
-  23366    4001      24   27391    6aff kernel/sched.o.vanilla
-  24159    2705      56   26920    6928 kernel/sched.o.CFS
-
-(this is mainly due to the benefit of getting rid of the expired array
-and its data structure overhead.)
-
-thanks go to Thomas Gleixner and Arjan van de Ven for review of this
-patchset.
-
-as usual, any sort of feedback, bugreports, fixes and suggestions are
-more than welcome,
-
-	Ingo
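
[Editorial aside, not part of the patch: the kernel/sched_fair.c hunks below add
two helpers, div64_s() and distribute_fair_add(). The following is a rough
user-space rendering of their arithmetic, not the kernel code - do_div() becomes
plain division, the runqueue shrinks to the two fields the calculation touches,
and the explicit fix-up of the currently running task is left out.]

/* User-space sketch of the signed divide and the O(1) redistribution idea. */
#include <stdio.h>

typedef long long s64;

/* Signed divide built on an unsigned 64-bit divide, as div64_s() does. */
static s64 div64_s(s64 dividend, unsigned long divisor)
{
	return dividend < 0 ? -(s64)((unsigned long long)-dividend / divisor)
			    : (s64)((unsigned long long)dividend / divisor);
}

struct toy_rq {
	unsigned long	nr_running;
	s64		fair_clock;
};

/*
 * Instead of touching every runnable task (O(N)), spread a woken
 * sleeper's new wait_runtime credit by moving the fair clock back
 * by the per-task share.
 */
static void distribute_fair_add(struct toy_rq *rq, s64 delta)
{
	if (rq->nr_running)
		rq->fair_clock -= div64_s(delta, rq->nr_running);
}

int main(void)
{
	struct toy_rq rq = { .nr_running = 4, .fair_clock = 1000000 };

	distribute_fair_add(&rq, 200000);	/* sleeper woke with +200 usec credit */
	printf("fair_clock is now %lld\n", rq.fair_clock);	/* 1000000 - 50000 */
	return 0;
}

[Moving rq->fair_clock back by the per-task share has roughly the same net
effect as debiting every runnable task individually, which is how the patch
keeps the sum of wait_runtime close to constant without an O(N) walk.]
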
diff -puN kernel/sched_fair.c~cfs-scheduler-v14-rc2-mm1 kernel/sched_fair.c
--- a/kernel/sched_fair.c~cfs-scheduler-v14-rc2-mm1
+++ a/kernel/sched_fair.c
@@ -16,11 +16,11 @@
  * number of CPUs. (i.e. factor 2x on 2-way systems, 3x on 4-way
  * systems, 4x on 8-way systems, 5x on 16-way systems, etc.)
  */
-unsigned int sysctl_sched_granularity __read_mostly = 1000000000/HZ;
+unsigned int sysctl_sched_granularity __read_mostly = 3000000000ULL/HZ;
 
 /*
  * Wake-up granularity.
- * (default: 1 msec, units: nanoseconds)
+ * (default: 0, units: nanoseconds)
  *
  * This option delays the preemption effects of decoupled workloads
  * and reduces their over-scheduling. Synchronous workloads will still
@@ -28,9 +28,9 @@ unsigned int sysctl_sched_granularity __
  */
 unsigned int sysctl_sched_wakeup_granularity __read_mostly = 0;
 
-unsigned int sysctl_sched_runtime_limit __read_mostly = 2000000000/HZ;
+unsigned int sysctl_sched_runtime_limit __read_mostly = 6000000000ULL/HZ;
 
-unsigned int sysctl_sched_load_smoothing __read_mostly = 0;
+unsigned int sysctl_sched_load_smoothing __read_mostly = 8 | 16 | 32;
 
 /*
  * sys_sched_yield unfairness bug workaround switch.
@@ -192,8 +192,8 @@ static inline void update_curr(struct rq
 	u64 delta_exec, delta_fair, delta_mine;
 	struct task_struct *curr = rq->curr;
 
-	if (curr->sched_class != &fair_sched_class || curr == rq->idle
-			|| !curr->on_rq)
+	if (curr->sched_class != &fair_sched_class ||
+			!rq->raw_weighted_load || curr == rq->idle)
 		return;
 	/*
 	 * Get the amount of time the current task was running
@@ -384,6 +384,50 @@ update_stats_curr_end(struct rq *rq, str
 	p->exec_start = 0;
 }
 
+long div64_s(s64 divident, unsigned long divisor)
+{
+	u64 tmp;
+
+	if (divident < 0) {
+		tmp = -divident;
+		do_div(tmp, divisor);
+		return -(s64)tmp;
+	} else {
+		tmp = divident;
+		do_div(tmp, divisor);
+		return (s64)tmp;
+	}
+}
+
+/*
+ * A task gets added back to the runnable tasks and gets
+ * a small credit for the CPU time it missed out on while
+ * it slept, so fix up all other runnable task's wait_runtime
+ * so that the sum stays constant (around 0).
+ *
+ * Instead of looping over all runnable tasks in an O(N)
+ * manner we move the fair clock back by a proportional
+ * amount of the new wait_runtime this task adds to the pool.
+ */
+static void distribute_fair_add(struct rq *rq, s64 delta)
+{
+	struct task_struct *curr = rq->curr;
+	s64 delta_fair = 0;
+
+	if (!(sysctl_sched_load_smoothing & 32))
+		return;
+
+	if (rq->nr_running) {
+		delta_fair = div64_s(delta, rq->nr_running);
+		/*
+		 * The currently running task's next wait_runtime value does
+		 * not depend on the fair_clock, so fix it up explicitly:
+		 */
+		add_wait_runtime(rq, curr, -delta_fair);
+		rq->fair_clock -= delta_fair;
+	}
+}
+
 /**************************************************************/
 /* Scheduling class queueing methods: */
 
@@ -391,7 +435,7 @@ update_stats_curr_end(struct rq *rq, str
 static void enqueue_sleeper(struct rq *rq, struct task_struct *p)
 {
 	unsigned long load = get_rq_load(rq);
-	u64 delta_fair = 0;
+	s64 delta_fair, prev_runtime;
 
 	if (!(sysctl_sched_load_smoothing & 16))
 		goto out;
@@ -404,12 +448,19 @@ static void enqueue_sleeper(struct rq *r
 	 * Fix up delta_fair with the effect of us running
 	 * during the whole sleep period:
 	 */
-	if (sysctl_sched_load_smoothing & 8) {
-		delta_fair = delta_fair * load;
-		do_div(delta_fair, load + p->load_weight);
-	}
+	if (sysctl_sched_load_smoothing & 8)
+		delta_fair = div64_s(delta_fair * load, load + p->load_weight);
 
+	prev_runtime = p->wait_runtime;
 	__add_wait_runtime(rq, p, delta_fair);
+	delta_fair = p->wait_runtime - prev_runtime;
+
+	/*
+	 * We move the fair clock back by a load-proportional
+	 * amount of the new wait_runtime this task adds to
+	 * the 'pool':
+	 */
+	distribute_fair_add(rq, delta_fair);
 out:
 	rq->wait_runtime += p->wait_runtime;
_

Patches currently in -mm which might be from mingo@xxxxxxx are

git-kvm.patch
prevent-going-idle-with-softirq-pending.patch
fix-crash-with-irqpoll-due-to-the-irqf_irqpoll-flag.patch
only-allow-nonlinear-vmas-for-ram-backed-filesystems.patch
cpuset-remove-sched-domain-hooks-from-cpusets.patch
introduce-write_trylock_irqsave.patch
use-write_trylock_irqsave-in-ptrace_attach.patch
fix-stop_machine_run-problem-with-naughty-real-time-process.patch
cpu-hotplug-fix-ksoftirqd-termination-on-cpu-hotplug-with-naughty-realtime-process.patch
cpu-hotplug-fix-ksoftirqd-termination-on-cpu-hotplug-with-naughty-realtime-process-fix.patch
pie-randomization.patch
cfs-scheduler.patch
cfs-scheduler-vs-detach-schedh-from-mmh.patch
cfs-scheduler-v14-rc2-mm1.patch
cfs-scheduler-warning-fixes.patch
sched-add-above-background-load-function.patch
mm-implement-swap-prefetching.patch
detect-atomic-counter-underflows.patch
make-frame_pointer-default=y.patch
mutex-subsystem-synchro-test-module.patch
vdso-print-fatal-signals.patch
lockdep-show-held-locks-when-showing-a-stackdump.patch
kmap_atomic-debugging.patch
random-warning-squishes.patch