Hi, I've ported some parts of the latest 2.5.6x scheduler to 2.4.20-ck4. I also included variable-hz again for 1000 Hz (to match 2.5), as well as sched-tunables. I'm not sure how correct it is, but it seems to work well. I made these against ck4-rmap15d with the rmap15e incremental patch, ignoring the elevator.h unpatches in the 15e incremental. Contest benchmarks in another email. A few rough userspace sketches for poking at the new knobs are appended after the patch. diff -ruNp a/Documentation/Configure.help b/Documentation/Configure.help --- a/Documentation/Configure.help 2003-04-03 21:31:34.000000000 -0800 +++ b/Documentation/Configure.help 2003-04-03 23:15:05.000000000 -0800 @@ -2439,6 +2439,18 @@ CONFIG_HEARTBEAT behaviour is platform-dependent, but normally the flash frequency is a hyperbolic function of the 5-minute load average. +Timer frequency +CONFIG_HZ + The frequency at which the system timer interrupt fires. Higher tick + values provide improved granularity of timers, improved select() and + poll() performance, and lower scheduling latency. Higher values, however, + increase interrupt overhead and cause the jiffies counter to wrap sooner. + For compatibility, the tick count is always exported as if HZ=100. + + The default value, which was the value for all of eternity, is 100. If + you are looking to provide better timer granularity or increased desktop + performance, try 500 or 1000. If unsure, go with the default of 100. + Networking support CONFIG_NET Unless you really know what you are doing, you should say Y here. diff -ruNp a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt --- a/Documentation/filesystems/proc.txt 2002-12-09 02:24:08.000000000 -0800 +++ b/Documentation/filesystems/proc.txt 2003-04-03 23:10:53.000000000 -0800 @@ -37,6 +37,7 @@ Table of Contents 2.8 /proc/sys/net/ipv4 - IPV4 settings 2.9 Appletalk 2.10 IPX + 2.11 /proc/sys/sched - scheduler tunables ------------------------------------------------------------------------------ Preface @@ -1779,6 +1780,92 @@ The /proc/net/ipx_route table holds a gives the destination network, the router node (or Directly) and the network address of the router (or Connected) for internal networks. +2.11 /proc/sys/sched - scheduler tunables +----------------------------------------- + +Useful knobs for tuning the scheduler live in /proc/sys/sched. + +child_penalty +------------- + +Percentage of the parent's sleep_avg that children inherit. sleep_avg is +a running average of the time a process spends sleeping. Tasks with high +sleep_avg values are considered interactive and given a higher dynamic +priority and a larger timeslice. You typically want this to be some value +just under 100. + +exit_weight +----------- + +When a CPU hog task exits, its parent's sleep_avg is reduced by a factor of +exit_weight against the exiting task's sleep_avg. + +interactive_delta +----------------- + +If a task is "interactive" it is reinserted into the active array after it +has expired its timeslice, instead of being inserted into the expired array. +How interactive a task must be in order to receive this treatment is a +function of its nice value. This interactive limit is scaled linearly by nice +value and is offset by the interactive_delta. + +max_sleep_avg +------------- + +max_sleep_avg is the largest value (in ms) stored for a task's running sleep +average. The larger this value, the longer a task needs to sleep to be +considered interactive (maximum interactive bonus is a function of +max_sleep_avg). + +max_timeslice +------------- + +Maximum timeslice, in milliseconds.
This is the value given to tasks of the +highest dynamic priority. + +min_timeslice +------------- + +Minimum timeslice, in milliseconds. This is the value given to tasks of the +lowest dynamic priority. Every task gets at least this slice of the processor +per array switch. + +parent_penalty +-------------- + +Percentage of the parent's sleep_avg that it retains across a fork(). +sleep_avg is a running average of the time a process spends sleeping. Tasks +with high sleep_avg values are considered interactive and given a higher +dynamic priority and a larger timeslice. Normally, this value is 100 and thus +tasks retain their sleep_avg on fork. If you want to punish interactive +tasks for forking, set this below 100. + +prio_bonus_ratio +---------------- + +Middle percentage of the priority range that tasks can receive as a dynamic +priority bonus. The default value of 25% ensures that nice values at the +extremes are still enforced. For example, nice +19 interactive tasks will +never be able to preempt a nice 0 CPU hog. Setting this higher will increase +the size of the priority range the tasks can receive as a bonus. Setting +this lower will decrease this range, making the interactivity bonus less +apparent and user nice values more applicable. + +starvation_limit +---------------- + +Sufficiently interactive tasks are reinserted into the active array when they +run out of timeslice. Normally, tasks are inserted into the expired array. +Reinserting interactive tasks into the active array allows them to remain +runnable, which is important to interactive performance. This could starve +expired tasks, however, since the interactive task could prevent the array +switch. To prevent the tasks on the expired array from starving for too long, +starvation_limit is the longest time (in ms) we will let the expired array +starve at the expense of reinserting interactive tasks back into the active +array. Higher values here give more preference to running interactive tasks, +at the expense of expired tasks. Lower values provide fairer scheduling +behavior, at the expense of interactivity. + ------------------------------------------------------------------------------ Summary ------------------------------------------------------------------------------ diff -ruNp a/arch/i386/config.in b/arch/i386/config.in --- a/arch/i386/config.in 2003-04-03 21:33:54.000000000 -0800 +++ b/arch/i386/config.in 2003-04-03 23:15:05.000000000 -0800 @@ -240,6 +240,7 @@ endmenu mainmenu_option next_comment comment 'General setup' +int 'Timer frequency (HZ)' CONFIG_HZ 1000 bool 'Networking support' CONFIG_NET # Visual Workstation support is utterly broken. diff -ruNp a/fs/proc/array.c b/fs/proc/array.c --- a/fs/proc/array.c 2003-04-03 21:33:54.000000000 -0800 +++ b/fs/proc/array.c 2003-04-03 23:15:05.000000000 -0800 @@ -360,15 +360,15 @@ int proc_pid_stat(struct task_struct *ta task->cmin_flt, task->maj_flt, task->cmaj_flt, - task->times.tms_utime, - task->times.tms_stime, - task->times.tms_cutime, - task->times.tms_cstime, + jiffies_to_clock_t(task->times.tms_utime), + jiffies_to_clock_t(task->times.tms_stime), + jiffies_to_clock_t(task->times.tms_cutime), + jiffies_to_clock_t(task->times.tms_cstime), priority, nice, 0UL /* removed */, - task->it_real_value, - task->start_time, + jiffies_to_clock_t(task->it_real_value), + jiffies_to_clock_t(task->start_time), vsize, mm ?
mm->rss : 0, /* you might want to shift this left 3 */ task->rlim[RLIMIT_RSS].rlim_cur, @@ -687,14 +687,14 @@ int proc_pid_cpu(struct task_struct *tas len = sprintf(buffer, "cpu %lu %lu\n", - task->times.tms_utime, - task->times.tms_stime); + jiffies_to_clock_t(task->times.tms_utime), + jiffies_to_clock_t(task->times.tms_stime)); for (i = 0 ; i < smp_num_cpus; i++) len += sprintf(buffer + len, "cpu%d %lu %lu\n", i, - task->per_cpu_utime[cpu_logical_map(i)], - task->per_cpu_stime[cpu_logical_map(i)]); + jiffies_to_clock_t(task->per_cpu_utime[cpu_logical_map(i)]), + jiffies_to_clock_t(task->per_cpu_stime[cpu_logical_map(i)])); return len; } diff -ruNp a/fs/proc/proc_misc.c b/fs/proc/proc_misc.c --- a/fs/proc/proc_misc.c 2003-04-03 21:33:54.000000000 -0800 +++ b/fs/proc/proc_misc.c 2003-04-03 23:15:05.000000000 -0800 @@ -316,16 +316,16 @@ static int kstat_read_proc(char *page, c { int i, len = 0; extern unsigned long total_forks; - unsigned long jif = jiffies; + unsigned long jif = jiffies_to_clock_t(jiffies); unsigned int sum = 0, user = 0, nice = 0, system = 0; int major, disk; for (i = 0 ; i < smp_num_cpus; i++) { int cpu = cpu_logical_map(i), j; - user += kstat.per_cpu_user[cpu]; - nice += kstat.per_cpu_nice[cpu]; - system += kstat.per_cpu_system[cpu]; + user += jiffies_to_clock_t(kstat.per_cpu_user[cpu]); + nice += jiffies_to_clock_t(kstat.per_cpu_nice[cpu]); + system += jiffies_to_clock_t(kstat.per_cpu_system[cpu]); #if !defined(CONFIG_ARCH_S390) for (j = 0 ; j < NR_IRQS ; j++) sum += kstat.irqs[cpu][j]; @@ -339,10 +339,10 @@ static int kstat_read_proc(char *page, c proc_sprintf(page, &off, &len, "cpu%d %u %u %u %lu\n", i, - kstat.per_cpu_user[cpu_logical_map(i)], - kstat.per_cpu_nice[cpu_logical_map(i)], - kstat.per_cpu_system[cpu_logical_map(i)], - jif - ( kstat.per_cpu_user[cpu_logical_map(i)] \ + jiffies_to_clock_t(kstat.per_cpu_user[cpu_logical_map(i)]), + jiffies_to_clock_t(kstat.per_cpu_nice[cpu_logical_map(i)]), + jiffies_to_clock_t(kstat.per_cpu_system[cpu_logical_map(i)]), + jif - jiffies_to_clock_t(kstat.per_cpu_user[cpu_logical_map(i)] \ + kstat.per_cpu_nice[cpu_logical_map(i)] \ + kstat.per_cpu_system[cpu_logical_map(i)])); proc_sprintf(page, &off, &len, diff -ruNp a/include/asm-i386/param.h b/include/asm-i386/param.h --- a/include/asm-i386/param.h 2000-10-27 11:04:43.000000000 -0700 +++ b/include/asm-i386/param.h 2003-04-03 23:15:05.000000000 -0800 @@ -1,8 +1,17 @@ #ifndef _ASMi386_PARAM_H #define _ASMi386_PARAM_H +#include <linux/config.h> + +#ifdef __KERNEL__ +# define HZ CONFIG_HZ /* internal kernel timer frequency */ +# define USER_HZ 100 /* some user interfaces are in ticks */ +# define CLOCKS_PER_SEC (USER_HZ) /* like times() */ +# define jiffies_to_clock_t(x) ((x) / ((HZ) / (USER_HZ))) +#endif + #ifndef HZ -#define HZ 100 +#define HZ 100 /* if userspace cheats, give them 100 */ #endif #define EXEC_PAGESIZE 4096 @@ -17,8 +26,4 @@ #define MAXHOSTNAMELEN 64 /* max length of hostname */ -#ifdef __KERNEL__ -# define CLOCKS_PER_SEC 100 /* frequency at which times() counts */ -#endif - #endif diff -ruNp a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h 2003-04-03 21:33:54.000000000 -0800 +++ b/include/linux/sched.h 2003-04-03 23:10:53.000000000 -0800 @@ -356,7 +356,7 @@ struct task_struct { prio_array_t *array; unsigned long sleep_avg; - unsigned long sleep_timestamp; + unsigned long last_run; unsigned long policy; unsigned long cpus_allowed; @@ -387,6 +387,7 @@ struct task_struct { * older sibling, respectively. 
(p->father can be replaced with * p->p_pptr->pid) */ + struct task_struct *parent; task_t *p_opptr, *p_pptr, *p_cptr, *p_ysptr, *p_osptr; struct list_head thread_group; diff -ruNp a/include/linux/sysctl.h b/include/linux/sysctl.h --- a/include/linux/sysctl.h 2003-04-03 21:33:54.000000000 -0800 +++ b/include/linux/sysctl.h 2003-04-03 23:10:53.000000000 -0800 @@ -63,7 +63,8 @@ enum CTL_DEV=7, /* Devices */ CTL_BUS=8, /* Busses */ CTL_ABI=9, /* Binary emulation */ - CTL_CPU=10 /* CPU stuff (speed scaling, etc) */ + CTL_CPU=10, /* CPU stuff (speed scaling, etc) */ + CTL_SCHED=11, /* scheduler tunables */ }; /* CTL_BUS names: */ @@ -148,6 +149,19 @@ enum VM_PAGEBUF=14, /* struct: Control pagebuf parameters */ }; +/* Tunable scheduler parameters in /proc/sys/sched/ */ +enum +{ + SCHED_MIN_TIMESLICE=1, /* minimum process timeslice */ + SCHED_MAX_TIMESLICE=2, /* maximum process timeslice */ + SCHED_CHILD_PENALTY=3, /* penalty on fork to child */ + SCHED_PARENT_PENALTY=4, /* penalty on fork to parent */ + SCHED_EXIT_WEIGHT=5, /* penalty to parent of CPU hog child */ + SCHED_PRIO_BONUS_RATIO=6, /* percent of max prio given as bonus */ + SCHED_INTERACTIVE_DELTA=7, /* delta used to scale interactivity */ + SCHED_MAX_SLEEP_AVG=8, /* maximum sleep avg attainable */ + SCHED_STARVATION_LIMIT=9, /* no re-active if expired is starved */ +}; /* CTL_NET names: */ enum diff -ruNp a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c 2003-04-03 21:33:54.000000000 -0800 +++ b/kernel/fork.c 2003-04-03 23:10:53.000000000 -0800 @@ -727,7 +727,7 @@ int do_fork(unsigned long clone_flags, u current->time_slice = 1; scheduler_tick(0,0); } - p->sleep_timestamp = jiffies; + p->last_run = jiffies; __sti(); /* diff -ruNp a/kernel/sched.c b/kernel/sched.c --- a/kernel/sched.c 2003-04-03 21:33:54.000000000 -0800 +++ b/kernel/sched.c 2003-04-03 23:10:53.000000000 -0800 @@ -52,15 +52,26 @@ * maximum timeslice is 300 msecs. Timeslices get refilled after * they expire. */ -#define MIN_TIMESLICE ( 10 * HZ / 1000 ) -#define MAX_TIMESLICE ( 1000 * HZ / 1000 ) -#define CHILD_PENALTY 95 -#define PARENT_PENALTY 100 -#define EXIT_WEIGHT 3 -#define PRIO_BONUS_RATIO 15 -#define INTERACTIVE_DELTA 4 -#define MAX_SLEEP_AVG (2*HZ) -#define STARVATION_LIMIT (3*HZ) +int min_timeslice = ((5 * HZ) / 1000 ?: 1); +int max_timeslice = (200 * HZ) / 1000; +int child_penalty = 50; +int parent_penalty = 100; +int exit_weight = 3; +int prio_bonus_ratio = 25; +int interactive_delta = 2; +int max_sleep_avg = 10 * HZ; +int starvation_limit = 10 * HZ; + +#define MIN_TIMESLICE (min_timeslice) +#define MAX_TIMESLICE (max_timeslice) +#define CHILD_PENALTY (child_penalty) +#define PARENT_PENALTY (parent_penalty) +#define EXIT_WEIGHT (exit_weight) +#define PRIO_BONUS_RATIO (prio_bonus_ratio) +#define INTERACTIVE_DELTA (interactive_delta) +#define MAX_SLEEP_AVG (max_sleep_avg) +#define STARVATION_LIMIT (starvation_limit) +#define TIMESLICE_GRANULARITY (HZ/20 ?: 1) /* * If a task is 'interactive' then we reinsert it in the active @@ -115,14 +126,19 @@ * downside in using shorter timeslices. 
*/ -static inline unsigned int task_timeslice(task_t *p) +#define BASE_TIMESLICE(p) \ + (MAX_TIMESLICE * (MAX_PRIO-(p)->static_prio)/MAX_USER_PRIO) + +static unsigned int task_timeslice(task_t *p) { - if (p->policy == SCHED_BATCH) - return MAX_TIMESLICE; - else - return MIN_TIMESLICE; -} + unsigned int time_slice = BASE_TIMESLICE(p); + + if (time_slice < MIN_TIMESLICE) + time_slice = MIN_TIMESLICE; + return time_slice; +} + /* * These are the runqueue data structures: */ @@ -149,6 +165,7 @@ struct runqueue { unsigned long nr_running, nr_switches, expired_timestamp, nr_uninterruptible; task_t *curr, *idle; + struct mm_struct *prev_mm; prio_array_t *active, *expired, arrays[2]; int prev_nr_running[NR_CPUS]; @@ -191,6 +208,10 @@ static struct runqueue runqueues[NR_CPUS # define task_running(rq, p) ((rq)->curr == (p)) #endif +# define nr_running_init(rq) do { } while (0) +# define nr_running_inc(rq) do { (rq)->nr_running++; } while (0) +# define nr_running_dec(rq) do { (rq)->nr_running--; } while (0) + /* * task_rq_lock - lock the runqueue a given task resides on and disable * interrupts. Note the ordering: we can safely lookup the task_rq without @@ -273,6 +294,9 @@ static inline int effective_prio(task_t * * Both properties are important to certain workloads. */ + if (rt_task(p)) + return p->prio; + bonus = MAX_USER_PRIO*PRIO_BONUS_RATIO*p->sleep_avg/MAX_SLEEP_AVG/100 - MAX_USER_PRIO*PRIO_BONUS_RATIO/100/2; @@ -284,27 +308,58 @@ static inline int effective_prio(task_t return prio; } -static inline void activate_task(task_t *p, runqueue_t *rq) +static inline void __activate_task(task_t *p, runqueue_t *rq) { - unsigned long sleep_time = jiffies - p->sleep_timestamp; - prio_array_t *array = rq->active; + enqueue_task(p, rq->active); + nr_running_inc(rq); +} - if (!rt_task(p) && sleep_time) { - /* - * This code gives a bonus to interactive tasks. We update - * an 'average sleep time' value here, based on - * sleep_timestamp. The more time a task spends sleeping, - * the higher the average gets - and the higher the priority - * boost gets as well. - */ - p->sleep_avg += sleep_time; - if (p->sleep_avg > MAX_SLEEP_AVG) - p->sleep_avg = MAX_SLEEP_AVG; - p->prio = effective_prio(p); +static inline int activate_task(task_t *p, runqueue_t *rq) +{ + long sleep_time = jiffies - p->last_run - 1; + int requeue_waker = 0; + + if (sleep_time > 0) { + int sleep_avg; + + /* + * This code gives a bonus to interactive tasks. + * + * The boost works by updating the 'average sleep time' + * value here, based on ->last_run. The more time a task + * spends sleeping, the higher the average gets - and the + * higher the priority boost gets as well. + */ + sleep_avg = p->sleep_avg + sleep_time; + + /* + * 'Overflow' bonus ticks go to the waker as well, so the + * ticks are not lost. This has the effect of further + * boosting tasks that are related to maximum-interactive + * tasks. 
+ */ + if (sleep_avg > MAX_SLEEP_AVG) { + if (!in_interrupt()) { + sleep_avg += current->sleep_avg - MAX_SLEEP_AVG; + if (sleep_avg > MAX_SLEEP_AVG) + sleep_avg = MAX_SLEEP_AVG; + + if (current->sleep_avg != sleep_avg) { + current->sleep_avg = sleep_avg; + requeue_waker = 1; + } + } + sleep_avg = MAX_SLEEP_AVG; + } + if (p->sleep_avg != sleep_avg) { + p->sleep_avg = sleep_avg; + p->prio = effective_prio(p); } - enqueue_task(p, array); - rq->nr_running++; } + __activate_task(p, rq); + + return requeue_waker; +} static inline void activate_batch_task(task_t *p, runqueue_t *rq) { @@ -316,7 +371,7 @@ static inline void activate_batch_task(t static inline void deactivate_task(struct task_struct *p, runqueue_t *rq) { - rq->nr_running--; + nr_running_dec(rq); if (p->state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible++; dequeue_task(p, p->array); @@ -378,7 +433,7 @@ static inline void resched_task(task_t * * ptrace() code. */ void wait_task_inactive(task_t * p) - { +{ unsigned long flags; runqueue_t *rq; @@ -419,23 +474,8 @@ repeat: */ void kick_if_running(task_t * p) { - if (task_running(task_rq(p), p) && (p->cpu != smp_processor_id())) + if (task_running(task_rq(p), p) && (task_cpu(p) != smp_processor_id())) resched_task(p); - /* - * If batch processes get signals but are not running currently - * then give them a chance to handle the signal. (the kernel - * side signal handling code will run for sure, the userspace - * part depends on system load and might be delayed indefinitely.) - */ - if (p->policy == SCHED_BATCH) { - unsigned long flags; - runqueue_t *rq; - - rq = task_rq_lock(p, &flags); - if (p->flags & PF_BATCH) - activate_batch_task(p, rq); - task_rq_unlock(rq, &flags); - } } /* @@ -449,70 +489,99 @@ void kick_if_running(task_t * p) * returns failure only if the task is already active. */ -static int try_to_wake_up(task_t * p, int sync) +static int try_to_wake_up(task_t * p, unsigned int state, int sync) { + int success = 0, requeue_waker = 0; unsigned long flags; - int success = 0; long old_state; runqueue_t *rq; repeat_lock_task: rq = task_rq_lock(p, &flags); old_state = p->state; - if (!p->array) { - /* - * Fast-migrate the task if it's not running or runnable - * currently. Do not violate hard affinity. - */ - if (unlikely(sync && !task_running(rq, p) && - (task_cpu(p) != smp_processor_id()) && - (p->cpus_allowed & (1UL << smp_processor_id())))) { - - set_task_cpu(p, smp_processor_id()); + if (old_state & state) { + if (!p->array) { + /* + * Fast-migrate the task if it's not running or runnable + * currently. Do not violate hard affinity. 
+ */ + if (unlikely(sync && !task_running(rq, p) && + (task_cpu(p) != smp_processor_id()) && + (p->cpus_allowed & (1UL << smp_processor_id())))) { + + set_task_cpu(p, smp_processor_id()); + + task_rq_unlock(rq, &flags); + goto repeat_lock_task; + } + if (old_state == TASK_UNINTERRUPTIBLE) + rq->nr_uninterruptible--; - task_rq_unlock(rq, &flags); - goto repeat_lock_task; + if (sync) + __activate_task(p, rq); + else { + requeue_waker = activate_task(p, rq); + if (p->prio < rq->curr->prio) + resched_task(rq->curr); + } + success = 1; } - if (old_state == TASK_UNINTERRUPTIBLE) - rq->nr_uninterruptible--; - activate_task(p, rq); - - if (p->prio < rq->curr->prio || rq->curr->policy == SCHED_BATCH) - resched_task(rq->curr); - success = 1; + p->state = TASK_RUNNING; } - p->state = TASK_RUNNING; task_rq_unlock(rq, &flags); + /* + * We have to do this outside the other spinlock, the two + * runqueues might be different: + */ + if (requeue_waker) { + prio_array_t *array; + + rq = task_rq_lock(current, &flags); + array = current->array; + dequeue_task(current, array); + current->prio = effective_prio(current); + enqueue_task(current, array); + task_rq_unlock(rq, &flags); + } + return success; } int wake_up_process(task_t * p) { - return try_to_wake_up(p, 0); + return try_to_wake_up(p, TASK_STOPPED | TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0); } void wake_up_forked_process(task_t * p) { - runqueue_t *rq ; + runqueue_t *rq; + unsigned long flags; preempt_disable(); - rq = this_rq_lock(); + + rq = task_rq_lock(current, &flags); p->state = TASK_RUNNING; - if (!rt_task(p)) { - /* - * We decrease the sleep average of forking parents - * and children as well, to keep max-interactive tasks - * from forking tasks that are max-interactive. - */ - current->sleep_avg = current->sleep_avg * PARENT_PENALTY / 100; - p->sleep_avg = p->sleep_avg * CHILD_PENALTY / 100; - p->prio = effective_prio(p); -} + /* + * We decrease the sleep average of forking parents + * and children as well, to keep max-interactive tasks + * from forking tasks that are max-interactive. + */ + current->sleep_avg = current->sleep_avg * PARENT_PENALTY / 100; + p->sleep_avg = p->sleep_avg * CHILD_PENALTY / 100; + p->prio = effective_prio(p); set_task_cpu(p, smp_processor_id()); - activate_task(p, rq); - rq_unlock(rq); + if (unlikely(!current->array)) + __activate_task(p, rq); + else { + p->prio = current->prio; + list_add_tail(&p->run_list, &current->run_list); + p->array = current->array; + p->array->nr_active++; + nr_running_inc(rq); + } + task_rq_unlock(rq, &flags); preempt_enable(); } @@ -527,13 +596,15 @@ void wake_up_forked_process(task_t * p) */ void sched_exit(task_t * p) { - __cli(); + unsigned long flags; + + local_irq_save(flags); if (p->first_time_slice) { current->time_slice += p->time_slice; if (unlikely(current->time_slice > MAX_TIMESLICE)) current->time_slice = MAX_TIMESLICE; } - __sti(); + local_irq_restore(flags); /* * If the child was a (relative-) CPU hog then decrease * the sleep_avg of the parent as well. @@ -550,7 +621,7 @@ asmlinkage void schedule_tail(task_t *pr } #endif -static inline task_t * context_switch(task_t *prev, task_t *next) +static inline task_t * context_switch(runqueue_t *rq, task_t *prev, task_t *next) { struct mm_struct *mm = next->mm; struct mm_struct *oldmm = prev->active_mm; @@ -564,7 +635,7 @@ static inline task_t * context_switch(ta if (unlikely(!prev->mm)) { prev->active_mm = NULL; - mmdrop(oldmm); + rq->prev_mm = oldmm; } /* Here we just switch the register state and the stack.
*/ @@ -824,9 +895,9 @@ static inline runqueue_t *find_busiest_q static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) { dequeue_task(p, src_array); - src_rq->nr_running--; + nr_running_dec(src_rq); set_task_cpu(p, this_cpu); - this_rq->nr_running++; + nr_running_inc(this_rq); enqueue_task(p, this_rq->active); /* * Note that idle threads have a prio of MAX_PRIO, for this test @@ -834,6 +905,11 @@ static inline void pull_task(runqueue_t */ if (p->prio < this_rq->curr->prio) set_need_resched(); + else { + if (p->prio == this_rq->curr->prio && + p->time_slice > this_rq->curr->time_slice) + set_need_resched(); + } } /* @@ -896,7 +972,7 @@ skip_queue: */ #define CAN_MIGRATE_TASK(p,rq,this_cpu) \ - ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ + ((idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \ !task_running(rq, p) && \ ((p)->cpus_allowed & (1UL << (this_cpu)))) @@ -954,9 +1030,9 @@ static inline void idle_tick(runqueue_t * increasing number of running tasks: */ #define EXPIRED_STARVING(rq) \ - ((rq)->expired_timestamp && \ + (STARVATION_LIMIT && ((rq)->expired_timestamp && \ (jiffies - (rq)->expired_timestamp >= \ - STARVATION_LIMIT * ((rq)->nr_running) + 1)) + STARVATION_LIMIT * ((rq)->nr_running) + 1 ))) /* * This function gets called by the timer code, with HZ frequency. @@ -985,7 +1061,7 @@ void scheduler_tick(int user_ticks, int } } - if (p == rq->idle || p->policy == SCHED_BATCH) + if (p == rq->idle) rq->idle_count++; #endif if (p == rq->idle) { @@ -996,7 +1072,7 @@ void scheduler_tick(int user_ticks, int #endif return; } - if (TASK_NICE(p) > 0 || p->policy == SCHED_BATCH) + if (TASK_NICE(p) > 0) kstat.per_cpu_nice[cpu] += user_ticks; else kstat.per_cpu_user[cpu] += user_ticks; @@ -1008,6 +1084,17 @@ void scheduler_tick(int user_ticks, int return; } spin_lock(&rq->lock); + /* + * The task was running during this tick - update the + * time slice counter and the sleep average. Note: we + * do not update a process's priority until it either + * goes to sleep or uses up its timeslice. This makes + * it possible for interactive tasks to use up their + * timeslices at their highest priority levels. + */ + if (p->sleep_avg) + p->sleep_avg--; + if (unlikely(rt_task(p))) { /* * RR tasks need a special form of timeslice management. @@ -1024,16 +1111,6 @@ void scheduler_tick(int user_ticks, int } goto out; } - /* - * The task was running during this tick - update the - * time slice counter and the sleep average. Note: we - * do not update a process's priority until it either - * goes to sleep or uses up its timeslice. This makes - * it possible for interactive tasks to use up their - * timeslices at their highest priority levels. - */ - if (p->sleep_avg) - p->sleep_avg--; if (!--p->time_slice) { dequeue_task(p, rq->active); set_tsk_need_resched(p); @@ -1047,6 +1124,28 @@ void scheduler_tick(int user_ticks, int enqueue_task(p, rq->expired); } else enqueue_task(p, rq->active); + } else { + /* + * Prevent a too long timeslice allowing a task to monopolize + * the CPU. We do this by splitting up the timeslice into + * smaller pieces. + * + * Note: this does not mean the task's timeslices expire or + * get lost in any way, they just might be preempted by + * another task of equal priority. (one with higher + * priority would have preempted this task already.) We + * requeue this task to the end of the list on this priority + * level, which is in essence a round-robin of tasks with + * equal priority. 
+ */ + if (!(p->time_slice % TIMESLICE_GRANULARITY) && + (p->array == rq->active)) { + dequeue_task(p, rq->active); + set_tsk_need_resched(p); + p->prio = effective_prio(p); + enqueue_task(p, rq->active); + } + } out: #if CONFIG_SMP @@ -1107,7 +1206,7 @@ need_resched: rq = this_rq(); release_kernel_lock(prev, smp_processor_id()); - prev->sleep_timestamp = jiffies; + prev->last_run = jiffies; spin_lock_irq(&rq->lock); /* @@ -1173,7 +1272,7 @@ switch_tasks: rq->curr = next; prepare_arch_switch(rq, next); - prev = context_switch(prev, next); + prev = context_switch(rq, prev, next); barrier(); rq = this_rq(); finish_arch_switch(rq, prev); @@ -1230,7 +1337,7 @@ static inline void __wake_up_common(wait curr = list_entry(tmp, wait_queue_t, task_list); p = curr->task; state = p->state; - if ((state & mode) && try_to_wake_up(p, sync) && + if ((state & mode) && try_to_wake_up(p, state, sync) && ((curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)) break; } @@ -1443,7 +1550,7 @@ asmlinkage long sys_nice(int increment) */ int task_prio(task_t *p) { - return p->prio - MAX_USER_RT_PRIO; + return p->prio - MAX_RT_PRIO; } int task_nice(task_t *p) @@ -1536,7 +1643,7 @@ static int setscheduler(pid_t pid, int p else p->prio = p->static_prio; if (array) - activate_task(p, task_rq(p)); + __activate_task(p, task_rq(p)); out_unlock: task_rq_unlock(rq, &flags); @@ -2221,7 +2328,7 @@ void __init sched_init(void) rq->curr = current; rq->idle = current; set_task_cpu(current, smp_processor_id()); - wake_up_process(current); + wake_up_forked_process(current); init_timervecs(); init_bh(TIMER_BH, timer_bh); diff -ruNp a/kernel/signal.c b/kernel/signal.c --- a/kernel/signal.c 2003-04-03 21:33:54.000000000 -0800 +++ b/kernel/signal.c 2003-04-03 23:15:05.000000000 -0800 @@ -13,7 +13,7 @@ #include <linux/smp_lock.h> #include <linux/init.h> #include <linux/sched.h> - +#include <asm/param.h> #include <asm/uaccess.h> /* @@ -775,8 +775,8 @@ void do_notify_parent(struct task_struct info.si_uid = tsk->uid; /* FIXME: find out whether or not this is supposed to be c*time. */ - info.si_utime = tsk->times.tms_utime; - info.si_stime = tsk->times.tms_stime; + info.si_utime = jiffies_to_clock_t(tsk->times.tms_utime); + info.si_stime = jiffies_to_clock_t(tsk->times.tms_stime); status = tsk->exit_code & 0x7f; why = SI_KERNEL; /* shouldn't happen */ diff -ruNp a/kernel/sys.c b/kernel/sys.c --- a/kernel/sys.c 2003-04-03 21:33:54.000000000 -0800 +++ b/kernel/sys.c 2003-04-03 23:15:05.000000000 -0800 @@ -14,7 +14,7 @@ #include <linux/prctl.h> #include <linux/init.h> #include <linux/highuid.h> - +#include <asm/param.h> #include <asm/uaccess.h> #include <asm/io.h> @@ -791,16 +791,23 @@ asmlinkage long sys_setfsgid(gid_t gid) asmlinkage long sys_times(struct tms * tbuf) { + struct tms temp; + /* * In the SMP world we might just be unlucky and have one of * the times increment as we use it. Since the value is an * atomically safe type this is just fine. Conceptually its * as if the syscall took an instant longer to occur. 
*/ - if (tbuf) - if (copy_to_user(tbuf, &current->times, sizeof(struct tms))) + if (tbuf) { + temp.tms_utime = jiffies_to_clock_t(current->times.tms_utime); + temp.tms_stime = jiffies_to_clock_t(current->times.tms_stime); + temp.tms_cutime = jiffies_to_clock_t(current->times.tms_cutime); + temp.tms_cstime = jiffies_to_clock_t(current->times.tms_cstime); + if (copy_to_user(tbuf, &temp, sizeof(struct tms))) return -EFAULT; - return jiffies; + } + return jiffies_to_clock_t(jiffies); } /* diff -ruNp a/kernel/sysctl.c b/kernel/sysctl.c --- a/kernel/sysctl.c 2003-04-03 21:33:54.000000000 -0800 +++ b/kernel/sysctl.c 2003-04-03 23:10:53.000000000 -0800 @@ -53,7 +53,16 @@ extern int max_queued_signals; extern int sysrq_enabled; extern int core_uses_pid; extern int cad_pid; - +extern int min_timeslice; +extern int max_timeslice; +extern int child_penalty; +extern int parent_penalty; +extern int exit_weight; +extern int prio_bonus_ratio; +extern int interactive_delta; +extern int max_sleep_avg; +extern int starvation_limit; + /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */ static int maxolduid = 65535; static int minolduid; @@ -112,6 +121,7 @@ static struct ctl_table_header root_tabl static ctl_table kern_table[]; static ctl_table vm_table[]; +static ctl_table sched_table[]; #ifdef CONFIG_NET extern ctl_table net_table[]; #endif @@ -156,6 +166,7 @@ static ctl_table root_table[] = { {CTL_FS, "fs", NULL, 0, 0555, fs_table}, {CTL_DEBUG, "debug", NULL, 0, 0555, debug_table}, {CTL_DEV, "dev", NULL, 0, 0555, dev_table}, + {CTL_SCHED, "sched", NULL, 0, 0555, sched_table}, {0} }; @@ -329,8 +340,42 @@ static ctl_table debug_table[] = { static ctl_table dev_table[] = { {0} -}; +}; + +static int zero = 0; +static int one = 1; +static ctl_table sched_table[] = { + {SCHED_MAX_TIMESLICE, "max_timeslice", &max_timeslice, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &one, NULL}, + {SCHED_MIN_TIMESLICE, "min_timeslice", &min_timeslice, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &one, NULL}, + {SCHED_CHILD_PENALTY, "child_penalty", &child_penalty, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &zero, NULL}, + {SCHED_PARENT_PENALTY, "parent_penalty", &parent_penalty, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &zero, NULL}, + {SCHED_EXIT_WEIGHT, "exit_weight", &exit_weight, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &zero, NULL}, + {SCHED_PRIO_BONUS_RATIO, "prio_bonus_ratio", &prio_bonus_ratio, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &zero, NULL}, + {SCHED_INTERACTIVE_DELTA, "interactive_delta", &interactive_delta, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &zero, NULL}, + {SCHED_MAX_SLEEP_AVG, "max_sleep_avg", &max_sleep_avg, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &one, NULL}, + {SCHED_STARVATION_LIMIT, "starvation_limit", &starvation_limit, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &zero, NULL}, + {0} +}; + extern void init_irq_proc (void); void __init sysctl_init(void) -- Eric Wong
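
P.S. A few userspace sketches, as promised. None of this is part of the
patch itself; the file names and example values are made up for
illustration. First, reading (and, as root, setting) one of the
/proc/sys/sched knobs. The tunable names come straight from sched_table
above:

/* sched-knob.c: show, and optionally set, a /proc/sys/sched tunable.
 * Build: gcc -o sched-knob sched-knob.c
 * Usage: ./sched-knob max_timeslice [new_value]
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[128], buf[64];
	FILE *f;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <tunable> [new_value]\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/sys/sched/%s", argv[1]);

	f = fopen(path, "r");
	if (!f) { perror(path); return 1; }
	if (fgets(buf, sizeof(buf), f))
		printf("%s = %s", argv[1], buf);
	fclose(f);

	if (argc > 2) {		/* entries are mode 0644, so writing needs root */
		f = fopen(path, "w");
		if (!f) { perror(path); return 1; }
		fprintf(f, "%s\n", argv[2]);
		fclose(f);
	}
	return 0;
}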
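
Second, the timeslice and bonus arithmetic from task_timeslice() and
effective_prio(), redone in plain C so the defaults are easy to play with.
MAX_PRIO=140 and MAX_USER_PRIO=40 are my assumption here (the usual O(1)
scheduler values); sched.h isn't in this diff, so check your tree:

/* slice-math.c: worked example of the scheduler arithmetic.
 * Uses this patch's defaults: max_timeslice=200, min_timeslice=5,
 * prio_bonus_ratio=25, max_sleep_avg=10000 (10*HZ at HZ=1000, as ms).
 * MAX_PRIO and MAX_USER_PRIO are assumed, not taken from this diff.
 */
#include <stdio.h>

#define MAX_PRIO	140	/* assumed: 100 RT levels + 40 nice levels */
#define MAX_USER_PRIO	40

static int max_timeslice = 200, min_timeslice = 5;
static int prio_bonus_ratio = 25, max_sleep_avg = 10000;

/* mirrors BASE_TIMESLICE() plus the MIN_TIMESLICE clamp */
static int timeslice(int static_prio)
{
	int ts = max_timeslice * (MAX_PRIO - static_prio) / MAX_USER_PRIO;
	return ts < min_timeslice ? min_timeslice : ts;
}

/* mirrors the bonus term in effective_prio() */
static int bonus(int sleep_avg)
{
	return MAX_USER_PRIO * prio_bonus_ratio * sleep_avg / max_sleep_avg / 100
		- MAX_USER_PRIO * prio_bonus_ratio / 100 / 2;
}

int main(void)
{
	/* nice -20 is static_prio 100, nice 0 is 120, nice +19 is 139 */
	printf("nice -20: %3d ms\n", timeslice(100));	/* 200 ms */
	printf("nice   0: %3d ms\n", timeslice(120));	/* 100 ms */
	printf("nice +19: %3d ms\n", timeslice(139));	/*   5 ms */
	printf("bonus, long sleeper: %+d\n", bonus(max_sleep_avg));	/* +5 */
	printf("bonus, CPU hog:      %+d\n", bonus(0));			/* -5 */
	return 0;
}

With prio_bonus_ratio=25 the dynamic priority moves at most 5 levels either
way around static_prio, which is why a nice +19 task can never climb past a
nice 0 hog, exactly as the proc.txt blurb claims.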
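
Finally, a quick check that an HZ=1000 kernel still looks like HZ=100 to
userspace, which is what all the jiffies_to_clock_t() conversions are for.
Plain POSIX, nothing patch-specific:

/* userhz.c: sysconf(_SC_CLK_TCK) and times() should both still be in
 * hundredths of a second, no matter what CONFIG_HZ was set to.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/times.h>

int main(void)
{
	struct tms t;
	clock_t before, after;
	volatile unsigned long i, x = 0;

	printf("_SC_CLK_TCK = %ld\n", sysconf(_SC_CLK_TCK));	/* expect 100 */

	before = times(&t);
	for (i = 0; i < 200000000UL; i++)	/* burn some CPU */
		x += i;
	after = times(&t);

	/* both values are in clock ticks, i.e. hundredths of a second */
	printf("elapsed %ld ticks, utime %ld ticks\n",
	       (long)(after - before), (long)t.tms_utime);
	return 0;
}

If this reports 100 ticks/sec but the elapsed and utime counts come out
about ten times too big, one of the conversions above got missed.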