Hi, I've ported some parts of the latest 2.5.6x scheduler to 2.4.20-ck4. I also included variable-hz again for 1000 Hz (to match 2.5), as well as sched-tunables. I'm not sure how correct it is, but it seems to work well. I made these against ck4-rmap15d with the rmap15e incremental patch, ignoring the elevator.h unpatches in the 15e incremental. Contest benchmarks in another email. A few rough userspace sketches for poking at the new knobs are appended after the patch. diff -ruNp a/Documentation/Configure.help b/Documentation/Configure.help --- a/Documentation/Configure.help 2003-04-03 21:31:34.000000000 -0800 +++ b/Documentation/Configure.help 2003-04-03 23:15:05.000000000 -0800 @@ -2439,6 +2439,18 @@ CONFIG_HEARTBEAT behaviour is platform-dependent, but normally the flash frequency is a hyperbolic function of the 5-minute load average. +Timer frequency +CONFIG_HZ + The frequency at which the system timer interrupt fires. Higher tick + values provide improved granularity of timers, improved select() and + poll() performance, and lower scheduling latency. Higher values, however, + increase interrupt overhead and cause the jiffies counter to wrap sooner. + For compatibility, the tick count is always exported as if HZ=100. + + The default value, which was the value for all of eternity, is 100. If + you are looking to provide better timer granularity or increased desktop + performance, try 500 or 1000. If unsure, go with the default of 100. + Networking support CONFIG_NET Unless you really know what you are doing, you should say Y here. diff -ruNp a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt --- a/Documentation/filesystems/proc.txt 2002-12-09 02:24:08.000000000 -0800 +++ b/Documentation/filesystems/proc.txt 2003-04-03 23:10:53.000000000 -0800 @@ -37,6 +37,7 @@ Table of Contents 2.8 /proc/sys/net/ipv4 - IPV4 settings 2.9 Appletalk 2.10 IPX + 2.11 /proc/sys/sched - scheduler tunables ------------------------------------------------------------------------------ Preface @@ -1779,6 +1780,92 @@ The /proc/net/ipx_route table holds a gives the destination network, the router node (or Directly) and the network address of the router (or Connected) for internal networks. +2.11 /proc/sys/sched - scheduler tunables +----------------------------------------- + +Useful knobs for tuning the scheduler live in /proc/sys/sched. + +child_penalty +------------- + +Percentage of the parent's sleep_avg that children inherit. sleep_avg is +a running average of the time a process spends sleeping. Tasks with high +sleep_avg values are considered interactive and given a higher dynamic +priority and a larger timeslice. You typically want this to be some value +just under 100. + +exit_weight +----------- + +When a CPU hog task exits, its parent's sleep_avg is reduced by a factor of +exit_weight against the exiting task's sleep_avg. + +interactive_delta +----------------- + +If a task is "interactive" it is reinserted into the active array after it +has expired its timeslice, instead of being inserted into the expired array. +How interactive a task must be in order to receive this treatment is a +function of its nice value. This interactive limit is scaled linearly by nice +value and is offset by the interactive_delta. + +max_sleep_avg +------------- + +max_sleep_avg is the largest value (in ms) stored for a task's running sleep +average. The larger this value, the longer a task needs to sleep to be +considered interactive (maximum interactive bonus is a function of +max_sleep_avg). + +max_timeslice +------------- + +Maximum timeslice, in milliseconds.
This is the value given to tasks of the +highest dynamic priority. + +min_timeslice +------------- + +Minimum timeslice, in milliseconds. This is the value given to tasks of the +lowest dynamic priority. Every task gets at least this slice of the processor +per array switch. + +parent_penalty +-------------- + +Percentage of the parent's sleep_avg that it retains across a fork(). +sleep_avg is a running average of the time a process spends sleeping. Tasks +with high sleep_avg values are considered interactive and given a higher +dynamic priority and a larger timeslice. Normally, this value is 100 and thus +tasks retain their sleep_avg on fork. If you want to punish interactive +tasks for forking, set this below 100. + +prio_bonus_ratio +---------------- + +Middle percentage of the priority range that tasks can receive as a dynamic +priority bonus. The default value of 25% ensures that nice values at the +extremes are still enforced. For example, nice +19 interactive tasks will +never be able to preempt a nice 0 CPU hog. Setting this higher will increase +the size of the priority range the tasks can receive as a bonus. Setting +this lower will decrease this range, making the interactivity bonus less +apparent and user nice values more applicable. + +starvation_limit +---------------- + +Sufficiently interactive tasks are reinserted into the active array when they +run out of timeslice. Normally, tasks are inserted into the expired array. +Reinserting interactive tasks into the active array allows them to remain +runnable, which is important to interactive performance. This could starve +expired tasks, however, since the interactive task could prevent the array +switch. To prevent the tasks on the expired array from starving for too long, +starvation_limit is the longest time (in ms) we will let the expired array +starve at the expense of reinserting interactive tasks back into the active +array. Higher values here give more preference to running interactive tasks, +at the expense of expired tasks. Lower values provide fairer scheduling +behavior, at the expense of interactivity. + ------------------------------------------------------------------------------ Summary ------------------------------------------------------------------------------ diff -ruNp a/arch/i386/config.in b/arch/i386/config.in --- a/arch/i386/config.in 2003-04-03 21:33:54.000000000 -0800 +++ b/arch/i386/config.in 2003-04-03 23:15:05.000000000 -0800 @@ -240,6 +240,7 @@ endmenu mainmenu_option next_comment comment 'General setup' +int 'Timer frequency (HZ)' CONFIG_HZ 1000 bool 'Networking support' CONFIG_NET # Visual Workstation support is utterly broken. diff -ruNp a/fs/proc/array.c b/fs/proc/array.c --- a/fs/proc/array.c 2003-04-03 21:33:54.000000000 -0800 +++ b/fs/proc/array.c 2003-04-03 23:15:05.000000000 -0800 @@ -360,15 +360,15 @@ int proc_pid_stat(struct task_struct *ta task->cmin_flt, task->maj_flt, task->cmaj_flt, - task->times.tms_utime, - task->times.tms_stime, - task->times.tms_cutime, - task->times.tms_cstime, + jiffies_to_clock_t(task->times.tms_utime), + jiffies_to_clock_t(task->times.tms_stime), + jiffies_to_clock_t(task->times.tms_cutime), + jiffies_to_clock_t(task->times.tms_cstime), priority, nice, 0UL /* removed */, - task->it_real_value, - task->start_time, + jiffies_to_clock_t(task->it_real_value), + jiffies_to_clock_t(task->start_time), vsize, mm ?
mm->rss : 0, /* you might want to shift this left 3 */ task->rlim[RLIMIT_RSS].rlim_cur, @@ -687,14 +687,14 @@ int proc_pid_cpu(struct task_struct *tas len = sprintf(buffer, "cpu %lu %lu\n", - task->times.tms_utime, - task->times.tms_stime); + jiffies_to_clock_t(task->times.tms_utime), + jiffies_to_clock_t(task->times.tms_stime)); for (i = 0 ; i < smp_num_cpus; i++) len += sprintf(buffer + len, "cpu%d %lu %lu\n", i, - task->per_cpu_utime[cpu_logical_map(i)], - task->per_cpu_stime[cpu_logical_map(i)]); + jiffies_to_clock_t(task->per_cpu_utime[cpu_logical_map(i)]), + jiffies_to_clock_t(task->per_cpu_stime[cpu_logical_map(i)])); return len; } diff -ruNp a/fs/proc/proc_misc.c b/fs/proc/proc_misc.c --- a/fs/proc/proc_misc.c 2003-04-03 21:33:54.000000000 -0800 +++ b/fs/proc/proc_misc.c 2003-04-03 23:15:05.000000000 -0800 @@ -316,16 +316,16 @@ static int kstat_read_proc(char *page, c { int i, len = 0; extern unsigned long total_forks; - unsigned long jif = jiffies; + unsigned long jif = jiffies_to_clock_t(jiffies); unsigned int sum = 0, user = 0, nice = 0, system = 0; int major, disk; for (i = 0 ; i < smp_num_cpus; i++) { int cpu = cpu_logical_map(i), j; - user += kstat.per_cpu_user[cpu]; - nice += kstat.per_cpu_nice[cpu]; - system += kstat.per_cpu_system[cpu]; + user += jiffies_to_clock_t(kstat.per_cpu_user[cpu]); + nice += jiffies_to_clock_t(kstat.per_cpu_nice[cpu]); + system += jiffies_to_clock_t(kstat.per_cpu_system[cpu]); #if !defined(CONFIG_ARCH_S390) for (j = 0 ; j < NR_IRQS ; j++) sum += kstat.irqs[cpu][j]; @@ -339,10 +339,10 @@ static int kstat_read_proc(char *page, c proc_sprintf(page, &off, &len, "cpu%d %u %u %u %lu\n", i, - kstat.per_cpu_user[cpu_logical_map(i)], - kstat.per_cpu_nice[cpu_logical_map(i)], - kstat.per_cpu_system[cpu_logical_map(i)], - jif - ( kstat.per_cpu_user[cpu_logical_map(i)] \ + jiffies_to_clock_t(kstat.per_cpu_user[cpu_logical_map(i)]), + jiffies_to_clock_t(kstat.per_cpu_nice[cpu_logical_map(i)]), + jiffies_to_clock_t(kstat.per_cpu_system[cpu_logical_map(i)]), + jif - jiffies_to_clock_t(kstat.per_cpu_user[cpu_logical_map(i)] \ + kstat.per_cpu_nice[cpu_logical_map(i)] \ + kstat.per_cpu_system[cpu_logical_map(i)])); proc_sprintf(page, &off, &len, diff -ruNp a/include/asm-i386/param.h b/include/asm-i386/param.h --- a/include/asm-i386/param.h 2000-10-27 11:04:43.000000000 -0700 +++ b/include/asm-i386/param.h 2003-04-03 23:15:05.000000000 -0800 @@ -1,8 +1,17 @@ #ifndef _ASMi386_PARAM_H #define _ASMi386_PARAM_H +#include <linux/config.h> + +#ifdef __KERNEL__ +# define HZ CONFIG_HZ /* internal kernel timer frequency */ +# define USER_HZ 100 /* some user interfaces are in ticks */ +# define CLOCKS_PER_SEC (USER_HZ) /* like times() */ +# define jiffies_to_clock_t(x) ((x) / ((HZ) / (USER_HZ))) +#endif + #ifndef HZ -#define HZ 100 +#define HZ 100 /* if userspace cheats, give them 100 */ #endif #define EXEC_PAGESIZE 4096 @@ -17,8 +26,4 @@ #define MAXHOSTNAMELEN 64 /* max length of hostname */ -#ifdef __KERNEL__ -# define CLOCKS_PER_SEC 100 /* frequency at which times() counts */ -#endif - #endif diff -ruNp a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h 2003-04-03 21:33:54.000000000 -0800 +++ b/include/linux/sched.h 2003-04-03 23:10:53.000000000 -0800 @@ -356,7 +356,7 @@ struct task_struct { prio_array_t *array; unsigned long sleep_avg; - unsigned long sleep_timestamp; + unsigned long last_run; unsigned long policy; unsigned long cpus_allowed; @@ -387,6 +387,7 @@ struct task_struct { * older sibling, respectively. 
(p->father can be replaced with * p->p_pptr->pid) */ + struct task_struct *parent; task_t *p_opptr, *p_pptr, *p_cptr, *p_ysptr, *p_osptr; struct list_head thread_group; diff -ruNp a/include/linux/sysctl.h b/include/linux/sysctl.h --- a/include/linux/sysctl.h 2003-04-03 21:33:54.000000000 -0800 +++ b/include/linux/sysctl.h 2003-04-03 23:10:53.000000000 -0800 @@ -63,7 +63,8 @@ enum CTL_DEV=7, /* Devices */ CTL_BUS=8, /* Busses */ CTL_ABI=9, /* Binary emulation */ - CTL_CPU=10 /* CPU stuff (speed scaling, etc) */ + CTL_CPU=10, /* CPU stuff (speed scaling, etc) */ + CTL_SCHED=11, /* scheduler tunables */ }; /* CTL_BUS names: */ @@ -148,6 +149,19 @@ enum VM_PAGEBUF=14, /* struct: Control pagebuf parameters */ }; +/* Tunable scheduler parameters in /proc/sys/sched/ */ +enum +{ + SCHED_MIN_TIMESLICE=1, /* minimum process timeslice */ + SCHED_MAX_TIMESLICE=2, /* maximum process timeslice */ + SCHED_CHILD_PENALTY=3, /* penalty on fork to child */ + SCHED_PARENT_PENALTY=4, /* penalty on fork to parent */ + SCHED_EXIT_WEIGHT=5, /* penalty to parent of CPU hog child */ + SCHED_PRIO_BONUS_RATIO=6, /* percent of max prio given as bonus */ + SCHED_INTERACTIVE_DELTA=7, /* delta used to scale interactivity */ + SCHED_MAX_SLEEP_AVG=8, /* maximum sleep avg attainable */ + SCHED_STARVATION_LIMIT=9, /* no re-active if expired is starved */ +}; /* CTL_NET names: */ enum diff -ruNp a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c 2003-04-03 21:33:54.000000000 -0800 +++ b/kernel/fork.c 2003-04-03 23:10:53.000000000 -0800 @@ -727,7 +727,7 @@ int do_fork(unsigned long clone_flags, u current->time_slice = 1; scheduler_tick(0,0); } - p->sleep_timestamp = jiffies; + p->last_run = jiffies; __sti(); /* diff -ruNp a/kernel/sched.c b/kernel/sched.c --- a/kernel/sched.c 2003-04-03 21:33:54.000000000 -0800 +++ b/kernel/sched.c 2003-04-03 23:10:53.000000000 -0800 @@ -52,15 +52,26 @@ * maximum timeslice is 300 msecs. Timeslices get refilled after * they expire. */ -#define MIN_TIMESLICE ( 10 * HZ / 1000 ) -#define MAX_TIMESLICE ( 1000 * HZ / 1000 ) -#define CHILD_PENALTY 95 -#define PARENT_PENALTY 100 -#define EXIT_WEIGHT 3 -#define PRIO_BONUS_RATIO 15 -#define INTERACTIVE_DELTA 4 -#define MAX_SLEEP_AVG (2*HZ) -#define STARVATION_LIMIT (3*HZ) +int min_timeslice = ((5 * HZ) / 1000 ?: 1); +int max_timeslice = (200 * HZ) / 1000; +int child_penalty = 50; +int parent_penalty = 100; +int exit_weight = 3; +int prio_bonus_ratio = 25; +int interactive_delta = 2; +int max_sleep_avg = 10 * HZ; +int starvation_limit = 10 * HZ; + +#define MIN_TIMESLICE (min_timeslice) +#define MAX_TIMESLICE (max_timeslice) +#define CHILD_PENALTY (child_penalty) +#define PARENT_PENALTY (parent_penalty) +#define EXIT_WEIGHT (exit_weight) +#define PRIO_BONUS_RATIO (prio_bonus_ratio) +#define INTERACTIVE_DELTA (interactive_delta) +#define MAX_SLEEP_AVG (max_sleep_avg) +#define STARVATION_LIMIT (starvation_limit) +#define TIMESLICE_GRANULARITY (HZ/20 ?: 1) /* * If a task is 'interactive' then we reinsert it in the active @@ -115,14 +126,19 @@ * downside in using shorter timeslices. 
*/ -static inline unsigned int task_timeslice(task_t *p) +#define BASE_TIMESLICE(p) \ + (MAX_TIMESLICE * (MAX_PRIO-(p)->static_prio)/MAX_USER_PRIO) + +static unsigned int task_timeslice(task_t *p) { - if (p->policy == SCHED_BATCH) - return MAX_TIMESLICE; - else - return MIN_TIMESLICE; -} + unsigned int time_slice = BASE_TIMESLICE(p); + + if (time_slice < MIN_TIMESLICE) + time_slice = MIN_TIMESLICE; + return time_slice; +} + /* * These are the runqueue data structures: */ @@ -149,6 +165,7 @@ struct runqueue { unsigned long nr_running, nr_switches, expired_timestamp, nr_uninterruptible; task_t *curr, *idle; + struct mm_struct *prev_mm; prio_array_t *active, *expired, arrays[2]; int prev_nr_running[NR_CPUS]; @@ -191,6 +208,10 @@ static struct runqueue runqueues[NR_CPUS # define task_running(rq, p) ((rq)->curr == (p)) #endif +# define nr_running_init(rq) do { } while (0) +# define nr_running_inc(rq) do { (rq)->nr_running++; } while (0) +# define nr_running_dec(rq) do { (rq)->nr_running--; } while (0) + /* * task_rq_lock - lock the runqueue a given task resides on and disable * interrupts. Note the ordering: we can safely lookup the task_rq without @@ -273,6 +294,9 @@ static inline int effective_prio(task_t * * Both properties are important to certain workloads. */ + if (rt_task(p)) + return p->prio; + bonus = MAX_USER_PRIO*PRIO_BONUS_RATIO*p->sleep_avg/MAX_SLEEP_AVG/100 - MAX_USER_PRIO*PRIO_BONUS_RATIO/100/2; @@ -284,27 +308,58 @@ static inline int effective_prio(task_t return prio; } -static inline void activate_task(task_t *p, runqueue_t *rq) +static inline void __activate_task(task_t *p, runqueue_t *rq) { - unsigned long sleep_time = jiffies - p->sleep_timestamp; - prio_array_t *array = rq->active; + enqueue_task(p, rq->active); + nr_running_inc(rq); +} - if (!rt_task(p) && sleep_time) { - /* - * This code gives a bonus to interactive tasks. We update - * an 'average sleep time' value here, based on - * sleep_timestamp. The more time a task spends sleeping, - * the higher the average gets - and the higher the priority - * boost gets as well. - */ - p->sleep_avg += sleep_time; - if (p->sleep_avg > MAX_SLEEP_AVG) - p->sleep_avg = MAX_SLEEP_AVG; - p->prio = effective_prio(p); +static inline int activate_task(task_t *p, runqueue_t *rq) +{ + long sleep_time = jiffies - p->last_run - 1; + int requeue_waker = 0; + + if (sleep_time > 0) { + int sleep_avg; + + /* + * This code gives a bonus to interactive tasks. + * + * The boost works by updating the 'average sleep time' + * value here, based on ->last_run. The more time a task + * spends sleeping, the higher the average gets - and the + * higher the priority boost gets as well. + */ + sleep_avg = p->sleep_avg + sleep_time; + + /* + * 'Overflow' bonus ticks go to the waker as well, so the + * ticks are not lost. This has the effect of further + * boosting tasks that are related to maximum-interactive + * tasks. 
+ */ + if (sleep_avg > MAX_SLEEP_AVG) { + if (!in_interrupt()) { + sleep_avg += current->sleep_avg - MAX_SLEEP_AVG; + if (sleep_avg > MAX_SLEEP_AVG) + sleep_avg = MAX_SLEEP_AVG; + + if (current->sleep_avg != sleep_avg) { + current->sleep_avg = sleep_avg; + requeue_waker = 1; + } + } + sleep_avg = MAX_SLEEP_AVG; + } + if (p->sleep_avg != sleep_avg) { + p->sleep_avg = sleep_avg; + p->prio = effective_prio(p); } - enqueue_task(p, array); - rq->nr_running++; } + __activate_task(p, rq); + + return requeue_waker; +} static inline void activate_batch_task(task_t *p, runqueue_t *rq) { @@ -316,7 +371,7 @@ static inline void activate_batch_task(t static inline void deactivate_task(struct task_struct *p, runqueue_t *rq) { - rq->nr_running--; + nr_running_dec(rq); if (p->state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible++; dequeue_task(p, p->array); @@ -378,7 +433,7 @@ static inline void resched_task(task_t * * ptrace() code. */ void wait_task_inactive(task_t * p) - { +{ unsigned long flags; runqueue_t *rq; @@ -419,23 +474,8 @@ repeat: */ void kick_if_running(task_t * p) { - if (task_running(task_rq(p), p) && (p->cpu != smp_processor_id())) + if (task_running(task_rq(p), p) && (task_cpu(p) != smp_processor_id())) resched_task(p); - /* - * If batch processes get signals but are not running currently - * then give them a chance to handle the signal. (the kernel - * side signal handling code will run for sure, the userspace - * part depends on system load and might be delayed indefinitely.) - */ - if (p->policy == SCHED_BATCH) { - unsigned long flags; - runqueue_t *rq; - - rq = task_rq_lock(p, &flags); - if (p->flags & PF_BATCH) - activate_batch_task(p, rq); - task_rq_unlock(rq, &flags); - } } /* @@ -449,70 +489,99 @@ void kick_if_running(task_t * p) * returns failure only if the task is already active. */ -static int try_to_wake_up(task_t * p, int sync) +static int try_to_wake_up(task_t * p, unsigned int state, int sync) { + int success = 0, requeue_waker = 0; unsigned long flags; - int success = 0; long old_state; runqueue_t *rq; repeat_lock_task: rq = task_rq_lock(p, &flags); old_state = p->state; - if (!p->array) { - /* - * Fast-migrate the task if it's not running or runnable - * currently. Do not violate hard affinity. - */ - if (unlikely(sync && !task_running(rq, p) && - (task_cpu(p) != smp_processor_id()) && - (p->cpus_allowed & (1UL << smp_processor_id())))) { - - set_task_cpu(p, smp_processor_id()); + if (old_state & state) { + if (!p->array) { + /* + * Fast-migrate the task if it's not running or runnable + * currently. Do not violate hard affinity. 
+ */ + if (unlikely(sync && !task_running(rq, p) && + (task_cpu(p) != smp_processor_id()) && + (p->cpus_allowed & (1UL << smp_processor_id())))) { + + set_task_cpu(p, smp_processor_id()); + + task_rq_unlock(rq, &flags); + goto repeat_lock_task; + } + if (old_state == TASK_UNINTERRUPTIBLE) + rq->nr_uninterruptible--; - task_rq_unlock(rq, &flags); - goto repeat_lock_task; + if (sync) + __activate_task(p, rq); + else { + requeue_waker = activate_task(p, rq); + if (p->prio < rq->curr->prio) + resched_task(rq->curr); + } + success = 1; } - if (old_state == TASK_UNINTERRUPTIBLE) - rq->nr_uninterruptible--; - activate_task(p, rq); - - if (p->prio < rq->curr->prio || rq->curr->policy == SCHED_BATCH) - resched_task(rq->curr); - success = 1; + p->state = TASK_RUNNING; } - p->state = TASK_RUNNING; task_rq_unlock(rq, &flags); + /* + * We have to do this outside the other spinlock, the two + * runqueues might be different: + */ + if (requeue_waker) { + prio_array_t *array; + + rq = task_rq_lock(current, &flags); + array = current->array; + dequeue_task(current, array); + current->prio = effective_prio(current); + enqueue_task(current, array); + task_rq_unlock(rq, &flags); + } + return success; } int wake_up_process(task_t * p) { - return try_to_wake_up(p, 0); + return try_to_wake_up(p, TASK_STOPPED | TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0); } void wake_up_forked_process(task_t * p) { - runqueue_t *rq ; + runqueue_t *rq; + unsigned long flags; preempt_disable(); - rq = this_rq_lock(); + + rq = task_rq_lock(current, &flags); p->state = TASK_RUNNING; - if (!rt_task(p)) { - /* - * We decrease the sleep average of forking parents - * and children as well, to keep max-interactive tasks - * from forking tasks that are max-interactive. - */ - current->sleep_avg = current->sleep_avg * PARENT_PENALTY / 100; - p->sleep_avg = p->sleep_avg * CHILD_PENALTY / 100; - p->prio = effective_prio(p); -} + /* + * We decrease the sleep average of forking parents + * and children as well, to keep max-interactive tasks + * from forking tasks that are max-interactive. + */ + current->sleep_avg = current->sleep_avg * PARENT_PENALTY / 100; + p->sleep_avg = p->sleep_avg * CHILD_PENALTY / 100; + p->prio = effective_prio(p); set_task_cpu(p, smp_processor_id()); - activate_task(p, rq); - rq_unlock(rq); + if (unlikely(!current->array)) + __activate_task(p, rq); + else { + p->prio = current->prio; + list_add_tail(&p->run_list, &current->run_list); + p->array = current->array; + p->array->nr_active++; + nr_running_inc(rq); + } + task_rq_unlock(rq, &flags); preempt_enable(); } @@ -527,13 +596,15 @@ void wake_up_forked_process(task_t * p) */ void sched_exit(task_t * p) { - __cli(); + unsigned long flags; + + local_irq_save(flags); if (p->first_time_slice) { current->time_slice += p->time_slice; if (unlikely(current->time_slice > MAX_TIMESLICE)) current->time_slice = MAX_TIMESLICE; } - __sti(); + local_irq_restore(flags); /* * If the child was a (relative-) CPU hog then decrease * the sleep_avg of the parent as well. @@ -550,7 +621,7 @@ asmlinkage void schedule_tail(task_t *pr } #endif -static inline task_t * context_switch(task_t *prev, task_t *next) +static inline task_t * context_switch(runqueue_t *rq, task_t *prev, task_t *next) { struct mm_struct *mm = next->mm; struct mm_struct *oldmm = prev->active_mm; @@ -564,7 +635,7 @@ static inline task_t * context_switch(ta if (unlikely(!prev->mm)) { prev->active_mm = NULL; - mmdrop(oldmm); + rq->prev_mm = oldmm; } /* Here we just switch the register state and the stack.
*/ @@ -824,9 +895,9 @@ static inline runqueue_t *find_busiest_q static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) { dequeue_task(p, src_array); - src_rq->nr_running--; + nr_running_dec(src_rq); set_task_cpu(p, this_cpu); - this_rq->nr_running++; + nr_running_inc(this_rq); enqueue_task(p, this_rq->active); /* * Note that idle threads have a prio of MAX_PRIO, for this test @@ -834,6 +905,11 @@ static inline void pull_task(runqueue_t */ if (p->prio < this_rq->curr->prio) set_need_resched(); + else { + if (p->prio == this_rq->curr->prio && + p->time_slice > this_rq->curr->time_slice) + set_need_resched(); + } } /* @@ -896,7 +972,7 @@ skip_queue: */ #define CAN_MIGRATE_TASK(p,rq,this_cpu) \ - ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ + ((idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \ !task_running(rq, p) && \ ((p)->cpus_allowed & (1UL << (this_cpu)))) @@ -954,9 +1030,9 @@ static inline void idle_tick(runqueue_t * increasing number of running tasks: */ #define EXPIRED_STARVING(rq) \ - ((rq)->expired_timestamp && \ + (STARVATION_LIMIT && ((rq)->expired_timestamp && \ (jiffies - (rq)->expired_timestamp >= \ - STARVATION_LIMIT * ((rq)->nr_running) + 1)) + STARVATION_LIMIT * ((rq)->nr_running) + 1 ))) /* * This function gets called by the timer code, with HZ frequency. @@ -985,7 +1061,7 @@ void scheduler_tick(int user_ticks, int } } - if (p == rq->idle || p->policy == SCHED_BATCH) + if (p == rq->idle) rq->idle_count++; #endif if (p == rq->idle) { @@ -996,7 +1072,7 @@ void scheduler_tick(int user_ticks, int #endif return; } - if (TASK_NICE(p) > 0 || p->policy == SCHED_BATCH) + if (TASK_NICE(p) > 0) kstat.per_cpu_nice[cpu] += user_ticks; else kstat.per_cpu_user[cpu] += user_ticks; @@ -1008,6 +1084,17 @@ void scheduler_tick(int user_ticks, int return; } spin_lock(&rq->lock); + /* + * The task was running during this tick - update the + * time slice counter and the sleep average. Note: we + * do not update a process's priority until it either + * goes to sleep or uses up its timeslice. This makes + * it possible for interactive tasks to use up their + * timeslices at their highest priority levels. + */ + if (p->sleep_avg) + p->sleep_avg--; + if (unlikely(rt_task(p))) { /* * RR tasks need a special form of timeslice management. @@ -1024,16 +1111,6 @@ void scheduler_tick(int user_ticks, int } goto out; } - /* - * The task was running during this tick - update the - * time slice counter and the sleep average. Note: we - * do not update a process's priority until it either - * goes to sleep or uses up its timeslice. This makes - * it possible for interactive tasks to use up their - * timeslices at their highest priority levels. - */ - if (p->sleep_avg) - p->sleep_avg--; if (!--p->time_slice) { dequeue_task(p, rq->active); set_tsk_need_resched(p); @@ -1047,6 +1124,28 @@ void scheduler_tick(int user_ticks, int enqueue_task(p, rq->expired); } else enqueue_task(p, rq->active); + } else { + /* + * Prevent a too long timeslice allowing a task to monopolize + * the CPU. We do this by splitting up the timeslice into + * smaller pieces. + * + * Note: this does not mean the task's timeslices expire or + * get lost in any way, they just might be preempted by + * another task of equal priority. (one with higher + * priority would have preempted this task already.) We + * requeue this task to the end of the list on this priority + * level, which is in essence a round-robin of tasks with + * equal priority. 
+ */ + if (!(p->time_slice % TIMESLICE_GRANULARITY) && + (p->array == rq->active)) { + dequeue_task(p, rq->active); + set_tsk_need_resched(p); + p->prio = effective_prio(p); + enqueue_task(p, rq->active); + } + } out: #if CONFIG_SMP @@ -1107,7 +1206,7 @@ need_resched: rq = this_rq(); release_kernel_lock(prev, smp_processor_id()); - prev->sleep_timestamp = jiffies; + prev->last_run = jiffies; spin_lock_irq(&rq->lock); /* @@ -1173,7 +1272,7 @@ switch_tasks: rq->curr = next; prepare_arch_switch(rq, next); - prev = context_switch(prev, next); + prev = context_switch(rq, prev, next); barrier(); rq = this_rq(); finish_arch_switch(rq, prev); @@ -1230,7 +1337,7 @@ static inline void __wake_up_common(wait curr = list_entry(tmp, wait_queue_t, task_list); p = curr->task; state = p->state; - if ((state & mode) && try_to_wake_up(p, sync) && + if ((state & mode) && try_to_wake_up(p, state, sync) && ((curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)) break; } @@ -1443,7 +1550,7 @@ asmlinkage long sys_nice(int increment) */ int task_prio(task_t *p) { - return p->prio - MAX_USER_RT_PRIO; + return p->prio - MAX_RT_PRIO; } int task_nice(task_t *p) @@ -1536,7 +1643,7 @@ static int setscheduler(pid_t pid, int p else p->prio = p->static_prio; if (array) - activate_task(p, task_rq(p)); + __activate_task(p, task_rq(p)); out_unlock: task_rq_unlock(rq, &flags); @@ -2221,7 +2328,7 @@ void __init sched_init(void) rq->curr = current; rq->idle = current; set_task_cpu(current, smp_processor_id()); - wake_up_process(current); + wake_up_forked_process(current); init_timervecs(); init_bh(TIMER_BH, timer_bh); diff -ruNp a/kernel/signal.c b/kernel/signal.c --- a/kernel/signal.c 2003-04-03 21:33:54.000000000 -0800 +++ b/kernel/signal.c 2003-04-03 23:15:05.000000000 -0800 @@ -13,7 +13,7 @@ #include <linux/smp_lock.h> #include <linux/init.h> #include <linux/sched.h> - +#include <asm/param.h> #include <asm/uaccess.h> /* @@ -775,8 +775,8 @@ void do_notify_parent(struct task_struct info.si_uid = tsk->uid; /* FIXME: find out whether or not this is supposed to be c*time. */ - info.si_utime = tsk->times.tms_utime; - info.si_stime = tsk->times.tms_stime; + info.si_utime = jiffies_to_clock_t(tsk->times.tms_utime); + info.si_stime = jiffies_to_clock_t(tsk->times.tms_stime); status = tsk->exit_code & 0x7f; why = SI_KERNEL; /* shouldn't happen */ diff -ruNp a/kernel/sys.c b/kernel/sys.c --- a/kernel/sys.c 2003-04-03 21:33:54.000000000 -0800 +++ b/kernel/sys.c 2003-04-03 23:15:05.000000000 -0800 @@ -14,7 +14,7 @@ #include <linux/prctl.h> #include <linux/init.h> #include <linux/highuid.h> - +#include <asm/param.h> #include <asm/uaccess.h> #include <asm/io.h> @@ -791,16 +791,23 @@ asmlinkage long sys_setfsgid(gid_t gid) asmlinkage long sys_times(struct tms * tbuf) { + struct tms temp; + /* * In the SMP world we might just be unlucky and have one of * the times increment as we use it. Since the value is an * atomically safe type this is just fine. Conceptually its * as if the syscall took an instant longer to occur. 
*/ - if (tbuf) - if (copy_to_user(tbuf, &current->times, sizeof(struct tms))) + if (tbuf) { + temp.tms_utime = jiffies_to_clock_t(current->times.tms_utime); + temp.tms_stime = jiffies_to_clock_t(current->times.tms_stime); + temp.tms_cutime = jiffies_to_clock_t(current->times.tms_cutime); + temp.tms_cstime = jiffies_to_clock_t(current->times.tms_cstime); + if (copy_to_user(tbuf, &temp, sizeof(struct tms))) return -EFAULT; - return jiffies; + } + return jiffies_to_clock_t(jiffies); } /* diff -ruNp a/kernel/sysctl.c b/kernel/sysctl.c --- a/kernel/sysctl.c 2003-04-03 21:33:54.000000000 -0800 +++ b/kernel/sysctl.c 2003-04-03 23:10:53.000000000 -0800 @@ -53,7 +53,16 @@ extern int max_queued_signals; extern int sysrq_enabled; extern int core_uses_pid; extern int cad_pid; - +extern int min_timeslice; +extern int max_timeslice; +extern int child_penalty; +extern int parent_penalty; +extern int exit_weight; +extern int prio_bonus_ratio; +extern int interactive_delta; +extern int max_sleep_avg; +extern int starvation_limit; + /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */ static int maxolduid = 65535; static int minolduid; @@ -112,6 +121,7 @@ static struct ctl_table_header root_tabl static ctl_table kern_table[]; static ctl_table vm_table[]; +static ctl_table sched_table[]; #ifdef CONFIG_NET extern ctl_table net_table[]; #endif @@ -156,6 +166,7 @@ static ctl_table root_table[] = { {CTL_FS, "fs", NULL, 0, 0555, fs_table}, {CTL_DEBUG, "debug", NULL, 0, 0555, debug_table}, {CTL_DEV, "dev", NULL, 0, 0555, dev_table}, + {CTL_SCHED, "sched", NULL, 0, 0555, sched_table}, {0} }; @@ -329,8 +340,42 @@ static ctl_table debug_table[] = { static ctl_table dev_table[] = { {0} -}; +}; + +static int zero = 0; +static int one = 1; +static ctl_table sched_table[] = { + {SCHED_MAX_TIMESLICE, "max_timeslice", &max_timeslice, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &one, NULL}, + {SCHED_MIN_TIMESLICE, "min_timeslice", &min_timeslice, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &one, NULL}, + {SCHED_CHILD_PENALTY, "child_penalty", &child_penalty, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &zero, NULL}, + {SCHED_PARENT_PENALTY, "parent_penalty", &parent_penalty, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &zero, NULL}, + {SCHED_EXIT_WEIGHT, "exit_weight", &exit_weight, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &zero, NULL}, + {SCHED_PRIO_BONUS_RATIO, "prio_bonus_ratio", &prio_bonus_ratio, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &zero, NULL}, + {SCHED_INTERACTIVE_DELTA, "interactive_delta", &interactive_delta, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &zero, NULL}, + {SCHED_MAX_SLEEP_AVG, "max_sleep_avg", &max_sleep_avg, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &one, NULL}, + {SCHED_STARVATION_LIMIT, "starvation_limit", &starvation_limit, + sizeof(int), 0644, NULL, &proc_dointvec_minmax, + &sysctl_intvec, NULL, &zero, NULL}, + {0} +}; + extern void init_irq_proc (void); void __init sysctl_init(void) -- Eric Wong
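
P.S. A few userspace sketches, as promised. None of this is part of the
patch itself; the file names and example values are made up for
illustration. First, reading (and, as root, setting) one of the
/proc/sys/sched knobs. The tunable names come straight from sched_table
above:

/* sched-knob.c: show, and optionally set, a /proc/sys/sched tunable.
 * Build: gcc -o sched-knob sched-knob.c
 * Usage: ./sched-knob max_timeslice [new_value]
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[128], buf[64];
	FILE *f;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <tunable> [new_value]\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/sys/sched/%s", argv[1]);

	f = fopen(path, "r");
	if (!f) { perror(path); return 1; }
	if (fgets(buf, sizeof(buf), f))
		printf("%s = %s", argv[1], buf);
	fclose(f);

	if (argc > 2) {		/* entries are mode 0644, so writing needs root */
		f = fopen(path, "w");
		if (!f) { perror(path); return 1; }
		fprintf(f, "%s\n", argv[2]);
		fclose(f);
	}
	return 0;
}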
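
Second, the timeslice and bonus arithmetic from task_timeslice() and
effective_prio(), redone in plain C so the defaults are easy to play with.
MAX_PRIO=140 and MAX_USER_PRIO=40 are my assumption here (the usual O(1)
scheduler values); sched.h isn't in this diff, so check your tree:

/* slice-math.c: worked example of the scheduler arithmetic.
 * Uses this patch's defaults: max_timeslice=200, min_timeslice=5,
 * prio_bonus_ratio=25, max_sleep_avg=10000 (10*HZ at HZ=1000, as ms).
 * MAX_PRIO and MAX_USER_PRIO are assumed, not taken from this diff.
 */
#include <stdio.h>

#define MAX_PRIO	140	/* assumed: 100 RT levels + 40 nice levels */
#define MAX_USER_PRIO	40

static int max_timeslice = 200, min_timeslice = 5;
static int prio_bonus_ratio = 25, max_sleep_avg = 10000;

/* mirrors BASE_TIMESLICE() plus the MIN_TIMESLICE clamp */
static int timeslice(int static_prio)
{
	int ts = max_timeslice * (MAX_PRIO - static_prio) / MAX_USER_PRIO;
	return ts < min_timeslice ? min_timeslice : ts;
}

/* mirrors the bonus term in effective_prio() */
static int bonus(int sleep_avg)
{
	return MAX_USER_PRIO * prio_bonus_ratio * sleep_avg / max_sleep_avg / 100
		- MAX_USER_PRIO * prio_bonus_ratio / 100 / 2;
}

int main(void)
{
	/* nice -20 is static_prio 100, nice 0 is 120, nice +19 is 139 */
	printf("nice -20: %3d ms\n", timeslice(100));	/* 200 ms */
	printf("nice   0: %3d ms\n", timeslice(120));	/* 100 ms */
	printf("nice +19: %3d ms\n", timeslice(139));	/*   5 ms */
	printf("bonus, long sleeper: %+d\n", bonus(max_sleep_avg));	/* +5 */
	printf("bonus, CPU hog:      %+d\n", bonus(0));			/* -5 */
	return 0;
}

With prio_bonus_ratio=25 the dynamic priority moves at most 5 levels either
way around static_prio, which is why a nice +19 task can never climb past a
nice 0 hog, exactly as the proc.txt blurb claims.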
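
Finally, a quick check that an HZ=1000 kernel still looks like HZ=100 to
userspace, which is what all the jiffies_to_clock_t() conversions are for.
Plain POSIX, nothing patch-specific:

/* userhz.c: sysconf(_SC_CLK_TCK) and times() should both still be in
 * hundredths of a second, no matter what CONFIG_HZ was set to.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/times.h>

int main(void)
{
	struct tms t;
	clock_t before, after;
	volatile unsigned long i, x = 0;

	printf("_SC_CLK_TCK = %ld\n", sysconf(_SC_CLK_TCK));	/* expect 100 */

	before = times(&t);
	for (i = 0; i < 200000000UL; i++)	/* burn some CPU */
		x += i;
	after = times(&t);

	/* both values are in clock ticks, i.e. hundredths of a second */
	printf("elapsed %ld ticks, utime %ld ticks\n",
	       (long)(after - before), (long)t.tms_utime);
	return 0;
}

If this reports 100 ticks/sec but the elapsed and utime counts come out
about ten times too big, one of the conversions above got missed.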