+ smpnice-dont-consider-sched-groups-which-are-lightly-loaded-for-balancing.patch added to -mm tree

The patch titled

     smpnice: don't consider sched groups which are lightly loaded for balancing

has been added to the -mm tree.  Its filename is

     smpnice-dont-consider-sched-groups-which-are-lightly-loaded-for-balancing.patch

See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
out what to do about this


From: "Siddha, Suresh B" <suresh.b.siddha@xxxxxxxxx>

With smpnice, sched groups containing the highest priority tasks can mask the
imbalance between the other sched groups within the same domain.  This
patch fixes some of the scenarios listed below by not considering lightly
loaded sched groups for balancing.

a) On a simple 4-way MP system with one high priority and 4 normal
   priority tasks, with smpnice we would like to see the high priority
   task scheduled on one cpu, two other cpus getting one normal task
   each, and the fourth cpu getting the remaining two normal tasks.  But
   with the current smpnice code the extra normal priority task keeps
   jumping from one cpu running a normal priority task to another.  This
   is because of the busiest_has_loaded_cpus/nr_loaded_cpus logic: the
   cpu with the high priority task is not included in the max_load
   calculation but is included in the total and avg_load calculations,
   leading to max_load < avg_load.  The load balance between the cpus
   running normal priority tasks (2 vs 1) therefore always shows an
   imbalance, and the extra normal priority task keeps moving from one
   cpu running a normal priority task to another.

b) On a 4-way system with HT (8 logical processors): in package P0,
   thread T0 has the highest priority task and T1 is idle; in package
   P1, both T0 and T1 have one normal priority task each; P2 and P3 are
   idle.  With this patch, one of the normal priority tasks on P1 will
   be moved to P2 or P3.

c) With the current weighted smpnice calculations, it doesn't always
   make sense to look at the highest weighted runqueue in the busy
   group.  Consider a load balance scenario on a DP system with HT:
   package 0 runs one high priority and one low priority task, and
   package 1 runs one low priority task (with the other thread idle).
   Package 1 thinks it needs to take the low priority thread from
   package 0, but find_busiest_queue() returns the cpu thread running
   the highest priority task, and ultimately (with the help of active
   load balance) we move the high priority task to package 1.  The same
   then repeats on package 0, moving the high priority task from
   package 1 back to package 0.  Even without active load balance, load
   balance fails to balance this scenario.  Fix find_busiest_queue() to
   use "imbalance" when the runqueue is lightly loaded (see the sketch
   after this list).
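
To make the new checks concrete, here is a small standalone sketch (plain
user-space C, not the kernel code; the SCHED_LOAD_SCALE value and the sample
loads are illustrative assumptions).  It shows the two filters the patch
introduces: find_busiest_group() now only treats a group as a max_load
candidate when it runs more tasks than its normalized cpu_power, and
find_busiest_queue() skips a runqueue whose single task is heavier than the
imbalance we want to move.

#include <stdio.h>

#define SCHED_LOAD_SCALE 128UL	/* illustrative value; one nice-0 task */

struct group_sample { unsigned long cpu_power, sum_nr_running; };
struct rq_sample    { unsigned long nr_running, raw_weighted_load; };

/* New find_busiest_group() condition: a group is only a max_load
 * candidate if it runs more tasks than its normalized cpu_power. */
static int group_is_busiest_candidate(const struct group_sample *g)
{
	return g->sum_nr_running > g->cpu_power / SCHED_LOAD_SCALE;
}

/* New find_busiest_queue() filter: a runqueue whose single task is
 * heavier than the imbalance we want to move is skipped. */
static int rq_is_busiest_candidate(const struct rq_sample *rq,
				   unsigned long imbalance)
{
	return !(rq->nr_running == 1 && rq->raw_weighted_load > imbalance);
}

int main(void)
{
	/* Scenario (a): the cpu running only the high priority task has
	 * sum_nr_running == 1, so it is no longer picked as busiest. */
	struct group_sample high_prio_cpu = { SCHED_LOAD_SCALE, 1 };
	struct group_sample two_task_cpu  = { SCHED_LOAD_SCALE, 2 };

	printf("high prio cpu considered: %d\n",
	       group_is_busiest_candidate(&high_prio_cpu));	/* 0 */
	printf("two-task cpu considered:  %d\n",
	       group_is_busiest_candidate(&two_task_cpu));	/* 1 */

	/* Scenario (c): with an imbalance of one nice-0 task, a runqueue
	 * holding a single task three times that weight is not chosen. */
	struct rq_sample heavy_rq = { 1, 3 * SCHED_LOAD_SCALE };

	printf("heavy single-task rq considered: %d\n",
	       rq_is_busiest_candidate(&heavy_rq, SCHED_LOAD_SCALE));	/* 0 */
	return 0;
}

Because single-task groups are filtered out this way, max_load can end up
below avg_load; that is why the patch also routes the max_load < avg_load
case to the small_imbalance path in find_busiest_group().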

Signed-off-by: Suresh Siddha <suresh.b.siddha@xxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxx>
Acked-by: Peter Williams <pwil3058@xxxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxx>
---

 kernel/sched.c |   41 ++++++++++++++++++++---------------------
 1 files changed, 20 insertions(+), 21 deletions(-)

diff -puN kernel/sched.c~smpnice-dont-consider-sched-groups-which-are-lightly-loaded-for-balancing kernel/sched.c
--- devel/kernel/sched.c~smpnice-dont-consider-sched-groups-which-are-lightly-loaded-for-balancing	2006-04-20 16:48:07.000000000 -0700
+++ devel-akpm/kernel/sched.c	2006-04-20 16:48:07.000000000 -0700
@@ -2096,7 +2096,6 @@ find_busiest_group(struct sched_domain *
 	unsigned long max_pull;
 	unsigned long busiest_load_per_task, busiest_nr_running;
 	unsigned long this_load_per_task, this_nr_running;
-	unsigned int busiest_has_loaded_cpus = idle == NEWLY_IDLE;
 	int load_idx;
 
 	max_load = this_load = total_load = total_pwr = 0;
@@ -2151,15 +2150,8 @@ find_busiest_group(struct sched_domain *
 			this = group;
 			this_nr_running = sum_nr_running;
 			this_load_per_task = sum_weighted_load;
-		} else if (nr_loaded_cpus) {
-			if (avg_load > max_load || !busiest_has_loaded_cpus) {
-				max_load = avg_load;
-				busiest = group;
-				busiest_nr_running = sum_nr_running;
-				busiest_load_per_task = sum_weighted_load;
-				busiest_has_loaded_cpus = 1;
-			}
-		} else if (!busiest_has_loaded_cpus && avg_load > max_load) {
+		} else if (avg_load > max_load &&
+			   sum_nr_running > group->cpu_power / SCHED_LOAD_SCALE) {
 			max_load = avg_load;
 			busiest = group;
 			busiest_nr_running = sum_nr_running;
@@ -2192,6 +2184,16 @@ find_busiest_group(struct sched_domain *
 	if (max_load <= busiest_load_per_task)
 		goto out_balanced;
 
+	/*
+	 * In the presence of smp nice balancing, certain scenarios can have
+	 * max load less than avg load(as we skip the groups at or below
+	 * its cpu_power, while calculating max_load..)
+	 */
+	if (max_load < avg_load) {
+		*imbalance = 0;
+		goto small_imbalance;
+	}
+
 	/* Don't want to pull so many tasks that a group would go idle */
 	max_pull = min(max_load - avg_load, max_load - busiest_load_per_task);
 
@@ -2207,6 +2209,7 @@ find_busiest_group(struct sched_domain *
 	 * moved
 	 */
 	if (*imbalance < busiest_load_per_task) {
+small_imbalance:
 		unsigned long pwr_now = 0, pwr_move = 0;
 		unsigned long tmp;
 		unsigned int imbn = 2;
@@ -2269,23 +2272,19 @@ out_balanced:
  * find_busiest_queue - find the busiest runqueue among the cpus in group.
  */
 static runqueue_t *find_busiest_queue(struct sched_group *group,
-	enum idle_type idle)
+	enum idle_type idle, unsigned long imbalance)
 {
 	unsigned long max_load = 0;
 	runqueue_t *busiest = NULL, *rqi;
-	unsigned int busiest_is_loaded = idle == NEWLY_IDLE;
 	int i;
 
 	for_each_cpu_mask(i, group->cpumask) {
 		rqi = cpu_rq(i);
 
-		if (rqi->nr_running > 1) {
-			if (rqi->raw_weighted_load > max_load || !busiest_is_loaded) {
-				max_load = rqi->raw_weighted_load;
-				busiest = rqi;
-				busiest_is_loaded = 1;
-			}
-		} else if (!busiest_is_loaded && rqi->raw_weighted_load > max_load) {
+		if (rqi->nr_running == 1 && rqi->raw_weighted_load > imbalance)
+			continue;
+
+		if (rqi->raw_weighted_load > max_load) {
 			max_load = rqi->raw_weighted_load;
 			busiest = rqi;
 		}
@@ -2328,7 +2327,7 @@ static int load_balance(int this_cpu, ru
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, idle);
+	busiest = find_busiest_queue(group, idle, imbalance);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
@@ -2452,7 +2451,7 @@ static int load_balance_newidle(int this
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, NEWLY_IDLE);
+	busiest = find_busiest_queue(group, NEWLY_IDLE, imbalance);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]);
 		goto out_balanced;
_

Patches currently in -mm which might be from suresh.b.siddha@xxxxxxxxx are

sched-implement-smpnice.patch
sched-prevent-high-load-weight-tasks-suppressing-balancing.patch
sched-improve-stability-of-smpnice-load-balancing.patch
sched-improve-smpnice-load-balancing-when-load-per-task.patch
smpnice-dont-consider-sched-groups-which-are-lightly-loaded-for-balancing.patch
sched-modify-move_tasks-to-improve-load-balancing-outcomes.patch
sched_domain-handle-kmalloc-failure.patch
sched_domain-handle-kmalloc-failure-fix.patch
sched_domain-dont-use-gfp_atomic.patch
sched_domain-use-kmalloc_node.patch
sched_domain-allocate-sched_group-structures-dynamically.patch

