Track strength of convergence, which is a value between 1 and 1024.
This will be used by the placement logic later on.

A strength value of 1024 means that the workload has fully
converged: all faults after the last scan period came from a
single node.

A value of 1024/nr_nodes means a totally spread out working set.

'max_faults' is the number of faults observed on the highest-faulting node.
'sum_faults' is the sum of all faults from the last scan, averaged
over ~8 periods.

The goal of the scheduler is to maximize convergence system-wide.
Once a task has converged, it carries with it a non-trivial amount
of working set. If such a task is migrated to another node later
on then its working set will migrate there as well, which is a
non-trivial cost.

So the ultimate goal of NUMA scheduling is to let as many tasks
converge as possible, and to run them as close to their memory
as possible.

( Note: we could also sample migration activities to directly
  measure how much convergence influx there is. )

Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  2 ++
 kernel/sched/fair.c   | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8eeb866..5b2cf2e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1509,6 +1509,8 @@ struct task_struct {
 	unsigned long numa_scan_ts_secs;
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp */
+	unsigned long convergence_strength;
+	int convergence_node;
 	unsigned long *numa_faults;
 	unsigned long *numa_faults_curr;
 	struct callback_head numa_scan_work;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0fac735..26a2ede 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1555,6 +1555,8 @@ static void __sched_fork(struct task_struct *p)

 	p->numa_shared = -1;
 	p->node_stamp = 0ULL;
+	p->convergence_strength = 0;
+	p->convergence_node = -1;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_faults = NULL;
 	p->numa_scan_period = sysctl_sched_numa_scan_delay;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7af89b7..1f6104a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1934,6 +1934,50 @@ clear_buddy:
 }

 /*
+ * Update the p->convergence_strength info, which is a value between 1 and 1024.
+ *
+ * A strength value of 1024 means that the workload has fully
+ * converged: all faults after the last scan period came from a
+ * single node.
+ *
+ * A value of 1024/nr_nodes means a totally spread out working set.
+ *
+ * 'max_faults' is the number of faults observed on the highest-faulting node.
+ * 'sum_faults' is the sum of all faults from the last scan, averaged over ~8 periods.
+ *
+ * The goal of the scheduler is to maximize convergence system-wide.
+ * Once a task has converged, it carries with it a non-trivial amount
+ * of working set. If such a task is migrated to another node later
+ * on then its working set will migrate there as well, which is a
+ * non-trivial cost.
+ *
+ * So the ultimate goal of NUMA scheduling is to let as many tasks
+ * converge as possible, and to run them as close to their memory
+ * as possible.
+ *
+ * ( Note: we could also sample migration activities to directly measure
+ *   how much convergence influx there is. )
+ */
+static void
+shared_fault_calc_convergence(struct task_struct *p, int max_node,
+			      unsigned long max_faults, unsigned long sum_faults)
+{
+	/*
+	 * If sum_faults is 0 then leave the convergence alone:
+	 */
+	if (sum_faults) {
+		p->convergence_strength = 1024L * max_faults / sum_faults;
+
+		if (p->convergence_strength >= 921) {
+			WARN_ON_ONCE(max_node == -1);
+			p->convergence_node = max_node;
+		} else {
+			p->convergence_node = -1;
+		}
+	}
+}
+
+/*
  * Called every couple of hundred milliseconds in the task's
  * execution life-time, this function decides whether to
  * change placement parameters:
@@ -1974,6 +2018,8 @@ static void task_numa_placement_tick(struct task_struct *p)
 		}
 	}

+	shared_fault_calc_convergence(p, ideal_node, max_faults, total[0] + total[1]);
+
 	shared_fault_full_scan_done(p);

 	/*
--
1.7.11.7
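
( For reference, here is a standalone userspace sketch of the strength
  computation above, run on a made-up 4-node fault profile. It is not
  part of the patch: the node count, fault numbers and helper names are
  hypothetical. The 921 cutoff corresponds to ~90% of the 1024 maximum. )

#include <stdio.h>

#define NR_NODES		4
#define CONVERGENCE_MAX		1024L
#define CONVERGENCE_THRESHOLD	921L	/* ~90% of CONVERGENCE_MAX */

static void report(const char *name, const unsigned long faults[])
{
	unsigned long max_faults = 0, sum_faults = 0;
	long strength;
	int node;

	/* Find the highest-faulting node and the total fault count: */
	for (node = 0; node < NR_NODES; node++) {
		sum_faults += faults[node];
		if (faults[node] > max_faults)
			max_faults = faults[node];
	}

	/* Same computation as shared_fault_calc_convergence() above: */
	strength = CONVERGENCE_MAX * max_faults / sum_faults;

	printf("%-10s strength=%4ld converged=%s\n", name, strength,
	       strength >= CONVERGENCE_THRESHOLD ? "yes" : "no");
}

int main(void)
{
	/* Hypothetical per-node fault counts from one scan period: */
	const unsigned long spread[NR_NODES]    = { 250, 250, 250, 250 };
	const unsigned long converged[NR_NODES] = { 950,  20,  20,  10 };

	report("spread",    spread);	/* 1024*250/1000 = 256 == 1024/nr_nodes */
	report("converged", converged);	/* 1024*950/1000 = 972 >= 921 */

	return 0;
}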