On Mon, Jan 20, 2014 at 02:21:06PM -0500, riel@xxxxxxxxxx wrote:
> From: Rik van Riel <riel@xxxxxxxxxx>
> 
> The tracepoint has made it abundantly clear that the naive
> implementation of the faults_from code has issues.
> 
> Specifically, the garbage collector in some workloads will
> access orders of magnitude more memory than the threads
> that do all the active work. This resulted in the node with
> the garbage collector being marked the only active node in
> the group.
> 

Maybe I should have read this patch before getting into a twist about
the earlier patches in the series and the treatment of active_mask :(.
On the plus side, even without reading the code I can still see the
motivation for this paragraph.

> This issue is avoided if we weigh the statistics by CPU use
> of each task in the numa group, instead of by how many faults
> each thread has incurred.
> 

Bah, yes. Because in my earlier review I was worried about the faults
being missed. If the fault stats are scaled by the CPU statistics then
it is a very rough proxy measure for how heavily a particular node is
being referenced by a process.

> To achieve this, we normalize the number of faults to the
> fraction of faults that occurred on each node, and then
> multiply that fraction by the fraction of CPU time the
> task has used since the last time task_numa_placement was
> invoked.
> 
> This way the nodes in the active node mask will be the ones
> where the tasks from the numa group are most actively running,
> and the influence of e.g. the garbage collector and other
> do-little threads is properly minimized.
> 
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Chegu Vinod <chegu_vinod@xxxxxx>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxx>
> ---
>  kernel/sched/fair.c | 21 +++++++++++++++++++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ea873b6..203877d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1426,6 +1426,8 @@ static void task_numa_placement(struct task_struct *p)
>  	int seq, nid, max_nid = -1, max_group_nid = -1;
>  	unsigned long max_faults = 0, max_group_faults = 0;
>  	unsigned long fault_types[2] = { 0, 0 };
> +	unsigned long total_faults;
> +	u64 runtime, period;
>  	spinlock_t *group_lock = NULL;
> 
>  	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
> @@ -1434,6 +1436,11 @@ static void task_numa_placement(struct task_struct *p)
>  	p->numa_scan_seq = seq;
>  	p->numa_scan_period_max = task_scan_max(p);
> 
> +	total_faults = p->numa_faults_locality[0] +
> +		       p->numa_faults_locality[1] + 1;

Depending on how you reacted to the review of other patches this may or
may not have a helper now.

> +	runtime = p->se.avg.runnable_avg_sum;
> +	period = p->se.avg.runnable_avg_period;
> +

Ok, IIRC these stats are a decaying average based on recent history, so
heavy activity followed by long periods of idle will not skew the
stats.
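As an aside, the weighting the changelog describes amounts to
multiplying two fractions while staying entirely in integer
arithmetic. A standalone sketch of what the next hunk computes
(hypothetical names, not the kernel code, and assuming 16384 really is
just a fixed-point scaling factor):

	#include <stdio.h>

	#define FP_SCALE 16384UL	/* scaling factor taken from the patch */

	/*
	 * Hypothetical illustration: scale the faults seen on one node
	 * by the fraction of CPU time the task has recently used.
	 * Approximates FP_SCALE * (node_faults / total_faults) *
	 * (runtime / period) without intermediate divisions; the "+ 1"
	 * guards against dividing by zero, as in the patch.
	 */
	static unsigned long weigh_faults(unsigned long node_faults,
					  unsigned long total_faults,
					  unsigned long runtime,
					  unsigned long period)
	{
		return (FP_SCALE * runtime * node_faults) /
		       (total_faults * period + 1);
	}

	int main(void)
	{
		/* A worker with 800 of 1000 faults on a node, at 90% CPU use. */
		printf("worker: %lu\n", weigh_faults(800, 1000, 90, 100));

		/* A GC thread with 8000 of 10000 faults, but only 5% CPU use. */
		printf("gc:     %lu\n", weigh_faults(8000, 10000, 5, 100));
		return 0;
	}

This prints roughly 11796 for the worker and 655 for the GC thread, so
the GC's much larger fault count is dwarfed once its small CPU share is
factored in, which matches the changelog's stated intent.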
>  	/* If the task is part of a group prevent parallel updates to group stats */
>  	if (p->numa_group) {
>  		group_lock = &p->numa_group->lock;
> @@ -1446,7 +1453,7 @@ static void task_numa_placement(struct task_struct *p)
>  		int priv, i;
> 
>  		for (priv = 0; priv < 2; priv++) {
> -			long diff, f_diff;
> +			long diff, f_diff, f_weight;
> 
>  			i = task_faults_idx(nid, priv);
>  			diff = -p->numa_faults[i];
> @@ -1458,8 +1465,18 @@ static void task_numa_placement(struct task_struct *p)
>  			fault_types[priv] += p->numa_faults_buffer[i];
>  			p->numa_faults_buffer[i] = 0;
> 
> +			/*
> +			 * Normalize the faults_from, so all tasks in a group
> +			 * count according to CPU use, instead of by the raw
> +			 * number of faults. Tasks with little runtime have
> +			 * little over-all impact on throughput, and thus their
> +			 * faults are less important.
> +			 */
> +			f_weight = (16384 * runtime *
> +				    p->numa_faults_from_buffer[i]) /
> +				   (total_faults * period + 1);

Why 16384? It looks like a scaling factor to deal with integer
approximations but I'm not 100% sure and I do not see how you arrived
at that value.

>  			p->numa_faults_from[i] >>= 1;
> -			p->numa_faults_from[i] += p->numa_faults_from_buffer[i];
> +			p->numa_faults_from[i] += f_weight;
>  			p->numa_faults_from_buffer[i] = 0;
> 

numa_faults_from needs a big comment making it clear that it no longer
holds the raw number of faults. It's the sum of the faults measured by
the group, weighted by each task's CPU use.

>  			faults += p->numa_faults[i];
> -- 
> 1.8.4.2
> 

-- 
Mel Gorman
SUSE Labs