On Wed, Jul 03, 2019 at 11:34:16AM +0800, 王贇 wrote:
> Although we have paid much effort to settle tasks on a particular
> node, there are still chances for a task to leave its preferred
> node, namely on wakeups, NUMA swap migrations or load balancing.
>
> When we use the cpu cgroup in a shared way, since all the workloads
> see all the cpus, this can turn out really badly, especially when
> there are many fast wakeups. Although we can now group tasks into
> NUMA groups, they won't truly stay on the same node. For example,
> with NUMA groups ng_A, ng_B, ng_C and ng_D, the result is very
> likely to be:
>
> CPU Usage:
>     Node 0          Node 1
>     ng_A(600%)      ng_A(400%)
>     ng_B(400%)      ng_B(600%)
>     ng_C(400%)      ng_C(600%)
>     ng_D(600%)      ng_D(400%)
>
> Memory Ratio:
>     Node 0          Node 1
>     ng_A(60%)       ng_A(40%)
>     ng_B(40%)       ng_B(60%)
>     ng_C(40%)       ng_C(60%)
>     ng_D(60%)       ng_D(40%)
>
> Locality won't be too bad, but it is far from the best situation; we
> want a NUMA group to settle down thoroughly on one particular node,
> with everything balanced.
>
> Thus we introduce NUMA cling, which tries to prevent tasks from
> leaving their preferred node on the wakeup fast path.

> @@ -6195,6 +6447,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	if ((unsigned)i < nr_cpumask_bits)
>  		return i;
>  
> +	/*
> +	 * Failed to find an idle cpu; wake affine may want to pull, but
> +	 * try to stay on prev-cpu when the task clings to it.
> +	 */
> +	if (task_numa_cling(p, cpu_to_node(prev), cpu_to_node(target)))
> +		return prev;
> +
>  	return target;
>  }

select_idle_sibling() should never cross node boundaries and is thus
entirely the wrong place to fix anything.
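
For readers following only this hunk: task_numa_cling() is introduced
earlier in the series and its body is not shown above. A minimal sketch
of the check the changelog implies, assuming it reduces to "prev is on
my preferred node, the proposed target is not" and builds on the task's
existing numa_preferred_nid field (the exact condition here is an
illustrative assumption, not the actual patch):

#include <linux/sched.h>	/* struct task_struct, numa_preferred_nid */
#include <linux/numa.h>		/* NUMA_NO_NODE */

/*
 * Illustrative sketch only -- the real task_numa_cling() lives
 * elsewhere in this series. Assumed semantics: cling to prev's node
 * when it is the task's preferred NUMA node and the wakeup target
 * would move the task off that node.
 */
static bool task_numa_cling(struct task_struct *p, int prev_nid, int target_nid)
{
	/* No NUMA preference recorded yet: nothing to cling to. */
	if (p->numa_preferred_nid == NUMA_NO_NODE)
		return false;

	/* Veto the pull only when it would leave the preferred node. */
	return prev_nid == p->numa_preferred_nid &&
	       target_nid != p->numa_preferred_nid;
}

With a test of that shape, the hunk above makes select_idle_sibling()
hand back prev instead of target whenever the wake-affine pull would
drag the task off its preferred node, which is exactly the cross-node
decision the review objects to making in this function.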