The NUMA Balancing feature always tries to move a task's pages to the
node where the task executes most, but it still has limitations:

* page cache can't be handled
* no cgroup-level balancing

Suppose we have a box with 4 CPUs and two cgroups A & B, each running
4 tasks. The following scenario can easily be observed:

  NODE0                   |       NODE1
                          |
  CPU0            CPU1    |       CPU2            CPU3
  task_A0         task_A1 |       task_A2         task_A3
  task_B0         task_B1 |       task_B2         task_B3

and memory consumption is usually about equal on each node when the
tasks behave similarly.

In this case NUMA balancing tries to move the pages of task_A0,1 and
task_B0,1 to node 0, and the pages of task_A2,3 and task_B2,3 to node 1,
but the page cache ends up located randomly, depending on which CPU did
the first read/write.

Now consider another scenario:

  NODE0                   |       NODE1
                          |
  CPU0            CPU1    |       CPU2            CPU3
  task_A0         task_A1 |       task_B0         task_B1
  task_A2         task_A3 |       task_B2         task_B3

By swapping the CPU & memory resources of task_A0,1 and task_B0,1, the
workloads of cgroup A are now all on node 0 and those of cgroup B all on
node 1. Resource consumption is the same, but related tasks share a
closer CPU cache, while the page cache is still randomly located.

Now, what if the workloads generate lots of page cache and most of the
memory accesses are page cache writes? A page cache page generated by
task_A0 on NODE1 won't follow the task to NODE0, but if task_A0 had
already been on NODE0 before it read/wrote the files, the cache pages
would be there. So how do we make sure that happens?

Usually we can solve this problem by binding the workload to a single
node: if cgroup A is bound to CPU0,1, all the page cache it generates
will be on NODE0, and the NUMA bonus will be at its maximum (a rough
userspace sketch of such a binding is appended after the diffstat).
However, this requires very careful administration of the specific
workloads; if, in our case, A & B have CPU requirements that vary
anywhere from 0% to 400%, binding each to a single node would be a bad
idea.

So what we need is a way to detect memory topology at the cgroup level,
and to migrate CPU/memory resources to the node holding most of a
cgroup's page cache, as long as resources on that node are plentiful.

This patch set introduces:

* advanced per-cgroup numa statistics
* a numa preferred node feature
* a Numa Balancer module

which together allow easy and flexible NUMA resource assignment, to
gain as much NUMA bonus as possible.

Michael Wang (5):
  numa: introduce per-cgroup numa balancing locality statistic
  numa: append per-node execution info in memory.numa_stat
  numa: introduce per-cgroup preferred numa node
  numa: introduce numa balancer infrastructure
  numa: numa balancer

 drivers/Makefile             |   1 +
 drivers/numa/Makefile        |   1 +
 drivers/numa/numa_balancer.c | 715 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/memcontrol.h   |  99 ++++++
 include/linux/sched.h        |   9 +-
 kernel/sched/debug.c         |   8 +
 kernel/sched/fair.c          |  41 +++
 mm/huge_memory.c             |   7 +-
 mm/memcontrol.c              | 246 +++++++++++++++
 mm/memory.c                  |   9 +-
 mm/mempolicy.c               |   4 +
 11 files changed, 1133 insertions(+), 7 deletions(-)
 create mode 100644 drivers/numa/Makefile
 create mode 100644 drivers/numa/numa_balancer.c

--
2.14.4.44.g2045bb6
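
For illustration only, here is a minimal userspace sketch of the manual
binding workaround mentioned above, using the existing cgroup-v1 cpuset
interface. The mount point /sys/fs/cgroup/cpuset and the group name "A"
are assumptions, and nothing below is part of this series:

/*
 * Sketch of the static workaround: pin cpuset group "A" to NODE0
 * (CPU0,1 and memory node 0), so the page cache it generates stays
 * on node 0. Paths and group name are assumptions for illustration.
 */
#include <stdio.h>
#include <stdlib.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	if (fputs(val, f) == EOF) {
		perror(path);
		fclose(f);
		return -1;
	}
	/* errors may only surface when the value is flushed on close */
	return fclose(f);
}

int main(void)
{
	/* restrict cgroup A to the CPUs of node 0 */
	if (write_str("/sys/fs/cgroup/cpuset/A/cpuset.cpus", "0-1"))
		return EXIT_FAILURE;

	/* and to the memory of node 0, so its page cache lands there */
	if (write_str("/sys/fs/cgroup/cpuset/A/cpuset.mems", "0"))
		return EXIT_FAILURE;

	return EXIT_SUCCESS;
}

The balancer introduced by this series is meant to make this kind of
static binding unnecessary, by picking a preferred node per cgroup at
runtime based on the per-cgroup numa statistics.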