Re: [RFC PATCH 0/5] NUMA Balancer Suite

禹舟键 <ufo19890607@xxxxxxxxx> · Mon, 22 Apr 2019 22:34:32 +0800

Hi Michael,

I really want to know how you resolve the conflict between the NUMA balancer and the load balancer. You may gain a NUMA bonus by migrating some tasks to the node that holds most of their cache, but that breaks CPU load balancing, so how do you handle it?

Thanks
Wind

王贇 <yun.wang@xxxxxxxxxxxxxxxxx> wrote on Mon, Apr 22, 2019 at 10:13 AM:
We have the NUMA Balancing feature, which always tries to move the pages
of a task to the node it executes on most, but it still has issues:

* page cache can't be handled
* no cgroup-level balancing

Suppose we have a box with 4 CPUs and two cgroups, A and B, each running 4 tasks.
The scenario below can easily be observed:

NODE0                   |       NODE1
                        |
CPU0            CPU1    |       CPU2            CPU3
task_A0         task_A1 |       task_A2         task_A3
task_B0         task_B1 |       task_B2         task_B3

Memory consumption is usually equal on each node when the tasks behave
similarly.

In this case NUMA balancing tries to move the pages of task_A0,1 and task_B0,1 to node 0,
and the pages of task_A2,3 and task_B2,3 to node 1, but page cache will be located randomly,
depending on the location of the CPU that first read or wrote it.

Now consider another scenario:

NODE0                   |       NODE1
                        |
CPU0            CPU1    |       CPU2            CPU3
task_A0         task_A1 |       task_B0         task_B1
task_A2         task_A3 |       task_B2         task_B3

By switching the CPU and memory resources of task_A2,3 and task_B0,1, the workloads
of cgroup A are now all on node 0 and those of cgroup B are all on node 1. Resource consumption
is the same, but related tasks can share a closer CPU cache, while page cache is still located randomly.
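
The difference between the two layouts can be quantified with a small sketch. The placement tables simply transcribe the two diagrams; the `locality` helper and its metric (the share of a cgroup's tasks on its most-used node) are illustrative assumptions, not something the patch set defines.

```python
from collections import Counter

# Task -> node placement, transcribed from the two diagrams.
SPREAD = {
    "task_A0": 0, "task_A1": 0, "task_A2": 1, "task_A3": 1,
    "task_B0": 0, "task_B1": 0, "task_B2": 1, "task_B3": 1,
}
GROUPED = {
    "task_A0": 0, "task_A1": 0, "task_A2": 0, "task_A3": 0,
    "task_B0": 1, "task_B1": 1, "task_B2": 1, "task_B3": 1,
}

def cgroup_of(task):
    """Extract the cgroup letter from a task name, e.g. "task_A0" -> "A"."""
    return task.split("_")[1][0]

def locality(placement):
    """Per cgroup, the fraction of its tasks on its most-used node (hypothetical metric)."""
    per_cgroup = {}
    for task, node in placement.items():
        per_cgroup.setdefault(cgroup_of(task), Counter())[node] += 1
    return {cg: max(c.values()) / sum(c.values()) for cg, c in per_cgroup.items()}

def tasks_per_node(placement):
    """Number of tasks placed on each node, as a plain dict."""
    return dict(Counter(placement.values()))

print("spread :", locality(SPREAD), tasks_per_node(SPREAD))
print("grouped:", locality(GROUPED), tasks_per_node(GROUPED))
```

Both layouts put four tasks on each node, so the CPU load is identical; only the per-cgroup locality differs (0.5 in the first layout, 1.0 in the second).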

Now what if the workloads generate lots of page cache, and most memory
accesses are page cache writes?

A page cache page generated by task_A0 on NODE1 won't follow it to NODE0, but if task_A0
was already on NODE0 before it read or wrote files, the cache will be there. So how do we
make sure this happens?
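
The first-touch behaviour described here can be illustrated with a toy model. This is only a sketch under stated assumptions: `PageCache`, `access`, `local_ratio`, and the file name are hypothetical, and the model encodes just the one rule from the text, that a cached page stays on the node of the CPU that first read or wrote it.

```python
# CPU -> node mapping for the 4-CPU, 2-node example box.
CPU_NODE = {0: 0, 1: 0, 2: 1, 3: 1}

class PageCache:
    """Toy model: a cached file page lives on the node that first touched it."""

    def __init__(self):
        self.page_node = {}  # (file, page index) -> node

    def access(self, cpu, file, index):
        """Touch one page; return True if the access is node-local for this CPU."""
        node = CPU_NODE[cpu]
        # First touch decides the location; later accesses never move the page.
        home = self.page_node.setdefault((file, index), node)
        return home == node

def local_ratio(cache, cpu, file, pages):
    """Fraction of `pages` accesses from `cpu` that hit node-local page cache."""
    hits = [cache.access(cpu, file, i) for i in range(pages)]
    return sum(hits) / pages

# Case 1: task_A0 writes the file on CPU2 (NODE1), then is moved to CPU0 (NODE0).
late = PageCache()
local_ratio(late, 2, "data.log", 100)
late_ratio = local_ratio(late, 0, "data.log", 100)

# Case 2: task_A0 is already on CPU0 (NODE0) before it touches the file.
early = PageCache()
local_ratio(early, 0, "data.log", 100)
early_ratio = local_ratio(early, 0, "data.log", 100)

print("moved after first touch :", late_ratio)
print("placed before first touch:", early_ratio)
```

In this model the task that is moved after its first write finds none of its cache locally, while the task placed beforehand finds all of it locally.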

Usually we can solve this problem by binding workloads to a single node: if
cgroup A is bound to CPU0,1, then all the cache it generates will be on NODE0, and
the NUMA bonus will be at its maximum.

However, this requires very careful administration of the specific workloads. Suppose in our
case A and B have a CPU requirement that changes between 0% and 400%; then binding to a
single node would be a bad idea.
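
A quick calculation shows why a static binding breaks down. The sketch assumes, as in the example box, 2 CPUs per node, so a cgroup bound to one node can use at most 200% CPU; the helper name `unmet_demand` is illustrative.

```python
CPUS_PER_NODE = 2  # the example box: 4 CPUs spread over 2 nodes

def unmet_demand(demand_pct, bound_nodes=1):
    """CPU demand (in percent of one CPU) that a node-bound cgroup cannot get."""
    capacity_pct = bound_nodes * CPUS_PER_NODE * 100
    return max(0, demand_pct - capacity_pct)

for demand in (0, 100, 200, 300, 400):
    print(f"demand {demand:3d}% -> unmet when bound to one node: {unmet_demand(demand)}%")
```

At peak demand, half of the requested CPU time is unavailable to a cgroup pinned to one node, even when the other node is idle.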

So what we need is a way to detect memory topology at the cgroup level, and to try to migrate
CPU/memory resources to the node that holds most of the cache, as long as that node
has plenty of resources.
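
One possible shape of that decision, as a minimal sketch: pick the node that holds most of a cgroup's pages, but only if that node still has enough spare CPU. The function name, its inputs, and the `min_free_cpu_pct` threshold are all assumptions for illustration; the actual policy lives in the patches and is not reproduced here.

```python
def pick_preferred_node(pages_per_node, free_cpu_pct, min_free_cpu_pct=100):
    """Return the node holding most pages if it has enough free CPU, else None.

    pages_per_node: {node: pages of this cgroup resident on the node}
    free_cpu_pct:   {node: idle CPU on the node, in percent of one CPU}
    min_free_cpu_pct: hypothetical threshold for "plenty of resources"
    """
    if not pages_per_node:
        return None
    best = max(pages_per_node, key=pages_per_node.get)
    if free_cpu_pct.get(best, 0) >= min_free_cpu_pct:
        return best
    # The node with most of the cache is busy: do not set a preference,
    # and leave placement to the regular load balancer.
    return None

# Most pages on node 0 and node 0 has spare CPU: prefer node 0.
print(pick_preferred_node({0: 900, 1: 100}, {0: 150, 1: 200}))
# Most pages on node 0 but node 0 is nearly saturated: no preference.
print(pick_preferred_node({0: 900, 1: 100}, {0: 50, 1: 200}))
```

Returning no preference when the favoured node is busy keeps this sketch from fighting CPU load balancing; whether the patch set makes the same trade-off is not stated in this message.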

This patch set introduces:

  * advanced per-cgroup NUMA statistics
  * NUMA preferred node feature
  * NUMA Balancer module

which help to achieve easy and flexible NUMA resource assignment and to gain as much
NUMA bonus as possible.

Michael Wang (5):
  numa: introduce per-cgroup numa balancing locality statistic
  numa: append per-node execution info in memory.numa_stat
  numa: introduce per-cgroup preferred numa node
  numa: introduce numa balancer infrastructure
  numa: numa balancer

 drivers/Makefile             |   1 +
 drivers/numa/Makefile        |   1 +
 drivers/numa/numa_balancer.c | 715 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/memcontrol.h   |  99 ++++++
 include/linux/sched.h        |   9 +-
 kernel/sched/debug.c         |   8 +
 kernel/sched/fair.c          |  41 +++
 mm/huge_memory.c             |   7 +-
 mm/memcontrol.c              | 246 +++++++++++++++
 mm/memory.c                  |   9 +-
 mm/mempolicy.c               |   4 +
 11 files changed, 1133 insertions(+), 7 deletions(-)
 create mode 100644 drivers/numa/Makefile
 create mode 100644 drivers/numa/numa_balancer.c

-- 
2.14.4.44.g2045bb6