On Tue, 2013-07-30 at 13:18 +0530, Srikar Dronamraju wrote:
> Here is an approach that looks to consolidate workloads across nodes.
> This results in much improved performance. Again I would assume this work
> is complementary to Mel's work with numa faulting.
>
> Here are the advantages of this approach.
> 1. Provides excellent consolidation of tasks.
>    From my experiments, I have found that the better the task
>    consolidation, the better the memory layout, which results in
>    better performance.
>
> 2. Provides good improvement in most cases, but there are some regressions.
>
> 3. Looks to extend the load balancer, especially when the cpus are idling.
>
> Here is the outline of the approach.
>
> - Every process has a per-node array where we store the weight of all
>   its tasks running on that node. This array gets updated on task
>   enqueue/dequeue.
>
> - Added a 2-pass mechanism (somewhat taken from numacore but not
>   exactly) while choosing tasks to move across nodes.
>
>   In the first pass, choose only tasks that are ideal to be moved.
>   While choosing a task, look at the per-node process arrays to see if
>   moving the task helps.
>   If the first pass fails to move a task, any task can be chosen on the
>   second pass.
>
> - If the regular load balancer (rebalance_domain()) fails to balance the
>   load (or finds no imbalance) and there is an idle cpu, use it to
>   consolidate tasks onto nodes by using the information in the per-node
>   process arrays.
>
>   Every idle cpu, if it doesn't have tasks queued after load balance,
>   - will walk through the cpus in its node and check if there are buddy
>     tasks that are not part of the node but ideally should have been
>     part of this node.
>   - To make sure that we don't pull all buddy tasks and create an
>     imbalance, we look at the load on the node, pinned tasks and the
>     process's contribution to the load for this node.
>   - Each cpu looks at the node which has the least number of buddy tasks
>     running and tries to pull tasks from such nodes.
>
> - Once it finds the cpu from which to pull the tasks, it triggers
>   active balancing. This type of active balancing does just one pass,
>   i.e. it only fetches tasks that increase numa locality.
>
> Here are results of specjbb run on a 2 node machine.

Here's a comparison with 4 KVM VMs running dbench on a 4-socket, 40-core,
80-thread host:

kernel                          total dbench throughput
3.9-numabal-on                                    21242
3.9-numabal-off                                   20455
3.9-numabal-on-consolidate                        22541
3.9-numabal-off-consolidate                       21632
3.9-numabal-off-node-pinning                      26450
3.9-numabal-on-node-pinning                       25265

Based on the node pinning results, we have a long way to go, with either
numa-balancing and/or consolidation.
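Stepping back to the mechanism itself: to check that I am reading the
outline right, here is a minimal user-space model of the per-node weight
bookkeeping and of the first-pass filter. The names and the exact
comparison below are my own guesses for illustration, not the actual
kernel/sched/fair.c changes in this patchset.

#include <stdio.h>
#include <stdbool.h>

#define MAX_NODES 4

struct numa_process {
	/* Sum of the load weights of this process's tasks queued on each node. */
	unsigned long node_weight[MAX_NODES];
};

struct numa_task {
	struct numa_process *proc;
	unsigned long load;	/* this task's load weight */
	int cur_node;		/* node whose runqueue it currently sits on */
};

/* Stand-ins for the enqueue/dequeue hooks: keep the per-node array in sync. */
static void account_enqueue(struct numa_task *t, int node)
{
	t->cur_node = node;
	t->proc->node_weight[node] += t->load;
}

static void account_dequeue(struct numa_task *t)
{
	t->proc->node_weight[t->cur_node] -= t->load;
}

/*
 * First-pass filter (my guess at the "ideal to be moved" test): move a
 * task only if the destination node already holds more of its process's
 * weight than would remain behind on the source node, so the move
 * improves consolidation.  The second pass (not shown) accepts any task.
 */
static bool move_improves_consolidation(const struct numa_task *t, int dst_node)
{
	const struct numa_process *p = t->proc;

	return p->node_weight[dst_node] > p->node_weight[t->cur_node] - t->load;
}

int main(void)
{
	struct numa_process p = { { 0 } };
	struct numa_task a = { &p, 1024, 0 };
	struct numa_task b = { &p, 1024, 0 };
	struct numa_task c = { &p, 1024, 0 };

	account_enqueue(&a, 0);		/* bulk of the process on node 0 ... */
	account_enqueue(&b, 0);
	account_enqueue(&c, 1);		/* ... with one straggler on node 1 */

	printf("pull c to node 0: %s\n",
	       move_improves_consolidation(&c, 0) ? "pass 1" : "pass 2 only");
	printf("pull a to node 1: %s\n",
	       move_improves_consolidation(&a, 1) ? "pass 1" : "pass 2 only");

	account_dequeue(&c);		/* the straggler migrates to node 0 */
	account_enqueue(&c, 0);
	printf("node weights after migration: node0=%lu node1=%lu\n",
	       p.node_weight[0], p.node_weight[1]);
	return 0;
}

If that reading is correct, the first pass only ever pulls stragglers
toward wherever most of the process's weight already sits, which matches
the CPU clumping I see in the VM stats below.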
One thing the consolidation does help with is actually getting the sibling
tasks running on the same node:

% CPU usage by node for 1st VM
node00  node01  node02  node03
  094%    002%    001%    001%

However, the node which was chosen to consolidate tasks is not the same
node where most of the memory for the tasks is located:

% memory per node for 1st VM
            host-node00   host-node01   host-node02   host-node03
            -----------   -----------   -----------   -----------
VM-node00  295937(034%)  550400(064%)    6144(000%)       0(000%)

By comparison, same stats for numa-balancing on and no consolidation:

% CPU usage by node for 1st VM
node00  node01  node02  node03
  028%    027%    020%    023%   <- CPU usage spread across whole system

% memory per node for 1st VM
            host-node00   host-node01   host-node02   host-node03
            -----------   -----------   -----------   -----------
VM-node00   49153(006%)  673792(083%)   51712(006%)   36352(004%)

I think the consolidation is a nice concept, but it needs a much tighter
integration with numa balancing. The action to clump tasks on the same
node's runqueues should be triggered by detecting that they also access
the same memory.
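To make that concrete, here is the sort of check I have in mind, again as
a purely illustrative user-space sketch: the fault bookkeeping, the helper
and the threshold are all hypothetical, not an existing kernel interface.

#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 4

struct numa_task {
	/* Recent NUMA hinting faults attributed to each node's memory. */
	unsigned long faults[MAX_NODES];
};

/*
 * Only pull the task onto dst_node if at least `pct` percent of its
 * recorded faults were against memory on dst_node, so that the CPU
 * clumping follows the memory instead of picking an arbitrary node.
 */
static bool should_consolidate(const struct numa_task *t, int dst_node,
			       unsigned int pct)
{
	unsigned long total = 0;
	int node;

	for (node = 0; node < MAX_NODES; node++)
		total += t->faults[node];

	if (!total)
		return false;	/* no fault data yet: leave the task alone */

	return t->faults[dst_node] * 100 >= total * pct;
}

int main(void)
{
	/* A vcpu thread whose memory, per its fault history, lives on node 1. */
	struct numa_task vcpu = { .faults = { 20, 300, 10, 5 } };

	printf("pull to node 0? %s\n",
	       should_consolidate(&vcpu, 0, 50) ? "yes" : "no");	/* no  */
	printf("pull to node 1? %s\n",
	       should_consolidate(&vcpu, 1, 50) ? "yes" : "no");	/* yes */
	return 0;
}

With a gate like that, the node chosen to clump the first VM's tasks would
be the one already holding ~64% of its memory, rather than whichever node
the idle-balance pass happened to pick.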
> Specjbb was run on 3 vms.
> In the fit case, one vm was sized to fit in one node.
> In the no-fit case, one vm was bigger than the node size.
>
> --------------------------------------------------------------------------------------
> |kernel        |              nofit            |               fit             |  vm  |
> |kernel        |     noksm     |      ksm      |     noksm     |      ksm      |  vm  |
> |kernel        |  nothp|    thp|  nothp|    thp|  nothp|    thp|  nothp|    thp|  vm  |
> --------------------------------------------------------------------------------------
> |v3.9          | 136056| 189423| 135359| 186722| 136983| 191669| 136728| 184253| vm_1 |
> |v3.9          |  66041|  84779|  64564|  86645|  67426|  84427|  63657|  85043| vm_2 |
> |v3.9          |  67322|  83301|  63731|  85394|  65015|  85156|  63838|  84199| vm_3 |
> --------------------------------------------------------------------------------------
> |v3.9 + Mel(v5)| 133170| 177883| 136385| 176716| 140650| 174535| 132811| 190120| vm_1 |
> |v3.9 + Mel(v5)|  65021|  81707|  62876|  81826|  63635|  84943|  58313|  78997| vm_2 |
> |v3.9 + Mel(v5)|  61915|  82198|  60106|  81723|  64222|  81123|  59559|  78299| vm_3 |
> |   % change   |  -2.12|  -6.09|   0.76|  -5.36|   2.68|  -8.94|  -2.86|   3.18| vm_1 |
> |   % change   |  -1.54|  -3.62|  -2.61|  -5.56|  -5.62|   0.61|  -8.39|  -7.11| vm_2 |
> |   % change   |  -8.03|  -1.32|  -5.69|  -4.30|  -1.22|  -4.74|  -6.70|  -7.01| vm_3 |
> --------------------------------------------------------------------------------------
> |v3.9 + this   | 136766| 189704| 148642| 180723| 147474| 184711| 139270| 186768| vm_1 |
> |v3.9 + this   |  72742|  86980|  67561|  91659|  69781|  87741|  65989|  83508| vm_2 |
> |v3.9 + this   |  66075|  90591|  66135|  90059|  67942|  87229|  66100|  85908| vm_3 |
> |   % change   |   0.52|   0.15|   9.81|  -3.21|   7.66|  -3.63|   1.86|   1.36| vm_1 |
> |   % change   |  10.15|   2.60|   4.64|   5.79|   3.49|   3.93|   3.66|  -1.80| vm_2 |
> |   % change   |  -1.85|   8.75|   3.77|   5.46|   4.50|   2.43|   3.54|   2.03| vm_3 |
> --------------------------------------------------------------------------------------
>
>
> Autonuma benchmark results on a 2 node machine:
>
> KernelVersion: 3.9.0
> Testcase:               Min      Max      Avg   StdDev
> numa01:              118.98   122.37   120.96     1.17
> numa01_THREAD_ALLOC: 279.84   284.49   282.53     1.65
> numa02:               36.84    37.68    37.09     0.31
> numa02_SMT:           44.67    48.39    47.32     1.38
>
> KernelVersion: 3.9.0 + Mel's v5
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              115.02   123.08   120.83     3.04    0.11%
> numa01_THREAD_ALLOC: 268.59   298.47   281.15    11.16    0.46%
> numa02:               36.31    37.34    36.68     0.43    1.10%
> numa02_SMT:           43.18    43.43    43.29     0.08    9.28%
>
> KernelVersion: 3.9.0 + this patchset
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              103.46   112.31   106.44     3.10   12.93%
> numa01_THREAD_ALLOC: 277.51   289.81   283.88     4.98   -0.47%
> numa02:               36.72    40.81    38.42     1.85   -3.26%
> numa02_SMT:           56.50    60.00    58.08     1.23  -17.93%
>
> KernelVersion: 3.9.0 (HT)
> Testcase:               Min      Max      Avg   StdDev
> numa01:              241.23   244.46   242.94     1.31
> numa01_THREAD_ALLOC: 301.95   307.39   305.04     2.20
> numa02:               41.31    43.92    42.98     1.02
> numa02_SMT:           37.02    37.58    37.44     0.21
>
> KernelVersion: 3.9.0 + Mel's v5 (HT)
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              238.42   242.62   241.60     1.60    0.55%
> numa01_THREAD_ALLOC: 285.01   298.23   291.54     5.37    4.53%
> numa02:               38.08    38.16    38.11     0.03   12.76%
> numa02_SMT:           36.20    36.64    36.36     0.17    2.95%
>
> KernelVersion: 3.9.0 + this patchset (HT)
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              175.17   189.61   181.90     5.26   32.19%
> numa01_THREAD_ALLOC: 285.79   365.26   305.27    30.35   -0.06%
> numa02:               38.26    38.97    38.50     0.25   11.50%
> numa02_SMT:           44.66    49.22    46.22     1.60  -17.84%
>
>
> Autonuma benchmark results on a 4 node machine:
>
> # dmidecode | grep 'Product Name:'
>     Product Name: System x3750 M4 -[8722C1A]-
>
> # numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
> node 0 size: 65468 MB
> node 0 free: 63890 MB
> node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
> node 1 size: 65536 MB
> node 1 free: 64033 MB
> node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
> node 2 size: 65536 MB
> node 2 free: 64236 MB
> node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
> node 3 size: 65536 MB
> node 3 free: 64162 MB
> node distances:
> node   0   1   2   3
>   0:  10  11  11  12
>   1:  11  10  12  11
>   2:  11  12  10  11
>   3:  12  11  11  10
>
> KernelVersion: 3.9.0
> Testcase:               Min      Max      Avg   StdDev
> numa01:              581.35   761.95   681.23    80.97
> numa01_THREAD_ALLOC: 140.39   164.45   150.34     7.98
> numa02:               18.47    20.12    19.25     0.65
> numa02_SMT:           16.40    25.30    21.06     2.86
>
> KernelVersion: 3.9.0 + Mel's v5 patchset
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              733.15   767.99   748.88    14.51   -8.81%
> numa01_THREAD_ALLOC: 154.18   169.13   160.48     5.76   -6.00%
> numa02:               19.09    22.15    21.02     1.03   -7.99%
> numa02_SMT:           23.01    25.53    23.98     0.83  -11.44%
>
> KernelVersion: 3.9.0 + this patchset
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              409.64   457.91   444.55    17.66   51.69%
> numa01_THREAD_ALLOC: 158.10   174.89   169.32     5.84  -10.85%
> numa02:               18.89    22.36    19.98     1.29   -3.26%
> numa02_SMT:           23.33    27.87    25.02     1.68  -14.21%
>
> KernelVersion: 3.9.0 (HT)
> Testcase:               Min      Max      Avg   StdDev
> numa01:              567.62   752.06   620.26    66.72
> numa01_THREAD_ALLOC: 145.84   172.44   160.73    10.34
> numa02:               18.11    20.06    19.10     0.67
> numa02_SMT:           17.59    22.83    19.94     2.17
>
> KernelVersion: 3.9.0 + Mel's v5 patchset (HT)
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              741.13   753.91   748.10     4.51  -16.96%
> numa01_THREAD_ALLOC: 153.57   162.45   158.22     3.18    1.55%
> numa02:               19.15    20.96    20.04     0.64   -4.48%
> numa02_SMT:           22.57    25.92    23.87     1.15  -15.16%
>
> KernelVersion: 3.9.0 + this patchset (HT)
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              418.46   457.77   436.00    12.81   40.25%
> numa01_THREAD_ALLOC: 156.21   169.79   163.75     4.37   -1.78%
> numa02:               18.41    20.18    19.06     0.60    0.20%
> numa02_SMT:           22.72    27.24    25.29     1.76  -19.64%
>
>
> Autonuma results on an 8 node machine:
>
> # dmidecode | grep 'Product Name:'
>     Product Name: IBM x3950-[88722RZ]-
>
> # numactl -H
> available: 8 nodes (0-7)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 32510 MB
> node 0 free: 31475 MB
> node 1 cpus: 8 9 10 11 12 13 14 15
> node 1 size: 32512 MB
> node 1 free: 31709 MB
> node 2 cpus: 16 17 18 19 20 21 22 23
> node 2 size: 32512 MB
> node 2 free: 31737 MB
> node 3 cpus: 24 25 26 27 28 29 30 31
> node 3 size: 32512 MB
> node 3 free: 31736 MB
> node 4 cpus: 32 33 34 35 36 37 38 39
> node 4 size: 32512 MB
> node 4 free: 31739 MB
> node 5 cpus: 40 41 42 43 44 45 46 47
> node 5 size: 32512 MB
> node 5 free: 31639 MB
> node 6 cpus: 48 49 50 51 52 53 54 55
> node 6 size: 65280 MB
> node 6 free: 63836 MB
> node 7 cpus: 56 57 58 59 60 61 62 63
> node 7 size: 65280 MB
> node 7 free: 64043 MB
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  20  20  20  20  20  20  20
>   1:  20  10  20  20  20  20  20  20
>   2:  20  20  10  20  20  20  20  20
>   3:  20  20  20  10  20  20  20  20
>   4:  20  20  20  20  10  20  20  20
>   5:  20  20  20  20  20  10  20  20
>   6:  20  20  20  20  20  20  10  20
>   7:  20  20  20  20  20  20  20  10
>
> KernelVersion: 3.9.0
> Testcase:        Min      Max      Avg   StdDev
> numa01:      1796.11  1848.89  1812.39    19.35
> numa02:        55.05    62.32    58.30     2.37
>
> KernelVersion: 3.9.0-mel_numa_balancing+()
> Testcase:        Min      Max      Avg   StdDev  %Change
> numa01:      1758.01  1929.12  1853.15    77.15   -2.11%
> numa02:        50.96    53.63    52.12     0.90   11.52%
>
> KernelVersion: 3.9.0-numa_balancing_v39+()
> Testcase:        Min      Max      Avg   StdDev  %Change
> numa01:      1081.66  1939.94  1500.01   350.20   16.10%
> numa02:        35.32    43.92    38.64     3.35   44.76%
>
>
> TODOs:
> 1. Use task loads for numa weights
> 2. Use numa faults as secondary key while moving threads
>
>
> Andrea Arcangeli (1):
>   x86, mm: Prevent gcc to re-read the pagetables
>
> Srikar Dronamraju (9):
>   sched: Introduce per node numa weights
>   sched: Use numa weights while migrating tasks
>   sched: Select a better task to pull across node using iterations
>   sched: Move active_load_balance_cpu_stop to a new helper function
>   sched: Extend idle balancing to look for consolidation of tasks
>   sched: Limit migrations from a node
>   sched: Pass hint to active balancer about the task to be chosen
>   sched: Prevent a task from migrating immediately after an active balance
>   sched: Choose a runqueue that has lesser local affinity tasks
>
>  arch/x86/mm/gup.c        |   23 ++-
>  fs/exec.c                |    6 +
>  include/linux/mm_types.h |    2 +
>  include/linux/sched.h    |    4 +
>  kernel/fork.c            |   11 +-
>  kernel/sched/core.c      |    2 +
>  kernel/sched/fair.c      |  443 ++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h     |    4 +
>  mm/memory.c              |    2 +-
>  9 files changed, 475 insertions(+), 22 deletions(-)

-Andrew