On Tue, 2013-07-30 at 13:18 +0530, Srikar Dronamraju wrote:
> Here is an approach that looks to consolidate workloads across nodes.
> This results in much improved performance. Again I would assume this work
> is complementary to Mel's work with numa faulting.
>
> Here are the advantages of this approach.
> 1. Provides excellent consolidation of tasks.
>    From my experiments, I have found that the better the task
>    consolidation, the better the memory layout, which results in
>    better performance.
>
> 2. Provides good improvement in most cases, but there are some regressions.
>
> 3. Looks to extend the load balancer, especially when the cpus are idling.
>
> Here is the outline of the approach.
>
> - Every process has a per-node array where we store the weight of all
>   its tasks running on that node. This array gets updated on task
>   enqueue/dequeue.
>
> - Added a 2-pass mechanism (somewhat taken from numacore but not
>   exactly) while choosing tasks to move across nodes.
>
>   In the first pass, choose only tasks that are ideal to be moved.
>   While choosing a task, look at the per-node process arrays to see if
>   moving the task helps.
>   If the first pass fails to move a task, any task can be chosen on the
>   second pass.
>
> - If the regular load balancer (rebalance_domain()) fails to balance the
>   load (or finds no imbalance) and there is an idle cpu, use it to
>   consolidate tasks onto nodes by using the information in the per-node
>   process arrays.
>
>   Every idle cpu, if it doesn't have tasks queued after load balance,
>   - will walk through the cpus in its node and check if there are buddy
>     tasks that are not part of the node but ideally should have been
>     part of this node.
>   - To make sure that we don't pull all buddy tasks and create an
>     imbalance, we look at the load on the node, pinned tasks and the
>     process's contribution to the load for this node.
>   - Each cpu looks at the node which has the least number of buddy tasks
>     running and tries to pull tasks from such nodes.
>
> - Once it finds the cpu from which to pull the tasks, it triggers
>   active balancing. This type of active balancing does just one pass,
>   i.e. it only fetches tasks that increase numa locality.
>
> Here are results of specjbb run on a 2 node machine.

Here's a comparison with 4 KVM VMs running dbench on a 4-socket, 40-core,
80-thread host:

kernel                          total dbench throughput
3.9-numabal-on                                    21242
3.9-numabal-off                                   20455
3.9-numabal-on-consolidate                        22541
3.9-numabal-off-consolidate                       21632
3.9-numabal-off-node-pinning                      26450
3.9-numabal-on-node-pinning                       25265

Based on the node pinning results, we have a long way to go, with either
numa-balancing and/or consolidation.
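Stepping back to the mechanism itself: to check that I am reading the
outline right, here is a minimal user-space model of the per-node weight
bookkeeping and of the first-pass filter. The names and the exact
comparison below are my own guesses for illustration, not the actual
kernel/sched/fair.c changes in this patchset.

#include <stdio.h>
#include <stdbool.h>

#define MAX_NODES 4

struct numa_process {
	/* Sum of the load weights of this process's tasks queued on each node. */
	unsigned long node_weight[MAX_NODES];
};

struct numa_task {
	struct numa_process *proc;
	unsigned long load;	/* this task's load weight */
	int cur_node;		/* node whose runqueue it currently sits on */
};

/* Stand-ins for the enqueue/dequeue hooks: keep the per-node array in sync. */
static void account_enqueue(struct numa_task *t, int node)
{
	t->cur_node = node;
	t->proc->node_weight[node] += t->load;
}

static void account_dequeue(struct numa_task *t)
{
	t->proc->node_weight[t->cur_node] -= t->load;
}

/*
 * First-pass filter (my guess at the "ideal to be moved" test): move a
 * task only if the destination node already holds more of its process's
 * weight than would remain behind on the source node, so the move
 * improves consolidation.  The second pass (not shown) accepts any task.
 */
static bool move_improves_consolidation(const struct numa_task *t, int dst_node)
{
	const struct numa_process *p = t->proc;

	return p->node_weight[dst_node] > p->node_weight[t->cur_node] - t->load;
}

int main(void)
{
	struct numa_process p = { { 0 } };
	struct numa_task a = { &p, 1024, 0 };
	struct numa_task b = { &p, 1024, 0 };
	struct numa_task c = { &p, 1024, 0 };

	account_enqueue(&a, 0);		/* bulk of the process on node 0 ... */
	account_enqueue(&b, 0);
	account_enqueue(&c, 1);		/* ... with one straggler on node 1 */

	printf("pull c to node 0: %s\n",
	       move_improves_consolidation(&c, 0) ? "pass 1" : "pass 2 only");
	printf("pull a to node 1: %s\n",
	       move_improves_consolidation(&a, 1) ? "pass 1" : "pass 2 only");

	account_dequeue(&c);		/* the straggler migrates to node 0 */
	account_enqueue(&c, 0);
	printf("node weights after migration: node0=%lu node1=%lu\n",
	       p.node_weight[0], p.node_weight[1]);
	return 0;
}

If that reading is correct, the first pass only ever pulls stragglers
toward wherever most of the process's weight already sits, which matches
the CPU clumping I see in the VM stats below.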
One thing the consolidation does help with is actually getting the sibling
tasks running on the same node:

% CPU usage by node for 1st VM
node00  node01  node02  node03
  094%    002%    001%    001%

However, the node which was chosen to consolidate tasks is not the same
node where most of the memory for the tasks is located:

% memory per node for 1st VM
            host-node00   host-node01   host-node02   host-node03
            -----------   -----------   -----------   -----------
VM-node00  295937(034%)  550400(064%)    6144(000%)       0(000%)

By comparison, same stats for numa-balancing on and no consolidation:

% CPU usage by node for 1st VM
node00  node01  node02  node03
  028%    027%    020%    023%   <- CPU usage spread across whole system

% memory per node for 1st VM
            host-node00   host-node01   host-node02   host-node03
            -----------   -----------   -----------   -----------
VM-node00   49153(006%)  673792(083%)   51712(006%)   36352(004%)

I think the consolidation is a nice concept, but it needs a much tighter
integration with numa balancing. The action to clump tasks on the same
node's runqueues should be triggered by detecting that they also access
the same memory.
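To make that concrete, here is the sort of check I have in mind, again as
a purely illustrative user-space sketch: the fault bookkeeping, the helper
and the threshold are all hypothetical, not an existing kernel interface.

#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 4

struct numa_task {
	/* Recent NUMA hinting faults attributed to each node's memory. */
	unsigned long faults[MAX_NODES];
};

/*
 * Only pull the task onto dst_node if at least `pct` percent of its
 * recorded faults were against memory on dst_node, so that the CPU
 * clumping follows the memory instead of picking an arbitrary node.
 */
static bool should_consolidate(const struct numa_task *t, int dst_node,
			       unsigned int pct)
{
	unsigned long total = 0;
	int node;

	for (node = 0; node < MAX_NODES; node++)
		total += t->faults[node];

	if (!total)
		return false;	/* no fault data yet: leave the task alone */

	return t->faults[dst_node] * 100 >= total * pct;
}

int main(void)
{
	/* A vcpu thread whose memory, per its fault history, lives on node 1. */
	struct numa_task vcpu = { .faults = { 20, 300, 10, 5 } };

	printf("pull to node 0? %s\n",
	       should_consolidate(&vcpu, 0, 50) ? "yes" : "no");	/* no  */
	printf("pull to node 1? %s\n",
	       should_consolidate(&vcpu, 1, 50) ? "yes" : "no");	/* yes */
	return 0;
}

With a gate like that, the node chosen to clump the first VM's tasks would
be the one already holding ~64% of its memory, rather than whichever node
the idle-balance pass happened to pick.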
> Specjbb was run on 3 vms.
> In the fit case, one vm was sized to fit in one node.
> In the no-fit case, one vm was bigger than the node size.
>
> --------------------------------------------------------------------------------------
> |kernel        |              nofit            |               fit             |  vm  |
> |kernel        |     noksm     |      ksm      |     noksm     |      ksm      |  vm  |
> |kernel        |  nothp|    thp|  nothp|    thp|  nothp|    thp|  nothp|    thp|  vm  |
> --------------------------------------------------------------------------------------
> |v3.9          | 136056| 189423| 135359| 186722| 136983| 191669| 136728| 184253| vm_1 |
> |v3.9          |  66041|  84779|  64564|  86645|  67426|  84427|  63657|  85043| vm_2 |
> |v3.9          |  67322|  83301|  63731|  85394|  65015|  85156|  63838|  84199| vm_3 |
> --------------------------------------------------------------------------------------
> |v3.9 + Mel(v5)| 133170| 177883| 136385| 176716| 140650| 174535| 132811| 190120| vm_1 |
> |v3.9 + Mel(v5)|  65021|  81707|  62876|  81826|  63635|  84943|  58313|  78997| vm_2 |
> |v3.9 + Mel(v5)|  61915|  82198|  60106|  81723|  64222|  81123|  59559|  78299| vm_3 |
> |   % change   |  -2.12|  -6.09|   0.76|  -5.36|   2.68|  -8.94|  -2.86|   3.18| vm_1 |
> |   % change   |  -1.54|  -3.62|  -2.61|  -5.56|  -5.62|   0.61|  -8.39|  -7.11| vm_2 |
> |   % change   |  -8.03|  -1.32|  -5.69|  -4.30|  -1.22|  -4.74|  -6.70|  -7.01| vm_3 |
> --------------------------------------------------------------------------------------
> |v3.9 + this   | 136766| 189704| 148642| 180723| 147474| 184711| 139270| 186768| vm_1 |
> |v3.9 + this   |  72742|  86980|  67561|  91659|  69781|  87741|  65989|  83508| vm_2 |
> |v3.9 + this   |  66075|  90591|  66135|  90059|  67942|  87229|  66100|  85908| vm_3 |
> |   % change   |   0.52|   0.15|   9.81|  -3.21|   7.66|  -3.63|   1.86|   1.36| vm_1 |
> |   % change   |  10.15|   2.60|   4.64|   5.79|   3.49|   3.93|   3.66|  -1.80| vm_2 |
> |   % change   |  -1.85|   8.75|   3.77|   5.46|   4.50|   2.43|   3.54|   2.03| vm_3 |
> --------------------------------------------------------------------------------------
>
>
> Autonuma benchmark results on a 2 node machine:
>
> KernelVersion: 3.9.0
> Testcase:               Min      Max      Avg   StdDev
> numa01:              118.98   122.37   120.96     1.17
> numa01_THREAD_ALLOC: 279.84   284.49   282.53     1.65
> numa02:               36.84    37.68    37.09     0.31
> numa02_SMT:           44.67    48.39    47.32     1.38
>
> KernelVersion: 3.9.0 + Mel's v5
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              115.02   123.08   120.83     3.04    0.11%
> numa01_THREAD_ALLOC: 268.59   298.47   281.15    11.16    0.46%
> numa02:               36.31    37.34    36.68     0.43    1.10%
> numa02_SMT:           43.18    43.43    43.29     0.08    9.28%
>
> KernelVersion: 3.9.0 + this patchset
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              103.46   112.31   106.44     3.10   12.93%
> numa01_THREAD_ALLOC: 277.51   289.81   283.88     4.98   -0.47%
> numa02:               36.72    40.81    38.42     1.85   -3.26%
> numa02_SMT:           56.50    60.00    58.08     1.23  -17.93%
>
> KernelVersion: 3.9.0 (HT)
> Testcase:               Min      Max      Avg   StdDev
> numa01:              241.23   244.46   242.94     1.31
> numa01_THREAD_ALLOC: 301.95   307.39   305.04     2.20
> numa02:               41.31    43.92    42.98     1.02
> numa02_SMT:           37.02    37.58    37.44     0.21
>
> KernelVersion: 3.9.0 + Mel's v5 (HT)
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              238.42   242.62   241.60     1.60    0.55%
> numa01_THREAD_ALLOC: 285.01   298.23   291.54     5.37    4.53%
> numa02:               38.08    38.16    38.11     0.03   12.76%
> numa02_SMT:           36.20    36.64    36.36     0.17    2.95%
>
> KernelVersion: 3.9.0 + this patchset (HT)
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              175.17   189.61   181.90     5.26   32.19%
> numa01_THREAD_ALLOC: 285.79   365.26   305.27    30.35   -0.06%
> numa02:               38.26    38.97    38.50     0.25   11.50%
> numa02_SMT:           44.66    49.22    46.22     1.60  -17.84%
>
>
> Autonuma benchmark results on a 4 node machine:
>
> # dmidecode | grep 'Product Name:'
>     Product Name: System x3750 M4 -[8722C1A]-
>
> # numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
> node 0 size: 65468 MB
> node 0 free: 63890 MB
> node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
> node 1 size: 65536 MB
> node 1 free: 64033 MB
> node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
> node 2 size: 65536 MB
> node 2 free: 64236 MB
> node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
> node 3 size: 65536 MB
> node 3 free: 64162 MB
> node distances:
> node   0   1   2   3
>   0:  10  11  11  12
>   1:  11  10  12  11
>   2:  11  12  10  11
>   3:  12  11  11  10
>
> KernelVersion: 3.9.0
> Testcase:               Min      Max      Avg   StdDev
> numa01:              581.35   761.95   681.23    80.97
> numa01_THREAD_ALLOC: 140.39   164.45   150.34     7.98
> numa02:               18.47    20.12    19.25     0.65
> numa02_SMT:           16.40    25.30    21.06     2.86
>
> KernelVersion: 3.9.0 + Mel's v5 patchset
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              733.15   767.99   748.88    14.51   -8.81%
> numa01_THREAD_ALLOC: 154.18   169.13   160.48     5.76   -6.00%
> numa02:               19.09    22.15    21.02     1.03   -7.99%
> numa02_SMT:           23.01    25.53    23.98     0.83  -11.44%
>
> KernelVersion: 3.9.0 + this patchset
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              409.64   457.91   444.55    17.66   51.69%
> numa01_THREAD_ALLOC: 158.10   174.89   169.32     5.84  -10.85%
> numa02:               18.89    22.36    19.98     1.29   -3.26%
> numa02_SMT:           23.33    27.87    25.02     1.68  -14.21%
>
> KernelVersion: 3.9.0 (HT)
> Testcase:               Min      Max      Avg   StdDev
> numa01:              567.62   752.06   620.26    66.72
> numa01_THREAD_ALLOC: 145.84   172.44   160.73    10.34
> numa02:               18.11    20.06    19.10     0.67
> numa02_SMT:           17.59    22.83    19.94     2.17
>
> KernelVersion: 3.9.0 + Mel's v5 patchset (HT)
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              741.13   753.91   748.10     4.51  -16.96%
> numa01_THREAD_ALLOC: 153.57   162.45   158.22     3.18    1.55%
> numa02:               19.15    20.96    20.04     0.64   -4.48%
> numa02_SMT:           22.57    25.92    23.87     1.15  -15.16%
>
> KernelVersion: 3.9.0 + this patchset (HT)
> Testcase:               Min      Max      Avg   StdDev  %Change
> numa01:              418.46   457.77   436.00    12.81   40.25%
> numa01_THREAD_ALLOC: 156.21   169.79   163.75     4.37   -1.78%
> numa02:               18.41    20.18    19.06     0.60    0.20%
> numa02_SMT:           22.72    27.24    25.29     1.76  -19.64%
>
>
> Autonuma results on an 8 node machine:
>
> # dmidecode | grep 'Product Name:'
>     Product Name: IBM x3950-[88722RZ]-
>
> # numactl -H
> available: 8 nodes (0-7)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 32510 MB
> node 0 free: 31475 MB
> node 1 cpus: 8 9 10 11 12 13 14 15
> node 1 size: 32512 MB
> node 1 free: 31709 MB
> node 2 cpus: 16 17 18 19 20 21 22 23
> node 2 size: 32512 MB
> node 2 free: 31737 MB
> node 3 cpus: 24 25 26 27 28 29 30 31
> node 3 size: 32512 MB
> node 3 free: 31736 MB
> node 4 cpus: 32 33 34 35 36 37 38 39
> node 4 size: 32512 MB
> node 4 free: 31739 MB
> node 5 cpus: 40 41 42 43 44 45 46 47
> node 5 size: 32512 MB
> node 5 free: 31639 MB
> node 6 cpus: 48 49 50 51 52 53 54 55
> node 6 size: 65280 MB
> node 6 free: 63836 MB
> node 7 cpus: 56 57 58 59 60 61 62 63
> node 7 size: 65280 MB
> node 7 free: 64043 MB
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  20  20  20  20  20  20  20
>   1:  20  10  20  20  20  20  20  20
>   2:  20  20  10  20  20  20  20  20
>   3:  20  20  20  10  20  20  20  20
>   4:  20  20  20  20  10  20  20  20
>   5:  20  20  20  20  20  10  20  20
>   6:  20  20  20  20  20  20  10  20
>   7:  20  20  20  20  20  20  20  10
>
> KernelVersion: 3.9.0
> Testcase:        Min      Max      Avg   StdDev
> numa01:      1796.11  1848.89  1812.39    19.35
> numa02:        55.05    62.32    58.30     2.37
>
> KernelVersion: 3.9.0-mel_numa_balancing+()
> Testcase:        Min      Max      Avg   StdDev  %Change
> numa01:      1758.01  1929.12  1853.15    77.15   -2.11%
> numa02:        50.96    53.63    52.12     0.90   11.52%
>
> KernelVersion: 3.9.0-numa_balancing_v39+()
> Testcase:        Min      Max      Avg   StdDev  %Change
> numa01:      1081.66  1939.94  1500.01   350.20   16.10%
> numa02:        35.32    43.92    38.64     3.35   44.76%
>
>
> TODOs:
> 1. Use task loads for numa weights
> 2. Use numa faults as secondary key while moving threads
>
>
> Andrea Arcangeli (1):
>   x86, mm: Prevent gcc to re-read the pagetables
>
> Srikar Dronamraju (9):
>   sched: Introduce per node numa weights
>   sched: Use numa weights while migrating tasks
>   sched: Select a better task to pull across node using iterations
>   sched: Move active_load_balance_cpu_stop to a new helper function
>   sched: Extend idle balancing to look for consolidation of tasks
>   sched: Limit migrations from a node
>   sched: Pass hint to active balancer about the task to be chosen
>   sched: Prevent a task from migrating immediately after an active balance
>   sched: Choose a runqueue that has lesser local affinity tasks
>
>  arch/x86/mm/gup.c        |   23 ++-
>  fs/exec.c                |    6 +
>  include/linux/mm_types.h |    2 +
>  include/linux/sched.h    |    4 +
>  kernel/fork.c            |   11 +-
>  kernel/sched/core.c      |    2 +
>  kernel/sched/fair.c      |  443 ++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h     |    4 +
>  mm/memory.c              |    2 +-
>  9 files changed, 475 insertions(+), 22 deletions(-)

-Andrew