Kirill,

I've pasted in my results for 512 cores below. Things are looking really good
here. I don't have a test for HUGETLBFS, but if you want to pass me the one
you used, I can run that too. I suppose I could write one, but why reinvent
the wheel? :)

Sorry for the delay on these results. I hit some strange issues running
thp_memscale on systems with either of the following combinations of
configuration options set:

[thp off]
HUGETLBFS=y
HUGETLB_PAGE=y
NUMA_BALANCING=y
NUMA_BALANCING_DEFAULT_ENABLED=y

[thp on or off]
HUGETLBFS=n
HUGETLB_PAGE=n
NUMA_BALANCING=y
NUMA_BALANCING_DEFAULT_ENABLED=y

I'm getting intermittent segfaults, as well as some weird RCU sched errors.
This happens on vanilla 3.12-rc2, so it isn't related to your patches, but I
thought I'd let you know. This test used to run without any issues, so I
think there's a subtle kernel bug lurking here. That is, of course, an
entirely separate issue.

As far as these patches go, I think everything looks good (save for the bit
of discussion you were having with Andrew earlier, which I believe you've
worked out). My testing shows that the page fault rates are actually better
on this threaded test than in the non-threaded case!
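
One note for anyone else skimming the thread: the locking pattern I understand
the series to convert huge-PMD users to looks roughly like the sketch below.
This is my own illustration (assuming a vma and pmd already in hand), using
the pmd_lock() name from the changelog; it is not code lifted from the patches.

    spinlock_t *ptl;

    /*
     * With the series applied, huge-PMD users take a per-table lock via
     * pmd_lock() instead of serializing everything on mm->page_table_lock;
     * the lock pointer is returned so the caller can drop it afterwards.
     */
    ptl = pmd_lock(vma->vm_mm, pmd);
    if (pmd_trans_huge(*pmd)) {
        /* ... operate on the huge PMD while its page table lock is held ... */
    }
    spin_unlock(ptl);
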
- Alex

On Fri, Sep 27, 2013 at 04:16:17PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton noticed that some massively threaded workloads work poorly
> if THP is enabled. This patchset fixes this by introducing split page table
> lock for PMD tables. hugetlbfs is not covered yet.
>
> This patchset is based on work by Naoya Horiguchi.
>
> Please review and consider applying.
>
> Changes:
>  v4:
>   - convert hugetlb to new locking;
>  v3:
>   - fix USE_SPLIT_PMD_PTLOCKS;
>   - fix warning in fs/proc/task_mmu.c;
>  v2:
>   - reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock;
>   - s/huge_pmd_lock/pmd_lock/g;
>   - assume pgtable_pmd_page_ctor() can fail;
>   - fix format line in task_mem() for VmPTE;
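
Also, to check that I'm reading the "assume pgtable_pmd_page_ctor() can fail"
item correctly: I take it an arch's PMD page allocation ends up looking
roughly like the sketch below, backing out the page if the ctor fails? (My
own sketch to confirm my understanding, not code from the series.)

    struct page *page;

    page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);
    if (!page)
        return NULL;
    /* the ctor is now allowed to fail, so undo the allocation if it does */
    if (!pgtable_pmd_page_ctor(page)) {
        __free_pages(page, 0);
        return NULL;
    }
    return (pmd_t *)page_address(page);
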
> THP off, v3.12-rc2:
> -------------------
>
>  Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
>
>       1037072.835207 task-clock                #  57.426 CPUs utilized            ( +- 3.59% )
>               95,093 context-switches          #   0.092 K/sec                    ( +- 3.93% )
>                  140 cpu-migrations            #   0.000 K/sec                    ( +- 5.28% )
>           10,000,550 page-faults               #   0.010 M/sec                    ( +- 0.00% )
>    2,455,210,400,261 cycles                    #   2.367 GHz                      ( +- 3.62% ) [83.33%]
>    2,429,281,882,056 stalled-cycles-frontend   #  98.94% frontend cycles idle     ( +- 3.67% ) [83.33%]
>    1,975,960,019,659 stalled-cycles-backend    #  80.48% backend cycles idle      ( +- 3.88% ) [66.68%]
>       46,503,296,013 instructions              #   0.02  insns per cycle
>                                                #  52.24  stalled cycles per insn  ( +- 3.21% ) [83.34%]
>        9,278,997,542 branches                  #   8.947 M/sec                    ( +- 4.00% ) [83.34%]
>           89,881,640 branch-misses             #   0.97% of all branches          ( +- 1.17% ) [83.33%]
>
>         18.059261877 seconds time elapsed                                         ( +- 2.65% )
>
> THP on, v3.12-rc2:
> ------------------
>
>  Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
>
>       3114745.395974 task-clock                #  73.875 CPUs utilized            ( +- 1.84% )
>              267,356 context-switches          #   0.086 K/sec                    ( +- 1.84% )
>                   99 cpu-migrations            #   0.000 K/sec                    ( +- 1.40% )
>               58,313 page-faults               #   0.019 K/sec                    ( +- 0.28% )
>    7,416,635,817,510 cycles                    #   2.381 GHz                      ( +- 1.83% ) [83.33%]
>    7,342,619,196,993 stalled-cycles-frontend   #  99.00% frontend cycles idle     ( +- 1.88% ) [83.33%]
>    6,267,671,641,967 stalled-cycles-backend    #  84.51% backend cycles idle      ( +- 2.03% ) [66.67%]
>      117,819,935,165 instructions              #   0.02  insns per cycle
>                                                #  62.32  stalled cycles per insn  ( +- 4.39% ) [83.34%]
>       28,899,314,777 branches                  #   9.278 M/sec                    ( +- 4.48% ) [83.34%]
>           71,787,032 branch-misses             #   0.25% of all branches          ( +- 1.03% ) [83.33%]
>
>         42.162306788 seconds time elapsed                                         ( +- 1.73% )

THP on, v3.12-rc2 (512 cores):
------------------------------

 Performance counter stats for './thp_memscale -C 0 -m 0 -c 512 -b 512m' (5 runs):

     568668865.944994 task-clock                # 528.547 CPUs utilized            ( +- 0.21% ) [100.00%]
            1,491,589 context-switches          #   0.000 M/sec                    ( +- 0.25% ) [100.00%]
                1,085 CPU-migrations            #   0.000 M/sec                    ( +- 1.80% ) [100.00%]
              400,822 page-faults               #   0.000 M/sec                    ( +- 0.41% )
1,306,612,476,049,478 cycles                    #   2.298 GHz                      ( +- 0.23% ) [100.00%]
1,277,211,694,318,724 stalled-cycles-frontend   #  97.75% frontend cycles idle     ( +- 0.21% ) [100.00%]
1,163,736,844,232,064 stalled-cycles-backend    #  89.07% backend cycles idle      ( +- 0.20% ) [100.00%]
   53,855,178,678,230 instructions              #   0.04  insns per cycle
                                                #  23.72  stalled cycles per insn  ( +- 1.15% ) [100.00%]
   21,041,661,816,782 branches                  #  37.002 M/sec                    ( +- 0.64% ) [100.00%]
          606,665,092 branch-misses             #   0.00% of all branches          ( +- 0.63% )

       1075.909782795 seconds time elapsed                                         ( +- 0.21% )

> HUGETLB, v3.12-rc2:
> -------------------
>
>  Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):
>
>       2588052.787264 task-clock                #  54.400 CPUs utilized            ( +- 3.69% )
>              246,831 context-switches          #   0.095 K/sec                    ( +- 4.15% )
>                  138 cpu-migrations            #   0.000 K/sec                    ( +- 5.30% )
>               21,027 page-faults               #   0.008 K/sec                    ( +- 0.01% )
>    6,166,666,307,263 cycles                    #   2.383 GHz                      ( +- 3.68% ) [83.33%]
>    6,086,008,929,407 stalled-cycles-frontend   #  98.69% frontend cycles idle     ( +- 3.77% ) [83.33%]
>    5,087,874,435,481 stalled-cycles-backend    #  82.51% backend cycles idle      ( +- 4.41% ) [66.67%]
>      133,782,831,249 instructions              #   0.02  insns per cycle
>                                                #  45.49  stalled cycles per insn  ( +- 4.30% ) [83.34%]
>       34,026,870,541 branches                  #  13.148 M/sec                    ( +- 4.24% ) [83.34%]
>           68,670,942 branch-misses             #   0.20% of all branches          ( +- 3.26% ) [83.33%]
>
>         47.574936948 seconds time elapsed                                         ( +- 2.09% )
>
> THP off, patched:
> -----------------
>
>  Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
>
>        943301.957892 task-clock                #  56.256 CPUs utilized            ( +- 3.01% )
>               86,218 context-switches          #   0.091 K/sec                    ( +- 3.17% )
>                  121 cpu-migrations            #   0.000 K/sec                    ( +- 6.64% )
>           10,000,551 page-faults               #   0.011 M/sec                    ( +- 0.00% )
>    2,230,462,457,654 cycles                    #   2.365 GHz                      ( +- 3.04% ) [83.32%]
>    2,204,616,385,805 stalled-cycles-frontend   #  98.84% frontend cycles idle     ( +- 3.09% ) [83.32%]
>    1,778,640,046,926 stalled-cycles-backend    #  79.74% backend cycles idle      ( +- 3.47% ) [66.69%]
>       45,995,472,617 instructions              #   0.02  insns per cycle
>                                                #  47.93  stalled cycles per insn  ( +- 2.51% ) [83.34%]
>        9,179,700,174 branches                  #   9.731 M/sec                    ( +- 3.04% ) [83.35%]
>           89,166,529 branch-misses             #   0.97% of all branches          ( +- 1.45% ) [83.33%]
>
>         16.768027318 seconds time elapsed                                         ( +- 2.47% )
>
> THP on, patched:
> ----------------
>
>  Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
>
>        458793.837905 task-clock                #  54.632 CPUs utilized            ( +- 0.79% )
>               41,831 context-switches          #   0.091 K/sec                    ( +- 0.97% )
>                   98 cpu-migrations            #   0.000 K/sec                    ( +- 1.66% )
>               57,829 page-faults               #   0.126 K/sec                    ( +- 0.62% )
>    1,077,543,336,716 cycles                    #   2.349 GHz                      ( +- 0.81% ) [83.33%]
>    1,067,403,802,964 stalled-cycles-frontend   #  99.06% frontend cycles idle     ( +- 0.87% ) [83.33%]
>      864,764,616,143 stalled-cycles-backend    #  80.25% backend cycles idle      ( +- 0.73% ) [66.68%]
>       16,129,177,440 instructions              #   0.01  insns per cycle
>                                                #  66.18  stalled cycles per insn  ( +- 7.94% ) [83.35%]
>        3,618,938,569 branches                  #   7.888 M/sec                    ( +- 8.46% ) [83.36%]
>           33,242,032 branch-misses             #   0.92% of all branches          ( +- 2.02% ) [83.32%]
>
>          8.397885779 seconds time elapsed                                         ( +- 0.18% )

THP on, patched (512 cores):
----------------------------

 Performance counter stats for './runt -t -c 512 -b 512m' (5 runs):

      15836198.490485 task-clock                # 533.304 CPUs utilized            ( +- 0.95% ) [100.00%]
              127,507 context-switches          #   0.000 M/sec                    ( +- 1.65% ) [100.00%]
                1,223 CPU-migrations            #   0.000 M/sec                    ( +- 3.23% ) [100.00%]
              302,080 page-faults               #   0.000 M/sec                    ( +- 6.88% )
   18,925,875,973,975 cycles                    #   1.195 GHz                      ( +- 0.43% ) [100.00%]
   18,325,469,464,007 stalled-cycles-frontend   #  96.83% frontend cycles idle     ( +- 0.44% ) [100.00%]
   17,522,272,147,141 stalled-cycles-backend    #  92.58% backend cycles idle      ( +- 0.49% ) [100.00%]
    2,686,490,067,197 instructions              #   0.14  insns per cycle
                                                #   6.82  stalled cycles per insn  ( +- 2.16% ) [100.00%]
      944,712,646,402 branches                  #  59.655 M/sec                    ( +- 2.03% ) [100.00%]
          145,956,565 branch-misses             #   0.02% of all branches          ( +- 0.88% )

         29.694499652 seconds time elapsed                                         ( +- 0.95% )

(These results are from the test suite that I ripped thp_memscale out of, but
it's the same test.)

> HUGETLB, patched:
> -----------------
>
>  Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):
>
>        395353.076837 task-clock                #  20.329 CPUs utilized            ( +- 8.16% )
>               55,730 context-switches          #   0.141 K/sec                    ( +- 5.31% )
>                  138 cpu-migrations            #   0.000 K/sec                    ( +- 4.24% )
>               21,027 page-faults               #   0.053 K/sec                    ( +- 0.00% )
>      930,219,717,244 cycles                    #   2.353 GHz                      ( +- 8.21% ) [83.32%]
>      914,295,694,103 stalled-cycles-frontend   #  98.29% frontend cycles idle     ( +- 8.35% ) [83.33%]
>      704,137,950,187 stalled-cycles-backend    #  75.70% backend cycles idle      ( +- 9.16% ) [66.69%]
>       30,541,538,385 instructions              #   0.03  insns per cycle
>                                                #  29.94  stalled cycles per insn  ( +- 3.98% ) [83.35%]
>        8,415,376,631 branches                  #  21.286 M/sec                    ( +- 3.61% ) [83.36%]
>           32,645,478 branch-misses             #   0.39% of all branches          ( +- 3.41% ) [83.32%]
>
>         19.447481153 seconds time elapsed                                         ( +- 2.00% )
>
> Kirill A. Shutemov (10):
>   mm: rename USE_SPLIT_PTLOCKS to USE_SPLIT_PTE_PTLOCKS
>   mm: convert mm->nr_ptes to atomic_t
>   mm: introduce api for split page table lock for PMD level
>   mm, thp: change pmd_trans_huge_lock() to return taken lock
>   mm, thp: move ptl taking inside page_check_address_pmd()
>   mm, thp: do not access mm->pmd_huge_pte directly
>   mm, hugetlb: convert hugetlbfs to use split pmd lock
>   mm: convert the rest to new page table lock api
>   mm: implement split page table lock for PMD level
>   x86, mm: enable split page table lock for PMD level
>
>  arch/arm/mm/fault-armv.c       |   6 +-
>  arch/s390/mm/pgtable.c         |  12 +--
>  arch/sparc/mm/tlb.c            |  12 +--
>  arch/x86/Kconfig               |   4 +
>  arch/x86/include/asm/pgalloc.h |  11 ++-
>  arch/x86/xen/mmu.c             |   6 +-
>  fs/proc/meminfo.c              |   2 +-
>  fs/proc/task_mmu.c             |  16 ++--
>  include/linux/huge_mm.h        |  17 ++--
>  include/linux/hugetlb.h        |  25 +++++
>  include/linux/mm.h             |  52 ++++++++++-
>  include/linux/mm_types.h       |  18 ++--
>  include/linux/swapops.h        |   7 +-
>  kernel/fork.c                  |   6 +-
>  mm/Kconfig                     |   3 +
>  mm/huge_memory.c               | 201 ++++++++++++++++++++++++-----------------
>  mm/hugetlb.c                   | 108 +++++++++++++---------
>  mm/memcontrol.c                |  10 +-
>  mm/memory.c                    |  21 +++--
>  mm/mempolicy.c                 |   5 +-
>  mm/migrate.c                   |  14 +--
>  mm/mmap.c                      |   3 +-
>  mm/mprotect.c                  |   4 +-
>  mm/oom_kill.c                  |   6 +-
>  mm/pgtable-generic.c           |  16 ++--
>  mm/rmap.c                      |  15 ++-
>  26 files changed, 379 insertions(+), 221 deletions(-)
>
> --
> 1.8.4.rc3