This patch changes the way we decide whether to hand out THPs to processes
when they fault in pages. As things stand now, touching one byte of a 2M
chunk where no pages have been faulted in results in the process being
handed a 2M hugepage, which is sometimes undesirable. The most common
problem arises when a process uses many cores to work on small portions of
an allocated chunk of memory.

Here are some results from a test I wrote, which allocates memory in a way
that doesn't benefit from THPs:

# echo always > /sys/kernel/mm/transparent_hugepage/enabled
# perf stat -a -r 5 ./thp_pthread -C 0 -m 0 -c 64 -b 128g

 Performance counter stats for './thp_pthread -C 0 -m 0 -c 64 -b 128g' (5 runs):

   61971685.470621 task-clock                #  662.557 CPUs utilized            ( +-  0.68% ) [100.00%]
           200,365 context-switches          #    0.000 M/sec                    ( +-  0.64% ) [100.00%]
                94 CPU-migrations            #    0.000 M/sec                    ( +-  3.76% ) [100.00%]
            61,644 page-faults               #    0.000 M/sec                    ( +-  0.00% )
11,771,748,145,744 cycles                    #    0.190 GHz                      ( +-  0.78% ) [100.00%]
17,958,073,323,609 stalled-cycles-frontend   #  152.55% frontend cycles idle     ( +-  0.97% ) [100.00%]
     <not counted> stalled-cycles-backend
10,691,478,094,935 instructions              #    0.91  insns per cycle
                                             #    1.68  stalled cycles per insn  ( +-  0.66% ) [100.00%]
 1,593,798,555,131 branches                  #   25.718 M/sec                    ( +-  0.62% ) [100.00%]
       102,473,582 branch-misses             #    0.01% of all branches          ( +-  0.43% )

      93.534078104 seconds time elapsed                                          ( +-  0.68% )

# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# perf stat -a -r 5 ./thp_pthread -C 0 -m 0 -c 64 -b 128g

 Performance counter stats for './thp_pthread -C 0 -m 0 -c 64 -b 128g' (5 runs):

   50703784.027438 task-clock                #  663.073 CPUs utilized            ( +-  0.18% ) [100.00%]
           162,324 context-switches          #    0.000 M/sec                    ( +-  0.22% ) [100.00%]
                91 CPU-migrations            #    0.000 M/sec                    ( +-  9.22% ) [100.00%]
        31,250,840 page-faults               #    0.001 M/sec                    ( +-  0.00% )
 7,962,585,261,769 cycles                    #    0.157 GHz                      ( +-  0.21% ) [100.00%]
 9,230,610,615,208 stalled-cycles-frontend   #  115.92% frontend cycles idle     ( +-  0.23% ) [100.00%]
     <not counted> stalled-cycles-backend
16,899,387,283,411 instructions              #    2.12  insns per cycle
                                             #    0.55  stalled cycles per insn  ( +-  0.16% ) [100.00%]
 2,422,269,260,013 branches                  #   47.773 M/sec                    ( +-  0.16% ) [100.00%]
        99,419,683 branch-misses             #    0.00% of all branches          ( +-  0.22% )

      76.467835263 seconds time elapsed                                          ( +-  0.18% )

As you can see, there's a significant performance increase when running
this test with THP off. Here's a pointer to the test, for those who are
interested:

http://oss.sgi.com/projects/memtests/thp_pthread.tar.gz

My proposed solution is to let users set a threshold at which THPs will be
handed out. The idea is that, when a user faults in a page in an area
where they would usually be handed a THP, we pull 512 pages off the free
list, as we would for a regular THP, but we fault in only single pages
from that chunk until the user has faulted in enough pages to pass the
threshold. Once they pass the threshold, we do the necessary work to turn
our 512-page chunk into a proper THP. As it stands now, if the user faults
in pages from more than one node, we completely give up on ever turning
that chunk into a THP and just fault in 4K pages as they're requested. We
may want to make this tunable in the future (i.e. allow faults from up to
2 different nodes before giving up).

This patch is still a work in progress, and it has a few known issues that
I've yet to sort out:

- Bad page state bug resulting from pages being added to the pagevecs
  improperly
  + This bug doesn't seem to hit when allocating small amounts of memory
    on 32 or fewer cores, but it becomes an issue on larger test runs.
  + I believe the best way to avoid this is to make sure we don't
    lru_cache_add any of the pages in our chunk until we decide whether
    or not we'll turn the chunk into a THP. I haven't quite gotten this
    working yet.
- A few small accounting issues with some of the mm counters
- Some spots are still pretty hacky and need to be cleaned up a bit

Just to let people know, I've been doing most of my testing with the
memscale test:

http://oss.sgi.com/projects/memtests/thp_memscale.tar.gz

The pthread test hits the first bug mentioned above much more often, but
the patch seems to be more stable when tested with memscale. I typically
run something like this to test:

# ./thp_memscale -C 0 -m 0 -c 32 -b 16m

As you increase the amount of memory or the number of cores, you become
more likely to run into issues.

Although there's still work to be done here, I wanted to get an early
version of the patch out so that everyone could give their
opinions/suggestions. The patch should apply cleanly to the 3.12 kernel.
I'll rebase it as soon as some of the remaining issues have been sorted
out; that will also mean changing over to the split PTL where appropriate.

Signed-off-by: Alex Thorlton <athorlton@xxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Nate Zimmer <nzimmer@xxxxxxx>
Cc: Cliff Wickman <cpw@xxxxxxx>
Cc: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
Cc: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Wanpeng Li <liwanp@xxxxxxxxxxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Michel Lespinasse <walken@xxxxxxxxxx>
Cc: Benjamin LaHaise <bcrl@xxxxxxxxx>
Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
Cc: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
Cc: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Cc: Zhang Yanfei <zhangyanfei@xxxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxx>
Cc: Jiang Liu <jiang.liu@xxxxxxxxxx>
Cc: Cody P Schafer <cody@xxxxxxxxxxxxxxxxxx>
Cc: Glauber Costa <glommer@xxxxxxxxxxxxx>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
Cc: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
Cc: linux-kernel@xxxxxxxxxxxxxxx
Cc: linux-mm@xxxxxxxxx

Alex Thorlton (3):
  Add flags for temporary compound pages
  Add tunable to control THP behavior
  Change THP behavior

 include/linux/gfp.h      |   5 +
 include/linux/huge_mm.h  |   8 ++
 include/linux/mm_types.h |  14 +++
 kernel/fork.c            |   1 +
 mm/huge_memory.c         | 313 +++++++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h            |   1 +
 mm/memory.c              |  29 ++++-
 mm/page_alloc.c          |  66 +++++++++-
 8 files changed, 430 insertions(+), 7 deletions(-)

-- 
1.7.12.4