When clearing a large region, or when the user explicitly specifies
via FOLL_HINT_BULK that a call to get_user_pages() is part of a larger
region, take the uncached path.

One notable limitation is that this is only done when the underlying
pages are huge or gigantic, even if a large region composed of PAGE_SIZE
pages is being cleared. This is because uncached stores are generally
weakly ordered and need some kind of store fence -- which would need
to be done at PTE write granularity to avoid data leakage. This would be
expensive enough that it would negate any performance advantage.

Performance
====

System:    Oracle E4-2C (2 nodes * 64 cores * 2 threads) (Milan)
Processor: AMD EPYC 7J13 64-Core
Memory:    2048 GB evenly split between nodes
LLC-size:  32MB for each CCX (8-core * 2-threads)
boost: 0, Microcode: 0xa001137, scaling-governor: performance

System:    Oracle X9-2 (2 nodes * 32 cores * 2 threads) (Icelake)
Processor: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Memory:    512 GB evenly split between nodes
LLC-size:  48MB for each node (32-cores * 2-threads)
no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance

Workload: qemu-VM-create
==

Create a large VM, backed by preallocated 2MB pages.
(This test needs a minor change in qemu so it mmap's with
MAP_POPULATE instead of demand faulting each page.)

 Milan, sz=1550 GB, runs=3      BW           stdev     diff
                                ----------   ------    --------
 baseline (clear_page_erms)      8.05 GBps    0.08
 CLZERO   (clear_page_clzero)   29.94 GBps    0.31     +271.92%

 (VM creation time decreases from 192.6s to 51.7s.)

 Icelake, sz=200 GB, runs=3     BW           stdev     diff
                                ----------   ------    ---------
 baseline (clear_page_erms)      8.25 GBps    0.05
 MOVNT    (clear_page_movnt)    21.55 GBps    0.31     +161.21%

 (VM creation time decreases from 25.2s to 9.3s.)

As the diff shows, for both these micro-architectures there's a
significant speedup with the CLZERO and MOVNT based interfaces.

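As a quick sanity check on the numbers above, the "diff" column and the reported VM creation times can be approximately rederived from the bandwidth figures. The helpers below are a minimal sketch written for this note (the names pct_change and clear_time_s are hypothetical, not part of the patch), and they assume creation time is dominated by page clearing.

```c
/* Relative change of `after` over `before`, in percent. */
double pct_change(double before, double after)
{
	return (after / before - 1.0) * 100.0;
}

/*
 * Seconds needed to clear `size_gb` gigabytes at a sustained bandwidth
 * of `bw_gbps` gigabytes per second. Assumes clearing dominates the
 * total time, which is only an approximation for VM creation.
 */
double clear_time_s(double size_gb, double bw_gbps)
{
	return size_gb / bw_gbps;
}
```

With the Milan figures, pct_change(8.05, 29.94) comes out at roughly 271.9, and clear_time_s(1550, 8.05) at roughly 192.5 seconds, both in line with the table. Small differences are expected because the table was presumably computed from unrounded bandwidths.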
Workload: Kernel build with background clear_huge_page()
==

Probe the cache-pollution aspect of this commit with a kernel
build (make -j 15 bzImage) alongside a background clear_huge_page()
load which does mmap(length=64GB, flags=MAP_POPULATE|MAP_HUGE_2MB) in
a loop.

The expectation -- assuming the kernel build performance is partly
cache limited -- is that the background load of clear_page_erms()
should show a greater slowdown, than clear_page_movnt() or
clear_page_clzero().
The build itself does not use THP or similar, so any performance changes
are due to the background load.

 # Milan, compile.sh internally tasksets to a CCX
 # perf stat -r 5 -e task-clock -e cycles -e stalled-cycles-frontend \
   -e stalled-cycles-backend -e instructions -e branches \
   -e branch-misses -e L1-dcache-loads -e L1-dcache-load-misses \
   -e cache-references -e cache-misses -e all_data_cache_accesses \
   -e l1_data_cache_fills_all -e l1_data_cache_fills_from_memory \
   ./compile.sh

 Milan                 kernel-build[1]         kernel-build[2]          kernel-build[3]
                       (bg: nothing)           (bg:clear_page_erms())   (bg:clear_page_clzero())
 -----------------     ---------------------   ----------------------   ------------------------
 run time              280.12s   (+- 0.59%)    322.21s   (+- 0.26%)     307.02s   (+- 1.35%)
 IPC                   1.16                    1.05                     1.14
 backend-idle          3.78%     (+- 0.06%)    4.62%     (+- 0.11%)     3.87%     (+- 0.10%)
 cache-misses          20.08%    (+- 0.14%)    20.88%    (+- 0.13%)     20.09%    (+- 0.11%)
 (% of cache-refs)
 l1_data_cache_fills-  2.77M/sec (+- 0.20%)    3.11M/sec (+- 0.32%)     2.73M/sec (+- 0.12%)
 _from_memory

From the backend-idle stats in [1], the kernel build is only mildly
memory subsystem bound. However, there's a small but clear effect where
the background load of clear_page_clzero() does not leave much of an
imprint on the kernel-build in [3] -- both [1] and [3] have largely
similar IPC, memory and cache behaviour. OTOH, the clear_page_erms()
workload in [2] constrains the kernel-build more.

(Fuller perf stat output, at [1], [2], [3].)

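For reference, the background load described above could look roughly like the following. This is a sketch only: the original test program is not included in this message, the function names are hypothetical, and the extra MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB flags are an assumption about how the mapping was set up. The hugetlb variant needs 2MB huge pages to be reserved beforehand.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for older libc headers; values match the Linux uapi. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
#endif

/*
 * Map `len` bytes with MAP_POPULATE so every page is faulted in (and
 * therefore cleared by the kernel) before mmap() returns, then unmap.
 * Returns 0 on success, -1 on failure.
 */
int populate_once(size_t len, int use_hugetlb)
{
	int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE;
	void *p;

	if (use_hugetlb)
		flags |= MAP_HUGETLB | MAP_HUGE_2MB;	/* assumed flag set */

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	return munmap(p, len);
}

/* Repeat the populate/unmap cycle to keep the page-clearing path busy. */
int background_load(size_t len, unsigned long iterations)
{
	unsigned long i;

	for (i = 0; i < iterations; i++)
		if (populate_once(len, 1))
			return -1;
	return 0;
}
```

Calling background_load(64UL << 30, n) would approximate the 64GB loop from the description, with n chosen so the load outlasts the kernel build.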
 # Icelake, compile.sh internally tasksets to a socket
 # perf stat -r 5 -e task-clock -e cycles -e stalled-cycles-frontend \
   -e stalled-cycles-backend -e instructions -e branches \
   -e branch-misses -e L1-dcache-loads -e L1-dcache-load-misses \
   -e cache-references -e cache-misses -e LLC-loads \
   -e LLC-load-misses ./compile.sh

 Icelake               kernel-build[4]       kernel-build[5]          kernel-build[6]
                       (bg: nothing)         (bg:clear_page_erms())   (bg:clear_page_movnt())
 -----------------     -----------------     ----------------------   -----------------------
 run time              135.47s (+- 0.25%)    136.75s (+- 0.23%)       135.65s (+- 0.15%)
 IPC                   1.81                  1.80                     1.80
 cache-misses          21.68%  (+- 0.42%)    22.88%  (+- 0.87%)       21.19%  (+- 0.51%)
 (% of cache-refs)
 LLC-load-misses       35.56%  (+- 0.83%)    37.44%  (+- 0.99%)       33.54%  (+- 1.17%)

From the LLC-load-miss and the cache-miss numbers, clear_page_erms()
seems to cause some additional cache contention in the kernel-build in
[5], compared to [4] and [6]. However, from the IPC and the run time
numbers, looks like the CPU pipeline compensates for the extra misses
quite well.
(Increasing the number of make jobs to 60, did not change the overall
picture appreciably.)

(Fuller perf stat output, at [4], [5], [6].)

[1] Milan, kernel-build

Performance counter stats for './compile.sh' (5 runs):

      2,525,721.45 msec  task-clock                       #    9.016 CPUs utilized            ( +- 0.06% )
 4,642,144,895,632       cycles                           #    1.838 GHz                      ( +- 0.01% )  (47.38%)
    54,430,239,074       stalled-cycles-frontend          #    1.17% frontend cycles idle     ( +- 0.16% )  (47.35%)
   175,620,521,760       stalled-cycles-backend           #    3.78% backend cycles idle      ( +- 0.06% )  (47.34%)
 5,392,053,273,328       instructions                     #    1.16  insn per cycle
                                                          #    0.03  stalled cycles per insn  ( +- 0.02% )  (47.34%)
 1,181,224,298,651       branches                         #  467.572 M/sec                    ( +- 0.01% )  (47.33%)
    27,668,103,863       branch-misses                    #    2.34% of all branches          ( +- 0.04% )  (47.33%)
 2,141,384,087,286       L1-dcache-loads                  #  847.639 M/sec                    ( +- 0.01% )  (47.32%)
    86,216,717,118       L1-dcache-load-misses            #    4.03% of all L1-dcache accesses  ( +- 0.08% )  (47.35%)
   264,844,001,975       cache-references                 #  104.835 M/sec                    ( +- 0.03% )  (47.36%)
    53,225,109,745       cache-misses                     #   20.086 % of all cache refs      ( +- 0.14% )  (47.37%)
 2,610,041,169,859       all_data_cache_accesses          #    1.033 G/sec                    ( +- 0.01% )  (47.37%)
    96,419,361,379       l1_data_cache_fills_all          #   38.166 M/sec                    ( +- 0.06% )  (47.37%)
     7,005,118,698       l1_data_cache_fills_from_memory  #    2.773 M/sec                    ( +- 0.20% )  (47.38%)

            280.12 +- 1.65 seconds time elapsed  ( +- 0.59% )

[2] Milan, kernel-build + (bg: clear_page_erms() workload)

Performance counter stats for './compile.sh' (5 runs):

      2,852,168.93 msec  task-clock                       #    8.852 CPUs utilized            ( +- 0.14% )
 5,166,249,772,084       cycles                           #    1.821 GHz                      ( +- 0.05% )  (47.27%)
    62,039,291,151       stalled-cycles-frontend          #    1.20% frontend cycles idle     ( +- 0.04% )  (47.29%)
   238,472,446,709       stalled-cycles-backend           #    4.62% backend cycles idle      ( +- 0.11% )  (47.30%)
 5,419,530,293,688       instructions                     #    1.05  insn per cycle
                                                          #    0.04  stalled cycles per insn  ( +- 0.01% )  (47.31%)
 1,186,958,893,481       branches                         #  418.404 M/sec                    ( +- 0.01% )  (47.31%)
    28,106,023,654       branch-misses                    #    2.37% of all branches          ( +- 0.03% )  (47.29%)
 2,160,377,315,024       L1-dcache-loads                  #  761.534 M/sec                    ( +- 0.03% )  (47.26%)
    89,101,836,173       L1-dcache-load-misses            #    4.13% of all L1-dcache accesses  ( +- 0.06% )  (47.25%)
   276,859,144,248       cache-references                 #   97.593 M/sec                    ( +- 0.04% )  (47.22%)
    57,774,174,239       cache-misses                     #   20.889 % of all cache refs      ( +- 0.13% )  (47.24%)
 2,641,613,011,234       all_data_cache_accesses          #  931.170 M/sec                    ( +- 0.01% )  (47.22%)
    99,595,968,133       l1_data_cache_fills_all          #   35.108 M/sec                    ( +- 0.06% )  (47.24%)
     8,831,873,628       l1_data_cache_fills_from_memory  #    3.113 M/sec                    ( +- 0.32% )  (47.23%)

           322.211 +- 0.837 seconds time elapsed  ( +- 0.26% )

[3] Milan, kernel-build + (bg: clear_page_clzero() workload)

Performance counter stats for './compile.sh' (5 runs):

      2,607,387.17 msec  task-clock                       #    8.493 CPUs utilized            ( +- 0.14% )
 4,749,807,054,468       cycles                           #    1.824 GHz                      ( +- 0.09% )  (47.28%)
    56,579,908,946       stalled-cycles-frontend          #    1.19% frontend cycles idle     ( +- 0.19% )  (47.28%)
   183,367,955,020       stalled-cycles-backend           #    3.87% backend cycles idle      ( +- 0.10% )  (47.28%)
 5,395,577,260,957       instructions                     #    1.14  insn per cycle
                                                          #    0.03  stalled cycles per insn  ( +- 0.02% )  (47.29%)
 1,181,904,525,139       branches                         #  453.753 M/sec                    ( +- 0.01% )  (47.30%)
    27,702,316,890       branch-misses                    #    2.34% of all branches          ( +- 0.02% )  (47.31%)
 2,137,616,885,978       L1-dcache-loads                  #  820.667 M/sec                    ( +- 0.01% )  (47.32%)
    85,841,996,509       L1-dcache-load-misses            #    4.02% of all L1-dcache accesses  ( +- 0.03% )  (47.32%)
   262,784,890,310       cache-references                 #  100.888 M/sec                    ( +- 0.04% )  (47.32%)
    52,812,245,646       cache-misses                     #   20.094 % of all cache refs      ( +- 0.11% )  (47.32%)
 2,605,653,350,299       all_data_cache_accesses          #    1.000 G/sec                    ( +- 0.01% )  (47.32%)
    95,770,076,665       l1_data_cache_fills_all          #   36.768 M/sec                    ( +- 0.03% )  (47.30%)
     7,134,690,513       l1_data_cache_fills_from_memory  #    2.739 M/sec                    ( +- 0.12% )  (47.29%)

            307.02 +- 4.15 seconds time elapsed  ( +- 1.35% )

[4] Icelake, kernel-build

Performance counter stats for './compile.sh' (5 runs):

           421,633       cs                               #  358.780 /sec                     ( +- 0.04% )
      1,173,522.36 msec  task-clock                       #    8.662 CPUs utilized            ( +- 0.14% )
 2,991,427,421,282       cycles                           #    2.545 GHz                      ( +- 0.15% )  (82.42%)
 5,410,090,251,681       instructions                     #    1.81  insn per cycle           ( +- 0.02% )  (91.13%)
 1,189,406,048,438       branches                         #    1.012 G/sec                    ( +- 0.02% )  (91.05%)
    21,291,454,717       branch-misses                    #    1.79% of all branches          ( +- 0.02% )  (91.06%)
 1,462,419,736,675       L1-dcache-loads                  #    1.244 G/sec                    ( +- 0.02% )  (91.06%)
    47,084,269,809       L1-dcache-load-misses            #    3.22% of all L1-dcache accesses  ( +- 0.01% )  (91.05%)
    23,527,140,332       cache-references                 #   20.020 M/sec                    ( +- 0.13% )  (91.04%)
     5,093,132,060       cache-misses                     #   21.682 % of all cache refs      ( +- 0.42% )  (91.03%)
     4,220,672,439       LLC-loads                        #    3.591 M/sec                    ( +- 0.14% )  (91.04%)
     1,501,704,609       LLC-load-misses                  #   35.56% of all LL-cache accesses ( +- 0.83% )  (73.10%)

           135.478 +- 0.335 seconds time elapsed  ( +- 0.25% )

[5] Icelake, kernel-build + (bg: clear_page_erms() workload)

Performance counter stats for './compile.sh' (5 runs):

           410,611       cs                               #  347.771 /sec                     ( +- 0.02% )
      1,184,382.84 msec  task-clock                       #    8.661 CPUs utilized            ( +- 0.08% )
 3,018,535,155,772       cycles                           #    2.557 GHz                      ( +- 0.08% )  (82.42%)
 5,408,788,104,113       instructions                     #    1.80  insn per cycle           ( +- 0.00% )  (91.13%)
 1,189,173,209,515       branches                         #    1.007 G/sec                    ( +- 0.00% )  (91.05%)
    21,279,087,578       branch-misses                    #    1.79% of all branches          ( +- 0.01% )  (91.06%)
 1,462,243,374,967       L1-dcache-loads                  #    1.238 G/sec                    ( +- 0.00% )  (91.05%)
    47,210,704,159       L1-dcache-load-misses            #    3.23% of all L1-dcache accesses  ( +- 0.02% )  (91.04%)
    23,378,470,958       cache-references                 #   19.801 M/sec                    ( +- 0.03% )  (91.05%)
     5,339,921,426       cache-misses                     #   22.814 % of all cache refs      ( +- 0.87% )  (91.03%)
     4,241,388,134       LLC-loads                        #    3.592 M/sec                    ( +- 0.02% )  (91.05%)
     1,588,055,137       LLC-load-misses                  #   37.44% of all LL-cache accesses ( +- 0.99% )  (73.09%)

           136.750 +- 0.315 seconds time elapsed  ( +- 0.23% )

[6] Icelake, kernel-build + (bg: clear_page_movnt() workload)

Performance counter stats for './compile.sh' (5 runs):

           409,978       cs                               #  347.850 /sec                     ( +- 0.06% )
      1,174,090.99 msec  task-clock                       #    8.655 CPUs utilized            ( +- 0.10% )
 2,992,914,428,930       cycles                           #    2.539 GHz                      ( +- 0.10% )  (82.40%)
 5,408,632,560,457       instructions                     #    1.80  insn per cycle           ( +- 0.00% )  (91.12%)
 1,189,083,425,674       branches                         #    1.009 G/sec                    ( +- 0.00% )  (91.05%)
    21,273,992,588       branch-misses                    #    1.79% of all branches          ( +- 0.02% )  (91.05%)
 1,462,081,591,012       L1-dcache-loads                  #    1.241 G/sec                    ( +- 0.00% )  (91.05%)
    47,071,136,770       L1-dcache-load-misses            #    3.22% of all L1-dcache accesses  ( +- 0.03% )  (91.04%)
    23,331,268,072       cache-references                 #   19.796 M/sec                    ( +- 0.05% )  (91.04%)
     4,953,198,057       cache-misses                     #   21.190 % of all cache refs      ( +- 0.51% )  (91.04%)
     4,194,721,070       LLC-loads                        #    3.559 M/sec                    ( +- 0.10% )  (91.06%)
     1,412,414,538       LLC-load-misses                  #   33.54% of all LL-cache accesses ( +- 1.17% )  (73.09%)

           135.654 +- 0.203 seconds time elapsed  ( +- 0.15% )

Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
---
 fs/hugetlbfs/inode.c |  7 ++++++-
 mm/gup.c             | 20 ++++++++++++++++++++
 mm/huge_memory.c     |  2 +-
 mm/hugetlb.c         |  9 ++++++++-
 4 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index cdfb1ae78a3f..44cee9d30035 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -636,6 +636,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	loff_t hpage_size = huge_page_size(h);
 	unsigned long hpage_shift = huge_page_shift(h);
 	pgoff_t start, index, end;
+	bool hint_uncached;
 	int error;
 	u32 hash;
 
@@ -653,6 +654,9 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	start = offset >> hpage_shift;
 	end = (offset + len + hpage_size - 1) >> hpage_shift;
 
+	/* Don't pollute the cache if we are fallocate'ing a large region. */
+	hint_uncached = clear_page_prefer_uncached((end - start) << hpage_shift);
+
 	inode_lock(inode);
 
 	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
@@ -731,7 +735,8 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 			error = PTR_ERR(page);
 			goto out;
 		}
-		clear_huge_page(page, addr, pages_per_huge_page(h));
+		clear_huge_page(page, addr, pages_per_huge_page(h),
+				hint_uncached);
 		__SetPageUptodate(page);
 		error = huge_add_to_page_cache(page, mapping, index);
 		if (unlikely(error)) {
diff --git a/mm/gup.c b/mm/gup.c
index 886d6148d3d0..930944e0c6eb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -933,6 +933,13 @@ static int faultin_page(struct vm_area_struct *vma,
 		 */
 		fault_flags |= FAULT_FLAG_TRIED;
 	}
+	if (*flags & FOLL_HINT_BULK) {
+		/*
+		 * From the user hint, we might be faulting-in a large region
+		 * so minimize cache-pollution.
+		 */
+		fault_flags |= FAULT_FLAG_UNCACHED;
+	}
 
 	ret = handle_mm_fault(vma, address, fault_flags, NULL);
 	if (ret & VM_FAULT_ERROR) {
@@ -1100,6 +1107,19 @@ static long __get_user_pages(struct mm_struct *mm,
 	if (!(gup_flags & FOLL_FORCE))
 		gup_flags |= FOLL_NUMA;
 
+	/*
+	 * Uncached page clearing is generally faster when clearing regions
+	 * sized ~LLC/2 or thereabouts. So hint the uncached path based
+	 * on clear_page_prefer_uncached().
+	 *
+	 * Note, however that this get_user_pages() call might end up
+	 * needing to clear an extent smaller than nr_pages when we have
+	 * taken the (potentially slower) uncached path based on the
+	 * expectation of a larger nr_pages value.
+	 */
+	if (clear_page_prefer_uncached(nr_pages * PAGE_SIZE))
+		gup_flags |= FOLL_HINT_BULK;
+
 	do {
 		struct page *page;
 		unsigned int foll_flags = gup_flags;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ffd4b07285ba..2d239967a8a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -600,7 +600,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	vm_fault_t ret = 0;
-	bool uncached = false;
+	bool uncached = vmf->flags & FAULT_FLAG_UNCACHED;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a920b1133cdb..35b643df5854 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4874,7 +4874,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
-	bool uncached = false;
+	bool uncached = flags & FAULT_FLAG_UNCACHED;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5503,6 +5503,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				 */
 				fault_flags |= FAULT_FLAG_TRIED;
 			}
+			if (flags & FOLL_HINT_BULK) {
+				/*
+				 * From the user hint, we might be faulting-in a large
+				 * region so minimize cache-pollution.
+				 */
+				fault_flags |= FAULT_FLAG_UNCACHED;
+			}
 			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
 			if (ret & VM_FAULT_ERROR) {
 				err = vm_fault_to_errno(ret, flags);
-- 
2.29.2
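
As a reading aid, the way the hint travels through the hunks above (request size, then FOLL_HINT_BULK, then FAULT_FLAG_UNCACHED, then the huge-page clearing path) can be modelled in a few lines of standalone C. This is a userspace sketch, not kernel code: the flag bit values, the SKETCH_* constants and the LLC/2 threshold inside clear_page_prefer_uncached() are assumptions made for illustration, and gup_hint(), fault_hint() and use_uncached_clear() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values only; the real bits and threshold live in the kernel. */
#define FOLL_HINT_BULK		0x1u
#define FAULT_FLAG_UNCACHED	0x1u
#define SKETCH_PAGE_SIZE	4096UL
#define SKETCH_LLC_SIZE		(32UL << 20)	/* assumed: 32MB, as on a Milan CCX */

/* Assumed policy: prefer uncached stores once the extent reaches ~LLC/2. */
bool clear_page_prefer_uncached(size_t extent)
{
	return extent >= SKETCH_LLC_SIZE / 2;
}

/* Models __get_user_pages(): turn the request size into a hint. */
unsigned int gup_hint(unsigned int gup_flags, unsigned long nr_pages)
{
	if (clear_page_prefer_uncached(nr_pages * SKETCH_PAGE_SIZE))
		gup_flags |= FOLL_HINT_BULK;
	return gup_flags;
}

/* Models faultin_page()/follow_hugetlb_page(): forward the hint. */
unsigned int fault_hint(unsigned int fault_flags, unsigned int gup_flags)
{
	if (gup_flags & FOLL_HINT_BULK)
		fault_flags |= FAULT_FLAG_UNCACHED;
	return fault_flags;
}

/*
 * Only the huge-page fault paths read the flag, so a PAGE_SIZE-backed
 * region keeps the cached clear even when the hint is set.
 */
bool use_uncached_clear(unsigned int fault_flags, bool huge_page)
{
	return huge_page && (fault_flags & FAULT_FLAG_UNCACHED);
}
```

Under these assumed constants, a 16-page request stays on the cached path, while an 8192-page (32MB) request backed by huge pages ends up with use_uncached_clear() returning true.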