clear_pages_rep(), clear_pages_erms() clear using string instructions.
While clearing extents of more than a single page, we can use these
more effectively by explicitly advertising the region-size to the
processor. This can be used as a hint by the processor-uarch to
optimize the clearing (e.g., to avoid polluting one or more levels of
the data-cache).

As a secondary benefit, string instructions are typically microcoded,
so it's a good idea to amortize the cost of the decode across larger
regions.

Accordingly, clear_huge_page() now does huge-page clearing in three
parts: the neighbourhood of the faulting address, the region to its
left, and the region to its right. The local neighbourhood is cleared
last to keep its cachelines hot.

Performance
==

Use mmap(MAP_HUGETLB) to demand fault a 128GB region (on the local
NUMA node):

Milan (EPYC 7J13, boost=1):

              mm/clear_huge_page   x86/clear_huge_page   change
                    (GB/s)               (GB/s)
 pg-sz=2MB          14.55                19.29            +32.5%
 pg-sz=1GB          19.34                49.60           +156.4%

Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, so we see a dropoff in cacheline allocations for
pg-sz=1GB:

pg-sz=1GB:

 - 23,088,001,347   cycles                 #    3.487 GHz                        ( +-  0.08% )  (35.68%)
 -  4,680,678,939   L1-dcache-loads        #  706.831 M/sec                      ( +-  0.02% )  (35.74%)
 -  2,150,395,280   L1-dcache-load-misses  #   45.93% of all L1-dcache accesses  ( +-  0.01% )  (35.74%)

 +  8,983,798,764   cycles                 #    3.489 GHz                        ( +-  0.05% )  (35.59%)
 +     18,294,725   L1-dcache-loads        #    7.104 M/sec                      ( +- 18.88% )  (35.78%)
 +      6,677,565   L1-dcache-load-misses  #   30.48% of all L1-dcache accesses  ( +- 20.72% )  (35.78%)

That's not the case with pg-sz=2MB, where we perform better but the
number of cacheline allocations remains the same:

pg-sz=2MB:

 - 31,087,683,852   cycles                 #    3.494 GHz                        ( +-  0.17% )  (35.72%)
 -  4,898,684,886   L1-dcache-loads        #  550.627 M/sec                      ( +-  0.03% )  (35.71%)
 -  2,161,434,236   L1-dcache-load-misses  #   44.11% of all L1-dcache accesses  ( +-  0.01% )  (35.71%)

 + 23,368,914,596   cycles                 #    3.480 GHz                        ( +-  0.27% )  (35.72%)
 +  4,481,808,430   L1-dcache-loads        #  667.382 M/sec                      ( +-  0.03% )  (35.71%)
 +  2,170,453,309   L1-dcache-load-misses  #   48.41% of all L1-dcache accesses  ( +-  0.06% )  (35.71%)

Icelakex (Platinum 8358, no_turbo=0):

              mm/clear_huge_page   x86/clear_huge_page   change
                    (GB/s)               (GB/s)
 pg-sz=2MB           9.19                12.94            +40.8%
 pg-sz=1GB           9.36                12.97            +38.5%

For both page sizes, Icelakex behaves similarly to Milan at pg-sz=2MB:
we see a drop in cycles but no drop in cacheline allocations.

Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
---
 arch/x86/mm/hugetlbpage.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 5804bbae4f01..0b9f7a6dad93 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -148,6 +148,60 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	return hugetlb_get_unmapped_area_topdown(file, addr, len,
 			pgoff, flags);
 }
+
+#ifndef CONFIG_HIGHMEM
+static void clear_contig_region(struct page *page, unsigned int npages)
+{
+	clear_pages(page_address(page), npages);
+}
+
+/*
+ * clear_huge_page(): multi-page clearing variant of clear_huge_page().
+ *
+ * Taking inspiration from the common code variant, we split the zeroing in
+ * three parts: left of the fault, right of the fault, and up to 5 pages
+ * in the immediate neighbourhood of the target page.
+ *
+ * Cleared in that order to keep cache lines of the target region hot.
+ *
+ * For gigantic pages, there is no expectation of cache locality so we do a
+ * straight zeroing.
+ */
+void clear_huge_page(struct page *page,
+		     unsigned long addr_hint, unsigned int pages_per_huge_page)
+{
+	unsigned long addr = addr_hint &
+		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+	const long pgidx = (addr_hint - addr) / PAGE_SIZE;
+	const int first_pg = 0, last_pg = pages_per_huge_page - 1;
+	const int width = 2; /* pages cleared last on either side */
+	int sidx[3], eidx[3];
+	int i, n;
+
+	if (pages_per_huge_page > MAX_ORDER_NR_PAGES)
+		return clear_contig_region(page, pages_per_huge_page);
+
+	/*
+	 * Neighbourhood of the fault. Cleared at the end to ensure
+	 * it sticks around in the cache.
+	 */
+	n = 2;
+	sidx[n] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
+	eidx[n] = (pgidx + width) > last_pg ? last_pg : (pgidx + width);
+
+	sidx[0] = first_pg;	/* Region to the left of the fault */
+	eidx[0] = sidx[n] - 1;
+
+	sidx[1] = eidx[n] + 1;	/* Region to the right of the fault */
+	eidx[1] = last_pg;
+
+	for (i = 0; i <= 2; i++) {
+		if (eidx[i] >= sidx[i])
+			clear_contig_region(page + sidx[i],
+					    eidx[i] - sidx[i] + 1);
+	}
+}
+#endif /* CONFIG_HIGHMEM */
 #endif /* CONFIG_HUGETLB_PAGE */
 
 #ifdef CONFIG_X86_64
-- 
2.31.1
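
P.S. For readers who want to reproduce the numbers: the benchmark described
above is simply demand-faulting a freshly mmap()ed MAP_HUGETLB region. The
actual harness is not part of this patch; the sketch below is only an
illustration of that kind of measurement. The region size (128GB), the
MAP_HUGE_1GB flag, and the 1GB touch stride are assumptions for the
pg-sz=1GB case (drop MAP_HUGE_1GB and use a 2MB stride for pg-sz=2MB); it
also assumes a preallocated 1GB hugepage pool (e.g. hugepagesz=1G
hugepages=128) and being run bound to one node (e.g. under numactl) to
match the "local NUMA node" setup.

/* Illustrative demand-fault benchmark sketch, not the harness used above. */
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30U << 26)	/* 30 == log2(1GB), 26 == MAP_HUGE_SHIFT */
#endif

int main(void)
{
	const size_t len  = 128UL << 30;	/* assumed: 128GB region */
	const size_t step =   1UL << 30;	/* assumed: pg-sz=1GB; use 2UL << 20 for pg-sz=2MB */
	struct timespec t0, t1;
	unsigned char *p;
	double secs;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");		/* needs a large enough hugepage pool */
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (size_t off = 0; off < len; off += step)
		p[off] = 1;		/* first touch demand-faults and clears each huge page */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.2f GB/s\n", (double)(len >> 30) / secs);
	return 0;
}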