[PATCH v2 6/9] x86/clear_huge_page: multi-page clearing

clear_pages_rep() and clear_pages_erms() clear using string instructions.
When clearing extents of more than a single page, we can use these
more effectively by explicitly advertising the region size to the
processor.

The processor uarch can use this as a hint to optimize the clearing
(e.g. to avoid polluting one or more levels of the data cache).

As a secondary benefit, string instructions are typically microcoded,
and so it's a good idea to amortize the cost of the decode across larger
regions.
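
As an illustration only (a userspace sketch, not the clear_pages()
implementation this series uses), the difference between page-at-a-time
and multi-page clearing comes down to the count the string instruction
is given. The names clear_extent(), clear_by_page(), clear_multi_page()
and SKETCH_PAGE_SIZE are hypothetical, and 4K base pages are assumed:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4K base pages */

/* Zero @len bytes at @dst with a single string instruction on x86-64. */
static void clear_extent(void *dst, size_t len)
{
	assert(dst || !len);
#if defined(__x86_64__) && defined(__GNUC__)
	/* RDI = destination, RCX = byte count, AL = fill value */
	__asm__ __volatile__("rep stosb"
			     : "+D" (dst), "+c" (len)
			     : "a" (0)
			     : "memory");
#else
	memset(dst, 0, len);	/* portable fallback for other targets */
#endif
}

/* Page-at-a-time: each instruction only ever sees a 4K length. */
static void clear_by_page(void *dst, unsigned int npages)
{
	unsigned int i;

	for (i = 0; i < npages; i++)
		clear_extent((char *)dst + i * SKETCH_PAGE_SIZE,
			     SKETCH_PAGE_SIZE);
}

/* Multi-page: the full extent length is visible in RCX up front. */
static void clear_multi_page(void *dst, unsigned int npages)
{
	clear_extent(dst, npages * SKETCH_PAGE_SIZE);
}
```

Both variants zero the same bytes; only the second gives the processor
the whole extent in one instruction, which is also where the decode
cost is paid once instead of once per page.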

Accordingly, clear_huge_page() now does huge-page clearing in three
parts: the neighbourhood of the faulting address, the left, and the
right region of the neighbourhood.

The local neighbourhood is cleared last to keep its cachelines hot.
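
The split can be sketched in userspace as below. This mirrors the index
arithmetic in the patch, but struct extent, split_extents() and
extent_pages() are hypothetical names used only for illustration:

```c
#include <assert.h>

struct extent {
	int start;	/* first page index in the extent */
	int end;	/* last page index, inclusive; end < start is empty */
};

/*
 * Compute the three extents for a huge page of @npages pages faulted at
 * page index @pgidx, in clearing order:
 *   out[0] = left region, out[1] = right region,
 *   out[2] = neighbourhood of the fault (cleared last).
 */
static void split_extents(int npages, int pgidx, int width,
			  struct extent out[3])
{
	const int first_pg = 0, last_pg = npages - 1;

	assert(npages > 0 && pgidx >= first_pg && pgidx <= last_pg);

	/* Clamp the neighbourhood to the bounds of the huge page. */
	out[2].start = (pgidx - width) < first_pg ? first_pg : pgidx - width;
	out[2].end   = (pgidx + width) > last_pg  ? last_pg  : pgidx + width;

	out[0].start = first_pg;
	out[0].end   = out[2].start - 1;

	out[1].start = out[2].end + 1;
	out[1].end   = last_pg;
}

/* Number of pages in an extent; 0 if the extent is empty. */
static int extent_pages(const struct extent *e)
{
	return e->end >= e->start ? e->end - e->start + 1 : 0;
}
```

For a 2MB page (512 base pages) faulted at page index 100 with a width
of 2, this gives a left region of pages 0..97, a right region of pages
103..511 and a neighbourhood of pages 98..102. A fault on the first or
last page leaves the left or right region empty.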

Performance
===========

Use mmap(MAP_HUGETLB) to demand fault a 128GB region (on the local
NUMA node):

Milan (EPYC 7J13, boost=1):

              mm/clear_huge_page   x86/clear_huge_page   change
                          (GB/s)                (GB/s)

  pg-sz=2MB                14.55                 19.29    +32.5%
  pg-sz=1GB                19.34                 49.60   +156.4%

Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, so we see a dropoff in cacheline-allocations for
pg-sz=1GB:

pg-sz=1GB:
    -23,088,001,347      cycles                    #    3.487 GHz                      ( +-  0.08% )  (35.68%)
    - 4,680,678,939      L1-dcache-loads           #  706.831 M/sec                    ( +-  0.02% )  (35.74%)
    - 2,150,395,280      L1-dcache-load-misses     #   45.93% of all L1-dcache accesses  ( +-  0.01% )  (35.74%)

    + 8,983,798,764      cycles                    #    3.489 GHz                      ( +-  0.05% )  (35.59%)
    +    18,294,725      L1-dcache-loads           #    7.104 M/sec                    ( +- 18.88% )  (35.78%)
    +     6,677,565      L1-dcache-load-misses     #   30.48% of all L1-dcache accesses  ( +- 20.72% )  (35.78%)

That's not the case with pg-sz=2MB, where we perform better but the
number of cacheline allocations remains the same:

pg-sz=2MB:
    -31,087,683,852      cycles                    #    3.494 GHz                      ( +-  0.17% )  (35.72%)
    - 4,898,684,886      L1-dcache-loads           #  550.627 M/sec                    ( +-  0.03% )  (35.71%)
    - 2,161,434,236      L1-dcache-load-misses     #   44.11% of all L1-dcache accesses  ( +-  0.01% )  (35.71%)

    +23,368,914,596      cycles                    #    3.480 GHz                      ( +-  0.27% )  (35.72%)
    + 4,481,808,430      L1-dcache-loads           #  667.382 M/sec                    ( +-  0.03% )  (35.71%)
    + 2,170,453,309      L1-dcache-load-misses     #   48.41% of all L1-dcache accesses  ( +-  0.06% )  (35.71%)


Icelakex (Platinum 8358, no_turbo=0):

              mm/clear_huge_page   x86/clear_huge_page   change
                          (GB/s)                (GB/s)

  pg-sz=2MB                 9.19                 12.94   +40.8%
  pg-sz=1GB                 9.36                 12.97   +38.5%

For both page sizes, Icelakex behaves similarly to Milan pg-sz=2MB: we
see a drop in cycles but there's no drop in cacheline allocation.
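
For reference, a demand-fault workload of the kind measured above can be
approximated with the sketch below. This is not the test program used
for the numbers in this message; fault_region() is a hypothetical
helper, and NUMA placement is left to the caller:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>

/*
 * Map @len bytes anonymously with @extra_flags (e.g. MAP_HUGETLB) and
 * write one byte every @stride bytes so that each page is demand
 * faulted. Returns the elapsed time in seconds, or -1.0 if mmap() fails.
 */
static double fault_region(size_t len, size_t stride, int extra_flags)
{
	struct timespec t0, t1;
	volatile char *p;
	size_t off;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
	if (p == MAP_FAILED)
		return -1.0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (off = 0; off < len; off += stride)
		p[off] = 1;	/* first touch faults the page in */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	munmap((void *)p, len);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

With hugepages reserved, the pg-sz=2MB case would correspond to
something like fault_region(128UL << 30, 2UL << 20, MAP_HUGETLB), and
the pg-sz=1GB case to a 1GB stride with MAP_HUGETLB | MAP_HUGE_1GB.
Dividing the region size by the returned time gives a GB/s figure.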

Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
---
 arch/x86/mm/hugetlbpage.c | 54 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 5804bbae4f01..0b9f7a6dad93 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -148,6 +148,60 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 		return hugetlb_get_unmapped_area_topdown(file, addr, len,
 				pgoff, flags);
 }
+
+#ifndef CONFIG_HIGHMEM
+static void clear_contig_region(struct page *page, unsigned int npages)
+{
+	clear_pages(page_address(page), npages);
+}
+
+/*
+ * clear_huge_page(): multi-page clearing variant of the generic version.
+ *
+ * Taking inspiration from the common code variant, we split the zeroing in
+ * three parts: left of the fault, right of the fault, and up to 5 pages
+ * in the immediate neighbourhood of the target page.
+ *
+ * Cleared in that order to keep cache lines of the target region hot.
+ *
+ * For gigantic pages, there is no expectation of cache locality so we do a
+ * straight zeroing.
+ */
+void clear_huge_page(struct page *page,
+		     unsigned long addr_hint, unsigned int pages_per_huge_page)
+{
+	unsigned long addr = addr_hint &
+		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+	const long pgidx = (addr_hint - addr) / PAGE_SIZE;
+	const int first_pg = 0, last_pg = pages_per_huge_page - 1;
+	const int width = 2; /* pages cleared last on either side */
+	int sidx[3], eidx[3];
+	int i, n;
+
+	if (pages_per_huge_page > MAX_ORDER_NR_PAGES)
+		return clear_contig_region(page, pages_per_huge_page);
+
+	/*
+	 * Neighbourhood of the fault. Cleared at the end to ensure
+	 * it sticks around in the cache.
+	 */
+	n = 2;
+	sidx[n] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
+	eidx[n] = (pgidx + width) > last_pg  ? last_pg  : (pgidx + width);
+
+	sidx[0] = first_pg;	/* Region to the left of the fault */
+	eidx[0] = sidx[n] - 1;
+
+	sidx[1] = eidx[n] + 1;	/* Region to the right of the fault */
+	eidx[1] = last_pg;
+
+	for (i = 0; i <= 2; i++) {
+		if (eidx[i] >= sidx[i])
+			clear_contig_region(page + sidx[i],
+					    eidx[i] - sidx[i] + 1);
+	}
+}
+#endif /* CONFIG_HIGHMEM */
 #endif /* CONFIG_HUGETLB_PAGE */
 
 #ifdef CONFIG_X86_64
-- 
2.31.1