[RFC] mm, THP: Map read-only text segments using large THP pages

William Kucharski <william.kucharski@xxxxxxxxxx> · Mon, 14 May 2018 07:12:13 -0600

One of the downsides of THP as currently implemented is that it only supports
large page mappings for anonymous pages.

I embarked upon this prototype on the theory that it would be advantageous to 
be able to map large ranges of read-only text pages using THP as well.

The idea is that the kernel will attempt to allocate and map the range using a 
PMD sized THP page upon first fault; if the allocation is successful the page 
will be populated (at present using a call to kernel_read()) and the page will 
be mapped at the PMD level. If memory allocation fails, the page fault routines 
will drop through to the conventional PAGESIZE-oriented routines for mapping 
the faulting page.

Since this approach will map a PMD size block of the memory map at a time, we 
should see a slight uptick in time spent in disk I/O but a substantial drop in 
page faults as well as a reduction in iTLB misses as address ranges will be 
mapped with the larger page. Analysis of a test program that consists of a very 
large text area (483,138,032 bytes in size) that thrashes D$ and I$ shows this 
does occur and there is a slight reduction in program execution time.

The text segment as seen from readelf:

LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
               0x000000001ccc19f0 0x000000001ccc19f0  R E    0x200000

As currently implemented for test purposes, the prototype will only use large 
pages to map an executable with a particular filename ("testr"), enabling easy 
comparison of the same executable using 4K and 2M (x64) pages on the same 
kernel. It is understood that this is just a proof of concept implementation 
and much more work regarding enabling the feature and overall system usage of 
it would need to be done before it was submitted as a kernel patch. However, I 
felt it would be worthy to send it out as an RFC so I can find out whether 
there are huge objections from the community to doing this at all, or a better 
understanding of the major concerns that must be assuaged before it would even 
be considered. I currently hardcode CONFIG_TRANSPARENT_HUGEPAGE to the 
equivalent of "always" and bypass some checks for anonymous pages by simply 
#ifdefing the code out; obviously I would need to determine the right thing to 
do in those cases.

Current comparisons of 4K vs 2M pages as generated by "perf stat -d -d -d -r10" 
follow; the 4K pagesize program was named "foo" and the 2M pagesize program 
"testr" (as noted above) - please note that these numbers do vary from run to 
run, but the orders of magnitude of the differences between the two versions 
remain relatively constant:

4K Pages:
=========
Performance counter stats for './foo' (10 runs):

    307054.450421      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.21% )
                0      context-switches:u        #    0.000 K/sec
                0      cpu-migrations:u          #    0.000 K/sec
            7,728      page-faults:u             #    0.025 K/sec                    ( +-  0.00% )
1,401,295,823,265      cycles:u                  #    4.564 GHz                      ( +-  0.19% )  (30.77%)
  562,704,668,718      instructions:u            #    0.40  insn per cycle           ( +-  0.00% )  (38.46%)
   20,100,243,102      branches:u                #   65.461 M/sec                    ( +-  0.00% )  (38.46%)
        2,628,944      branch-misses:u           #    0.01% of all branches          ( +-  3.32% )  (38.46%)
  180,885,880,185      L1-dcache-loads:u         #  589.100 M/sec                    ( +-  0.00% )  (38.46%)
   40,374,420,279      L1-dcache-load-misses:u   #   22.32% of all L1-dcache hits    ( +-  0.01% )  (38.46%)
      232,184,583      LLC-loads:u               #    0.756 M/sec                    ( +-  1.48% )  (30.77%)
       23,990,082      LLC-load-misses:u         #   10.33% of all LL-cache hits     ( +-  1.48% )  (30.77%)
  <not supported>      L1-icache-loads:u
   74,897,499,234      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
  180,990,026,447      dTLB-loads:u              #  589.440 M/sec                    ( +-  0.00% )  (30.77%)
          707,373      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +-  4.62% )  (30.77%)
        5,583,675      iTLB-loads:u              #    0.018 M/sec                    ( +-  0.31% )  (30.77%)
    1,219,514,499      iTLB-load-misses:u        # 21840.71% of all iTLB cache hits  ( +-  0.01% )  (30.77%)
  <not supported>      L1-dcache-prefetches:u
  <not supported>      L1-dcache-prefetch-misses:u

307.093088771 seconds time elapsed                                          ( +-  0.20% )

2M Pages:
=========
Performance counter stats for './testr' (10 runs):

    289504.209769      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.19% )
                0      context-switches:u        #    0.000 K/sec
                0      cpu-migrations:u          #    0.000 K/sec
              598      page-faults:u             #    0.002 K/sec                    ( +-  0.03% )
1,323,835,488,984      cycles:u                  #    4.573 GHz                      ( +-  0.19% )  (30.77%)
  562,658,682,055      instructions:u            #    0.43  insn per cycle           ( +-  0.00% )  (38.46%)
   20,099,662,528      branches:u                #   69.428 M/sec                    ( +-  0.00% )  (38.46%)
        2,877,086      branch-misses:u           #    0.01% of all branches          ( +-  4.52% )  (38.46%)
  180,899,297,017      L1-dcache-loads:u         #  624.859 M/sec                    ( +-  0.00% )  (38.46%)
   40,209,140,089      L1-dcache-load-misses:u   #   22.23% of all L1-dcache hits    ( +-  0.00% )  (38.46%)
      135,968,232      LLC-loads:u               #    0.470 M/sec                    ( +-  1.56% )  (30.77%)
        6,704,890      LLC-load-misses:u         #    4.93% of all LL-cache hits     ( +-  1.92% )  (30.77%)
  <not supported>      L1-icache-loads:u
   74,955,673,747      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
  180,987,794,366      dTLB-loads:u              #  625.165 M/sec                    ( +-  0.00% )  (30.77%)
              835      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +- 14.35% )  (30.77%)
        6,386,207      iTLB-loads:u              #    0.022 M/sec                    ( +-  0.42% )  (30.77%)
       51,929,869      iTLB-load-misses:u        #  813.16% of all iTLB cache hits   ( +-  1.61% )  (30.77%)
  <not supported>      L1-dcache-prefetches:u
  <not supported>      L1-dcache-prefetch-misses:u

289.551551387 seconds time elapsed                                          ( +-  0.20% )

A check of /proc/meminfo with the test program running shows the large mappings:

ShmemPmdMapped:   471040 kB

FAQ:
====
Q: What kernel is the prototype based on?
A: 4.14.0-rc7

Q: What is the biggest issue you haven't addressed?
A: Given this is a prototype, there are many. Aside from the fact that I 
   only map large pages for an  executable of a specific name ("testr"), the 
   code must be integrated with large page size support in the page cache 
   as currently multiple iterations of an executable would each use their 
   own individually allocated THP pages and those pages filled with data 
   using kernel_read(), which  allows for performance characterization but 
   would never be acceptable for a production kernel.

   A good example of the large page support required is the ext4 support
   outlined in:

     https://www.mail-archive.com/linux-block@xxxxxxxxxxxxxxx/msg04012.html

   There also need to be configuration options to enable this code at all, 
   likely only for file systems that support large pages, and more 
   reasonable fixes for the assumptions that all large THP pages are 
   anonymous assertions in rmap.c (for the prototype I just "#if 0" them out.)

Q: Which processes get their text as large pages?
A: At this point with this implementation it's any process with a read-only
   text area of the proper size/alignment.

   An attempt is made to align the address for non-MAP_FIXED addresses.

   I do not make any attempt to move mappings that take up a majority of a 
   large page to a large page; I only map a large page if the address 
   aligns and the map size is larger than or equal to a large page.  

Q: Which architectures has this been tested on?
A: At present, only x64.

Q: How about architectures (ARM, for instance) with multiple large page 
   sizes that are reasonable for text mappings?
A: At present a "large page" is just PMD size; it would be possible with
   additional effort to allow for mapping using PUD-sized pages.

Q: What about the use of non-PMD large page sizes (on non-x86 architectures)?
A: I haven't looked into that; I don't have an answer as to how to best 
   map a page that wasn't sized to be a PMD or PUD.

Signed-off-by: William Kucharski <william.kucharski@xxxxxxxxxx>

===============================================================

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed113ea..f4ac381 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -146,8 +146,8 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
	if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
		return -EINVAL;

-	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
-	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
+	vma_len = (loff_t)(vma->vm_end - vma->vm_start);	/* length of VMA */
+	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);	/* add vma->vm_pgoff * PAGESIZE */
	/* check for overflow */
	if (len < vma_len)
		return -EINVAL;
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 87067d2..353bec8 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -80,13 +80,15 @@ extern struct kobj_attribute shmem_enabled_attr;
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define HPAGE_PMD_SHIFT PMD_SHIFT
-#define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
-#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))
-
-#define HPAGE_PUD_SHIFT PUD_SHIFT
-#define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
-#define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
+#define HPAGE_PMD_SHIFT		PMD_SHIFT
+#define HPAGE_PMD_SIZE		((1UL) << HPAGE_PMD_SHIFT)
+#define	HPAGE_PMD_OFFSET	(HPAGE_PMD_SIZE - 1)
+#define	HPAGE_PMD_MASK		(~(HPAGE_PMD_OFFSET))
+
+#define HPAGE_PUD_SHIFT		PUD_SHIFT
+#define HPAGE_PUD_SIZE		((1UL) << HPAGE_PUD_SHIFT)
+#define	HPAGE_PUD_OFFSET	(HPAGE_PUD_SIZE - 1)
+#define HPAGE_PUD_MASK		(~(HPAGE_PUD_OFFSET))

extern bool is_vma_temporary_stack(struct vm_area_struct *vma);

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1981ed6..7b61c92 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -445,6 +445,14 @@ subsys_initcall(hugepage_init);

static int __init setup_transparent_hugepage(char *str)
{
+#if 1
+	set_bit(TRANSPARENT_HUGEPAGE_FLAG,
+		&transparent_hugepage_flags);
+	clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+		  &transparent_hugepage_flags);
+	printk("THP permanently set ON\n");
+	return 1;
+#else
	int ret = 0;
	if (!str)
		goto out;
@@ -471,6 +479,7 @@ static int __init setup_transparent_hugepage(char *str)
	if (!ret)
		pr_warn("transparent_hugepage= cannot parse, ignored\n");
	return ret;
+#endif
}
__setup("transparent_hugepage=", setup_transparent_hugepage);

@@ -532,8 +541,11 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,

	if (addr)
		goto out;
+
+#if 0
	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
		goto out;
+#endif

	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
	if (addr)
diff --git a/mm/memory.c b/mm/memory.c
index a728bed..fc352d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3506,7 +3506,99 @@ late_initcall(fault_around_debugfs);
 * fault_around_pages() value (and therefore to page order).  This way it's
 * easier to guarantee that we don't cross page table boundaries.
 */
-static int do_fault_around(struct vm_fault *vmf)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static
+int do_fault_around_thp(struct vm_fault *vmf)
+{
+        struct file *file = vmf->vma->vm_file;
+	unsigned long address = vmf->address;
+	pgoff_t start_pgoff = vmf->pgoff;
+	pgoff_t end_pgoff;
+	int ret = VM_FAULT_FALLBACK;
+	int off;
+
+	/*
+	 * vmf->address will be the higher of (fault address & HPAGE_PMD_MASK)
+	 * or the start of the VMA.
+	 */
+	vmf->address = max((address & HPAGE_PMD_MASK), vmf->vma->vm_start);
+
+	/*
+	 * Not a candidate if the start address calculated above isnt properly
+	 * aligned
+	 */
+	if (vmf->address & HPAGE_PMD_OFFSET)
+		goto dfa_thp_out;
+
+	off = ((address - vmf->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+	start_pgoff -= off;
+
+	/*
+	 *  end_pgoff is either end of page table or end of vma
+	 *  or fault_around_pages() from start_pgoff, depending what is
+	 *  smallest.
+	 */
+	end_pgoff = start_pgoff -
+		((vmf->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+		PTRS_PER_PTE - 1;
+	end_pgoff = min3(end_pgoff, vma_pages(vmf->vma) + vmf->vma->vm_pgoff - 1,
+			start_pgoff + PTRS_PER_PTE - 1);
+
+	/*
+	 * Check to see if we could map this request with a large THP page
+	 * instead.
+	 */
+	if (((strncmp(file->f_path.dentry->d_name.name, "testr", 5) == 0)) &&
+		pmd_none(*vmf->pmd) &&
+		((end_pgoff - start_pgoff) >=
+		((HPAGE_PMD_SIZE >> PAGE_SHIFT) - 1))) {
+		struct page *page;
+
+		page = alloc_pages_vma(vmf->gfp_mask | __GFP_COMP |
+			__GFP_NORETRY, HPAGE_PMD_ORDER, vmf->vma,
+			vmf->address, numa_node_id(), 1);
+
+		if ((likely(page)) && (PageTransCompound(page))) {
+			ssize_t bytes_read;
+			void *pg_vaddr;
+
+			prep_transhuge_page(page);
+			pg_vaddr = page_address(page);
+
+			if (likely(pg_vaddr)) {
+				loff_t loff = (loff_t)
+					(start_pgoff << PAGE_SHIFT);
+				bytes_read = kernel_read(file, pg_vaddr,
+					HPAGE_PMD_SIZE, &loff);
+				VM_BUG_ON(bytes_read != HPAGE_PMD_SIZE);
+
+				smp_wmb(); /* See comment in __pte_alloc() */
+				ret = alloc_set_pte(vmf, NULL, page);
+
+				if (likely(ret == 0)) {
+					VM_BUG_ON_PAGE(pmd_none(*vmf->pmd),
+						page);
+					vmf->page = page;
+					ret = VM_FAULT_NOPAGE;
+					goto dfa_thp_out;
+				}
+			}
+
+			put_page(page);
+		}
+	}
+
+dfa_thp_out:
+	vmf->address = address;
+	VM_BUG_ON(vmf->pte != NULL);
+	return ret;
+}
+#endif	/* CONFIG_TRANSPARENT_HUGEPAGE */
+
+
+static
+int do_fault_around(struct vm_fault *vmf)
{
	unsigned long address = vmf->address, nr_pages, mask;
	pgoff_t start_pgoff = vmf->pgoff;
@@ -3566,6 +3658,21 @@ static int do_read_fault(struct vm_fault *vmf)
	struct vm_area_struct *vma = vmf->vma;
	int ret = 0;

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/*
+	 * Check to see if we could map this request with a large THP page
+	 * instead.
+	 */
+	if ((vma_pages(vmf->vma) >= PTRS_PER_PMD) &&
+		((strncmp(vmf->vma->vm_file->f_path.dentry->d_name.name,
+		"testr", 5)) == 0)) {
+			ret = do_fault_around_thp(vmf);
+
+			if (ret == VM_FAULT_NOPAGE)
+				return ret;
+	}
+#endif	/* CONFIG_TRANSPARENT_HUGEPAGE */
+
	/*
	 * Let's call ->map_pages() first and use ->fault() as fallback
	 * if page by the offset is not ready to be mapped (cold cache or
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506f..1c281d7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1327,6 +1327,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
	struct mm_struct *mm = current->mm;
	int pkey = 0;

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long thp_maywrite = VM_MAYWRITE;
+#endif
+
	*populate = 0;

	if (!len)
@@ -1361,7 +1365,32 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
	/* Obtain the address to map to. we verify (or select) it and ensure
	 * that it represents a valid section of the address space.
	 */
-	addr = get_unmapped_area(file, addr, len, pgoff, flags);
+
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/*
+	 *
+	 * If THP is enabled, and it's a read-only executable that is
+	 * MAP_PRIVATE mapped, call the appropriate thp function to perhaps get a
+	 * large page aligned virtual address, otherwise use the normal routine.
+	 * 
+	 * Note the THP routine will return a normal page size aligned start
+	 * address in some cases.
+	 */
+	if ((prot & PROT_READ) && (prot & PROT_EXEC) && (!(prot & PROT_WRITE)) &&
+		(len >= HPAGE_PMD_SIZE) && (flags & MAP_PRIVATE) &&
+		((!(flags & MAP_FIXED)) || (!(addr & HPAGE_PMD_OFFSET)))) {
+			addr = thp_get_unmapped_area(file, addr, len, pgoff,
+				flags);
+			if (addr && (!(addr & HPAGE_PMD_OFFSET)))
+				thp_maywrite = 0;
+	} else {
+#endif
+		addr = get_unmapped_area(file, addr, len, pgoff, flags);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	}
+#endif
+
	if (offset_in_page(addr))
		return addr;

@@ -1376,7 +1405,11 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
	 * of the memory object, so we don't do any here.
	 */
	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			mm->def_flags | VM_MAYREAD | thp_maywrite | VM_MAYEXEC;
+#else
			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
+#endif

	if (flags & MAP_LOCKED)
		if (!can_do_mlock())
diff --git a/mm/rmap.c b/mm/rmap.c
index b874c47..4fc24f8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1184,7 +1184,9 @@ void page_add_file_rmap(struct page *page, bool compound)
		}
		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
			goto out;
+#if 0
		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+#endif
		__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
	} else {
		if (PageTransCompound(page) && page_mapping(page)) {
@@ -1224,7 +1226,9 @@ static void page_remove_file_rmap(struct page *page, bool compound)
		}
		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
			goto out;
+#if 0
		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+#endif
		__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
	} else {
		if (!atomic_add_negative(-1, &page->_mapcount))