[CCing Kirill and fs-devel]

On Mon 14-05-18 07:12:13, William Kucharski wrote:
> One of the downsides of THP as currently implemented is that it only supports
> large page mappings for anonymous pages.

There is support for shmem merged already. ext4 was next on the plan
AFAIR but I haven't seen any patches and Kirill was busy with other
stuff IIRC.

> I embarked upon this prototype on the theory that it would be advantageous to
> be able to map large ranges of read-only text pages using THP as well.

Can the fs really support THP only for read mappings? What if those
pages are to be shared in a writable mapping as well? In other words,
can this all work without full THP support for a particular fs?

Keeping the rest of the email for new CC.

> The idea is that the kernel will attempt to allocate and map the range using a
> PMD sized THP page upon first fault; if the allocation is successful the page
> will be populated (at present using a call to kernel_read()) and the page will
> be mapped at the PMD level. If memory allocation fails, the page fault routines
> will drop through to the conventional PAGESIZE-oriented routines for mapping
> the faulting page.
>
> Since this approach will map a PMD size block of the memory map at a time, we
> should see a slight uptick in time spent in disk I/O but a substantial drop in
> page faults as well as a reduction in iTLB misses as address ranges will be
> mapped with the larger page. Analysis of a test program that consists of a very
> large text area (483,138,032 bytes in size) that thrashes D$ and I$ shows this
> does occur and there is a slight reduction in program execution time.
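
For readers new to the thread, the fault-time choice described in the quoted
paragraphs above can be sketched in user-space C. This is only an
illustration, not the patch's code: struct sketch_vma, choose_mapping() and
the SKETCH_* constants are hypothetical names, the 4 KiB / 2 MiB sizes assume
x86-64, and the pmd_none() and filename checks of the prototype are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PMD_SHIFT 21u                               /* 2 MiB PMD mapping (x86-64 assumption) */
#define SKETCH_PMD_SIZE  ((uint64_t)1 << SKETCH_PMD_SHIFT)
#define SKETCH_PMD_MASK  (~(SKETCH_PMD_SIZE - 1))

enum map_kind { MAP_WITH_PTE, MAP_WITH_PMD };

/* Hypothetical stand-in for the two VMA fields the decision needs. */
struct sketch_vma {
    uint64_t vm_start;  /* first byte of the mapping */
    uint64_t vm_end;    /* one past the last byte of the mapping */
};

/*
 * Decide how a read fault at fault_addr would be mapped.
 * huge_alloc_ok models whether the PMD-sized allocation succeeded.
 */
static enum map_kind choose_mapping(const struct sketch_vma *vma,
                                    uint64_t fault_addr, bool huge_alloc_ok)
{
    assert(fault_addr >= vma->vm_start && fault_addr < vma->vm_end);

    /* Round the fault address down to a PMD boundary. */
    uint64_t base = fault_addr & SKETCH_PMD_MASK;

    /* The PMD-sized block must lie entirely inside the VMA. */
    if (base < vma->vm_start)
        return MAP_WITH_PTE;
    if (vma->vm_end - base < SKETCH_PMD_SIZE)
        return MAP_WITH_PTE;

    /* Allocation failure drops through to the base-page path. */
    if (!huge_alloc_ok)
        return MAP_WITH_PTE;

    return MAP_WITH_PMD;
}
```

The point of the sketch is that the base-page path stays the default and the
PMD mapping is purely opportunistic, which is why the question of writable
sharers of the same file pages matters.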
>
> The text segment as seen from readelf:
>
>   LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
>                  0x000000001ccc19f0 0x000000001ccc19f0  R E    0x200000
>
> As currently implemented for test purposes, the prototype will only use large
> pages to map an executable with a particular filename ("testr"), enabling easy
> comparison of the same executable using 4K and 2M (x64) pages on the same
> kernel. It is understood that this is just a proof of concept implementation
> and much more work regarding enabling the feature and overall system usage of
> it would need to be done before it was submitted as a kernel patch. However, I
> felt it would be worthy to send it out as an RFC so I can find out whether
> there are huge objections from the community to doing this at all, or a better
> understanding of the major concerns that must be assuaged before it would even
> be considered. I currently hardcode CONFIG_TRANSPARENT_HUGEPAGE to the
> equivalent of "always" and bypass some checks for anonymous pages by simply
> #ifdefing the code out; obviously I would need to determine the right thing to
> do in those cases.
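
As a sanity check on the LOAD header quoted above, the arithmetic for how
much of that segment can be covered by 2 MiB mappings can be written out as a
small user-space C sketch. The struct and function names are hypothetical,
the page sizes assume x86-64, and the segment start is assumed to be
4 KiB aligned.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((uint64_t)4096)             /* base page (x86-64 assumption) */
#define SKETCH_PMD_SIZE  ((uint64_t)2 * 1024 * 1024)  /* PMD mapping (x86-64 assumption) */

struct coverage {
    uint64_t huge_pages;   /* whole PMD-sized, PMD-aligned blocks in the segment */
    uint64_t huge_kib;     /* the same amount expressed in kB */
    uint64_t small_pages;  /* base pages needed for whatever is left over */
};

/* Split [vaddr, vaddr + memsz) into PMD-mappable blocks and a remainder. */
static struct coverage segment_coverage(uint64_t vaddr, uint64_t memsz)
{
    struct coverage c = { 0, 0, 0 };

    assert(vaddr % SKETCH_PAGE_SIZE == 0);  /* sketch assumes a page-aligned start */

    uint64_t end   = vaddr + memsz;
    uint64_t first = (vaddr + SKETCH_PMD_SIZE - 1) & ~(SKETCH_PMD_SIZE - 1); /* round up */
    uint64_t last  = end & ~(SKETCH_PMD_SIZE - 1);                           /* round down */

    if (last > first)
        c.huge_pages = (last - first) / SKETCH_PMD_SIZE;
    c.huge_kib = c.huge_pages * (SKETCH_PMD_SIZE / 1024);

    /* Everything not covered by a PMD block falls back to base pages. */
    uint64_t rest = memsz - c.huge_pages * SKETCH_PMD_SIZE;
    c.small_pages = (rest + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    return c;
}
```

For vaddr 0x400000 and size 0x1ccc19f0 this gives 230 blocks of 2 MiB, i.e.
471040 kB, plus 194 base pages for the tail. The 471040 kB figure agrees
with the ShmemPmdMapped value reported further down in the quoted mail.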
>
> Current comparisons of 4K vs 2M pages as generated by "perf stat -d -d -d -r10"
> follow; the 4K pagesize program was named "foo" and the 2M pagesize program
> "testr" (as noted above) - please note that these numbers do vary from run to
> run, but the orders of magnitude of the differences between the two versions
> remain relatively constant:
>
> 4K Pages:
> =========
> Performance counter stats for './foo' (10 runs):
>
>      307054.450421  task-clock:u (msec)         # 1.000 CPUs utilized              ( +- 0.21% )
>                  0  context-switches:u          # 0.000 K/sec
>                  0  cpu-migrations:u            # 0.000 K/sec
>              7,728  page-faults:u               # 0.025 K/sec                      ( +- 0.00% )
>  1,401,295,823,265  cycles:u                    # 4.564 GHz                        ( +- 0.19% ) (30.77%)
>    562,704,668,718  instructions:u              # 0.40 insn per cycle              ( +- 0.00% ) (38.46%)
>     20,100,243,102  branches:u                  # 65.461 M/sec                     ( +- 0.00% ) (38.46%)
>          2,628,944  branch-misses:u             # 0.01% of all branches            ( +- 3.32% ) (38.46%)
>    180,885,880,185  L1-dcache-loads:u           # 589.100 M/sec                    ( +- 0.00% ) (38.46%)
>     40,374,420,279  L1-dcache-load-misses:u     # 22.32% of all L1-dcache hits     ( +- 0.01% ) (38.46%)
>        232,184,583  LLC-loads:u                 # 0.756 M/sec                      ( +- 1.48% ) (30.77%)
>         23,990,082  LLC-load-misses:u           # 10.33% of all LL-cache hits      ( +- 1.48% ) (30.77%)
>    <not supported>  L1-icache-loads:u
>     74,897,499,234  L1-icache-load-misses:u                                        ( +- 0.00% ) (30.77%)
>    180,990,026,447  dTLB-loads:u                # 589.440 M/sec                    ( +- 0.00% ) (30.77%)
>            707,373  dTLB-load-misses:u          # 0.00% of all dTLB cache hits     ( +- 4.62% ) (30.77%)
>          5,583,675  iTLB-loads:u                # 0.018 M/sec                      ( +- 0.31% ) (30.77%)
>      1,219,514,499  iTLB-load-misses:u          # 21840.71% of all iTLB cache hits ( +- 0.01% ) (30.77%)
>    <not supported>  L1-dcache-prefetches:u
>    <not supported>  L1-dcache-prefetch-misses:u
>
>      307.093088771 seconds time elapsed                                            ( +- 0.20% )
>
> 2M Pages:
> =========
> Performance counter stats for './testr' (10 runs):
>
>      289504.209769  task-clock:u (msec)         # 1.000 CPUs utilized              ( +- 0.19% )
>                  0  context-switches:u          # 0.000 K/sec
>                  0  cpu-migrations:u            # 0.000 K/sec
>                598  page-faults:u               # 0.002 K/sec                      ( +- 0.03% )
>  1,323,835,488,984  cycles:u                    # 4.573 GHz                        ( +- 0.19% ) (30.77%)
>    562,658,682,055  instructions:u              # 0.43 insn per cycle              ( +- 0.00% ) (38.46%)
>     20,099,662,528  branches:u                  # 69.428 M/sec                     ( +- 0.00% ) (38.46%)
>          2,877,086  branch-misses:u             # 0.01% of all branches            ( +- 4.52% ) (38.46%)
>    180,899,297,017  L1-dcache-loads:u           # 624.859 M/sec                    ( +- 0.00% ) (38.46%)
>     40,209,140,089  L1-dcache-load-misses:u     # 22.23% of all L1-dcache hits     ( +- 0.00% ) (38.46%)
>        135,968,232  LLC-loads:u                 # 0.470 M/sec                      ( +- 1.56% ) (30.77%)
>          6,704,890  LLC-load-misses:u           # 4.93% of all LL-cache hits       ( +- 1.92% ) (30.77%)
>    <not supported>  L1-icache-loads:u
>     74,955,673,747  L1-icache-load-misses:u                                        ( +- 0.00% ) (30.77%)
>    180,987,794,366  dTLB-loads:u                # 625.165 M/sec                    ( +- 0.00% ) (30.77%)
>                835  dTLB-load-misses:u          # 0.00% of all dTLB cache hits     ( +- 14.35% ) (30.77%)
>          6,386,207  iTLB-loads:u                # 0.022 M/sec                      ( +- 0.42% ) (30.77%)
>         51,929,869  iTLB-load-misses:u          # 813.16% of all iTLB cache hits   ( +- 1.61% ) (30.77%)
>    <not supported>  L1-dcache-prefetches:u
>    <not supported>  L1-dcache-prefetch-misses:u
>
>      289.551551387 seconds time elapsed                                            ( +- 0.20% )
>
> A check of /proc/meminfo with the test program running shows the large mappings:
>
> ShmemPmdMapped:   471040 kB
>
> FAQ:
> ====
> Q: What kernel is the prototype based on?
> A: 4.14.0-rc7
>
> Q: What is the biggest issue you haven't addressed?
> A: Given this is a prototype, there are many. Aside from the fact that I
>    only map large pages for an executable of a specific name ("testr"), the
>    code must be integrated with large page size support in the page cache
>    as currently multiple iterations of an executable would each use their
>    own individually allocated THP pages and those pages filled with data
>    using kernel_read(), which allows for performance characterization but
>    would never be acceptable for a production kernel.
>
>    A good example of the large page support required is the ext4 support
>    outlined in:
>
>    https://www.mail-archive.com/linux-block@xxxxxxxxxxxxxxx/msg04012.html
>
>    There also need to be configuration options to enable this code at all,
>    likely only for file systems that support large pages, and more
>    reasonable fixes for the assumptions that all large THP pages are
>    anonymous assertions in rmap.c (for the prototype I just "#if 0" them out.)
>
> Q: Which processes get their text as large pages?
> A: At this point with this implementation it's any process with a read-only
>    text area of the proper size/alignment.
>
>    An attempt is made to align the address for non-MAP_FIXED addresses.
>
>    I do not make any attempt to move mappings that take up a majority of a
>    large page to a large page; I only map a large page if the address
>    aligns and the map size is larger than or equal to a large page.
>
> Q: Which architectures has this been tested on?
> A: At present, only x64.
>
> Q: How about architectures (ARM, for instance) with multiple large page
>    sizes that are reasonable for text mappings?
> A: At present a "large page" is just PMD size; it would be possible with
>    additional effort to allow for mapping using PUD-sized pages.
>
> Q: What about the use of non-PMD large page sizes (on non-x86 architectures)?
> A: I haven't looked into that; I don't have an answer as to how to best
>    map a page that wasn't sized to be a PMD or PUD.
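
The "proper size/alignment" rule from the FAQ answer above corresponds to a
condition in the quoted do_mmap() hunk further down. A minimal user-space
sketch of that predicate follows, for illustration only: the function name
and SKETCH_* constants are hypothetical, the 2 MiB PMD size assumes x86-64,
and only the standard PROT_* / MAP_* constants from <sys/mman.h> are real.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/mman.h>

#define SKETCH_PMD_SIZE   ((uint64_t)2 * 1024 * 1024)  /* x86-64 assumption */
#define SKETCH_PMD_OFFSET (SKETCH_PMD_SIZE - 1)

static_assert((SKETCH_PMD_SIZE & SKETCH_PMD_OFFSET) == 0,
              "PMD size must be a power of two");

/*
 * Would this mmap() request be steered towards a PMD-aligned address?
 * Mirrors the condition in the quoted patch: readable and executable but
 * not writable, private, at least one PMD in length, and either not
 * MAP_FIXED or already PMD-aligned.
 */
static bool wants_pmd_aligned_address(uint64_t addr, uint64_t len,
                                      int prot, int flags)
{
    bool ro_exec   = (prot & PROT_READ) && (prot & PROT_EXEC) &&
                     !(prot & PROT_WRITE);
    bool big       = len >= SKETCH_PMD_SIZE;
    bool is_private = (flags & MAP_PRIVATE) != 0;
    bool placeable = !(flags & MAP_FIXED) || !(addr & SKETCH_PMD_OFFSET);

    return ro_exec && big && is_private && placeable;
}
```

Note that nothing in this predicate looks at who else maps the same file,
which is where the concern about writable mappings of the same pages comes
in.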
>
> Signed-off-by: William Kucharski <william.kucharski@xxxxxxxxxx>
>
> ===============================================================
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index ed113ea..f4ac381 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -146,8 +146,8 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>  	if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
>  		return -EINVAL;
>
> -	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
> -	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> +	vma_len = (loff_t)(vma->vm_end - vma->vm_start); /* length of VMA */
> +	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); /* add vma->vm_pgoff * PAGESIZE */
>  	/* check for overflow */
>  	if (len < vma_len)
>  		return -EINVAL;
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 87067d2..353bec8 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -80,13 +80,15 @@ extern struct kobj_attribute shmem_enabled_attr;
>  #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define HPAGE_PMD_SHIFT PMD_SHIFT
> -#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
> -#define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
> -
> -#define HPAGE_PUD_SHIFT PUD_SHIFT
> -#define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
> -#define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
> +#define HPAGE_PMD_SHIFT PMD_SHIFT
> +#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
> +#define HPAGE_PMD_OFFSET (HPAGE_PMD_SIZE - 1)
> +#define HPAGE_PMD_MASK (~(HPAGE_PMD_OFFSET))
> +
> +#define HPAGE_PUD_SHIFT PUD_SHIFT
> +#define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
> +#define HPAGE_PUD_OFFSET (HPAGE_PUD_SIZE - 1)
> +#define HPAGE_PUD_MASK (~(HPAGE_PUD_OFFSET))
>
>  extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1981ed6..7b61c92 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -445,6 +445,14 @@ subsys_initcall(hugepage_init);
>
>  static int __init setup_transparent_hugepage(char *str)
>  {
> +#if 1
> +	set_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +		&transparent_hugepage_flags);
> +	clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +		&transparent_hugepage_flags);
> +	printk("THP permanently set ON\n");
> +	return 1;
> +#else
>  	int ret = 0;
>  	if (!str)
>  		goto out;
> @@ -471,6 +479,7 @@ static int __init setup_transparent_hugepage(char *str)
>  	if (!ret)
>  		pr_warn("transparent_hugepage= cannot parse, ignored\n");
>  	return ret;
> +#endif
>  }
>  __setup("transparent_hugepage=", setup_transparent_hugepage);
>
> @@ -532,8 +541,11 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>
>  	if (addr)
>  		goto out;
> +
> +#if 0
>  	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
>  		goto out;
> +#endif
>
>  	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
>  	if (addr)
> diff --git a/mm/memory.c b/mm/memory.c
> index a728bed..fc352d8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3506,7 +3506,99 @@ late_initcall(fault_around_debugfs);
>   * fault_around_pages() value (and therefore to page order). This way it's
>   * easier to guarantee that we don't cross page table boundaries.
>   */
> -static int do_fault_around(struct vm_fault *vmf)
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static
> +int do_fault_around_thp(struct vm_fault *vmf)
> +{
> +	struct file *file = vmf->vma->vm_file;
> +	unsigned long address = vmf->address;
> +	pgoff_t start_pgoff = vmf->pgoff;
> +	pgoff_t end_pgoff;
> +	int ret = VM_FAULT_FALLBACK;
> +	int off;
> +
> +	/*
> +	 * vmf->address will be the higher of (fault address & HPAGE_PMD_MASK)
> +	 * or the start of the VMA.
> +	 */
> +	vmf->address = max((address & HPAGE_PMD_MASK), vmf->vma->vm_start);
> +
> +	/*
> +	 * Not a candidate if the start address calculated above isnt properly
> +	 * aligned
> +	 */
> +	if (vmf->address & HPAGE_PMD_OFFSET)
> +		goto dfa_thp_out;
> +
> +	off = ((address - vmf->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
> +	start_pgoff -= off;
> +
> +	/*
> +	 * end_pgoff is either end of page table or end of vma
> +	 * or fault_around_pages() from start_pgoff, depending what is
> +	 * smallest.
> +	 */
> +	end_pgoff = start_pgoff -
> +		((vmf->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
> +		PTRS_PER_PTE - 1;
> +	end_pgoff = min3(end_pgoff, vma_pages(vmf->vma) + vmf->vma->vm_pgoff - 1,
> +			start_pgoff + PTRS_PER_PTE - 1);
> +
> +	/*
> +	 * Check to see if we could map this request with a large THP page
> +	 * instead.
> +	 */
> +	if (((strncmp(file->f_path.dentry->d_name.name, "testr", 5) == 0)) &&
> +		pmd_none(*vmf->pmd) &&
> +		((end_pgoff - start_pgoff) >=
> +		((HPAGE_PMD_SIZE >> PAGE_SHIFT) - 1))) {
> +		struct page *page;
> +
> +		page = alloc_pages_vma(vmf->gfp_mask | __GFP_COMP |
> +			__GFP_NORETRY, HPAGE_PMD_ORDER, vmf->vma,
> +			vmf->address, numa_node_id(), 1);
> +
> +		if ((likely(page)) && (PageTransCompound(page))) {
> +			ssize_t bytes_read;
> +			void *pg_vaddr;
> +
> +			prep_transhuge_page(page);
> +			pg_vaddr = page_address(page);
> +
> +			if (likely(pg_vaddr)) {
> +				loff_t loff = (loff_t)
> +					(start_pgoff << PAGE_SHIFT);
> +				bytes_read = kernel_read(file, pg_vaddr,
> +					HPAGE_PMD_SIZE, &loff);
> +				VM_BUG_ON(bytes_read != HPAGE_PMD_SIZE);
> +
> +				smp_wmb(); /* See comment in __pte_alloc() */
> +				ret = alloc_set_pte(vmf, NULL, page);
> +
> +				if (likely(ret == 0)) {
> +					VM_BUG_ON_PAGE(pmd_none(*vmf->pmd),
> +						page);
> +					vmf->page = page;
> +					ret = VM_FAULT_NOPAGE;
> +					goto dfa_thp_out;
> +				}
> +			}
> +
> +			put_page(page);
> +		}
> +	}
> +
> +dfa_thp_out:
> +	vmf->address = address;
> +	VM_BUG_ON(vmf->pte != NULL);
> +	return ret;
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +
> +static
> +int do_fault_around(struct vm_fault *vmf)
>  {
>  	unsigned long address = vmf->address, nr_pages, mask;
>  	pgoff_t start_pgoff = vmf->pgoff;
> @@ -3566,6 +3658,21 @@ static int do_read_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	int ret = 0;
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	/*
> +	 * Check to see if we could map this request with a large THP page
> +	 * instead.
> +	 */
> +	if ((vma_pages(vmf->vma) >= PTRS_PER_PMD) &&
> +		((strncmp(vmf->vma->vm_file->f_path.dentry->d_name.name,
> +		"testr", 5)) == 0)) {
> +		ret = do_fault_around_thp(vmf);
> +
> +		if (ret == VM_FAULT_NOPAGE)
> +			return ret;
> +	}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  	/*
>  	 * Let's call ->map_pages() first and use ->fault() as fallback
>  	 * if page by the offset is not ready to be mapped (cold cache or
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 680506f..1c281d7 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1327,6 +1327,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  	struct mm_struct *mm = current->mm;
>  	int pkey = 0;
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	unsigned long thp_maywrite = VM_MAYWRITE;
> +#endif
> +
>  	*populate = 0;
>
>  	if (!len)
> @@ -1361,7 +1365,32 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  	/* Obtain the address to map to. we verify (or select) it and ensure
>  	 * that it represents a valid section of the address space.
>  	 */
> -	addr = get_unmapped_area(file, addr, len, pgoff, flags);
> +
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	/*
> +	 *
> +	 * If THP is enabled, and it's a read-only executable that is
> +	 * MAP_PRIVATE mapped, call the appropriate thp function to perhaps get a
> +	 * large page aligned virtual address, otherwise use the normal routine.
> +	 *
> +	 * Note the THP routine will return a normal page size aligned start
> +	 * address in some cases.
> +	 */
> +	if ((prot & PROT_READ) && (prot & PROT_EXEC) && (!(prot & PROT_WRITE)) &&
> +		(len >= HPAGE_PMD_SIZE) && (flags & MAP_PRIVATE) &&
> +		((!(flags & MAP_FIXED)) || (!(addr & HPAGE_PMD_OFFSET)))) {
> +		addr = thp_get_unmapped_area(file, addr, len, pgoff,
> +			flags);
> +		if (addr && (!(addr & HPAGE_PMD_OFFSET)))
> +			thp_maywrite = 0;
> +	} else {
> +#endif
> +		addr = get_unmapped_area(file, addr, len, pgoff, flags);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	}
> +#endif
> +
>  	if (offset_in_page(addr))
>  		return addr;
>
> @@ -1376,7 +1405,11 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  	 * of the memory object, so we don't do any here.
>  	 */
>  	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +			mm->def_flags | VM_MAYREAD | thp_maywrite | VM_MAYEXEC;
> +#else
>  			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
> +#endif
>
>  	if (flags & MAP_LOCKED)
>  		if (!can_do_mlock())
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b874c47..4fc24f8 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1184,7 +1184,9 @@ void page_add_file_rmap(struct page *page, bool compound)
>  		}
>  		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
>  			goto out;
> +#if 0
>  		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> +#endif
>  		__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
>  	} else {
>  		if (PageTransCompound(page) && page_mapping(page)) {
> @@ -1224,7 +1226,9 @@ static void page_remove_file_rmap(struct page *page, bool compound)
>  		}
>  		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
>  			goto out;
> +#if 0
>  		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> +#endif
>  		__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
>  	} else {
>  		if (!atomic_add_negative(-1, &page->_mapcount))

-- 
Michal Hocko
SUSE Labs