Introduce a new mmap flag, MAP_REFPAGE, that creates a mapping similar
to an anonymous mapping, but instead of clean pages being backed by the
zero page, they are instead backed by a so-called reference page, whose
address is specified using the offset argument to mmap. Loads from the
mapping will load directly from the reference page, and initial stores
to the mapping will copy-on-write from the reference page.

Reference pages are useful in circumstances where anonymous mappings
combined with manual stores to memory would impose undesirable costs,
either in terms of performance or RSS. Use cases are focused on heap
allocators and include:

- Pattern initialization for the heap. This is where malloc(3) gives
  you memory whose contents are filled with a non-zero pattern byte, in
  order to help detect and mitigate bugs involving use of uninitialized
  memory. Typically this is implemented by having the allocator memset
  the allocation with the pattern byte before returning it to the user,
  but for large allocations this can result in a significant increase
  in RSS, especially for allocations that are used sparsely. Even for
  dense allocations there is a needless impact to startup performance
  when it may be better to amortize it throughout the program. By
  creating allocations using a reference page filled with the pattern
  byte, we can avoid these costs.

- Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
  feature which allows for memory to be tagged in order to detect
  certain kinds of memory errors with low overhead. In order to set up
  an allocation to allow memory errors to be detected, the entire
  allocation needs to have the same tag. The issue here is similar to
  pattern initialization in the sense that large tagged allocations
  will be expensive if the tagging is done up front. The idea is that
  the allocator would create reference pages with each of the possible
  memory tags, and use those reference pages for the large allocations.

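As an illustration of the pattern initialization use case, the following
is a minimal userspace sketch, not code taken from Scudo or from this
patch. It assumes the MAP_REFPAGE value proposed here (0x200000) and
passes the reference page address through the offset argument as
described above. The helper names make_pattern_page() and map_pattern()
are hypothetical, and the fallback to an eagerly filled anonymous
mapping is an assumption added so that the sketch also runs on kernels
without this patch, where the request is expected to fail.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Value proposed by this patch; released kernel headers do not define it. */
#ifndef MAP_REFPAGE
#define MAP_REFPAGE 0x200000
#endif

/* Hypothetical helper: create one page filled with the pattern byte. */
static void *make_pattern_page(unsigned char pattern)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	void *page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (page == MAP_FAILED)
		return NULL;
	memset(page, pattern, page_size);
	return page;
}

/*
 * Hypothetical helper: map len bytes whose clean pages read as copies of
 * refpage. If the MAP_REFPAGE request fails (for example because the
 * kernel lacks this patch), fall back to an anonymous mapping that is
 * filled eagerly, which is the costly behaviour the patch aims to avoid.
 */
static void *map_pattern(void *refpage, size_t len)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p;
	size_t off, chunk;

	/* The reference page address travels in the page-aligned offset. */
	assert(((uintptr_t)refpage & (page_size - 1)) == 0);

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_REFPAGE, -1, (off_t)(uintptr_t)refpage);
	if (p != MAP_FAILED)
		return p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	for (off = 0; off < len; off += page_size) {
		chunk = len - off < page_size ? len - off : page_size;
		memcpy(p + off, refpage, chunk);
	}
	return p;
}
```

An allocator following this sketch would call make_pattern_page() once
per pattern byte and map_pattern() for each large allocation. With the
patch applied, a store to the returned mapping is expected to
copy-on-write from the reference page and leave the reference page
itself unchanged.
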
In order to measure the performance and RSS impact of reference pages,
a version of this patch backported to kernel version 4.14 was tested on
a Pixel 4 together with a modified [2] version of the Scudo allocator
that uses reference pages to implement pattern initialization. A PDFium
test program was used to collect the measurements like so:

$ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
$ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf

and the median of 100 runs was taken with three variants of the
allocator:

- "anon" is the baseline (no pattern init)
- "memset" is with pattern init of allocator pages implemented by
  initializing anonymous pages with memset
- "refpage" is with pattern init of allocator pages implemented by
  creating reference pages

All three variants are measured using the patch that I linked. "anon"
is without the patch, "refpage" is with the patch and "memset" is with
the patch with "#if 0" in place of "#if 1" in linux.cpp. The
measurements are as follows:

          Real time (s)    Max RSS (KiB)
anon           2.237081           107088
memset         2.252241           112180
refpage        2.251220           103504

We can see that real time for refpage is about the same or maybe
slightly faster than memset. At this point it is unclear where the
discrepancy in performance between anon and refpage comes from. The
Pixel 4 kernel has transparent hugepages disabled so that can't be it.

I wouldn't trust the RSS number for reference pages (with a test
program that uses an anonymous page as a reference page, I saw the
following output on dmesg:

  [75768.572560] BUG: Bad rss-counter state mm:00000000f1cdec59 idx:1 val:-2
  [75768.572577] BUG: Bad rss-counter state mm:00000000f1cdec59 idx:3 val:2

indicating that I might not have implemented RSS accounting for
reference pages correctly), but we see straight away an RSS impact of
5% for memset versus anon.
Assuming that accounting for anonymous pages has been implemented
correctly, we can expect the true RSS number for refpages to be similar
to that which I measured for anon.

As an alternative to extending mmap(2), I considered using userfaultfd
to implement reference pages. However, after having taken a detailed
look at the interface, it does not seem suitable to be used in the
context of a general purpose allocator. For example,
UFFD_FEATURE_EVENT_FORK support would be required in order to correctly
support fork(2) in a process that uses the allocator (although POSIX
does not guarantee support for allocating after fork, many allocators
including Scudo support it, and nothing stops the forked process from
page faulting pre-existing allocations after forking anyway), but
UFFD_FEATURE_EVENT_FORK has been restricted to processes with
CAP_SYS_PTRACE by commit 3c1c24d91ffd ("userfaultfd: require
CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"), making it unsuitable for
use in an allocator. Furthermore, even if the interface issues are
resolved, I suspect (but have not measured) that the cost of the
multiple context switches between kernel and userspace would be too
high to be used in an allocator anyway.

There are unresolved issues with this patch:

- We need to decide on the semantics associated with remapping or
  unmapping the reference page. As currently implemented, the page is
  looked up by address on each page fault, and a segfault ensues if the
  address is not mapped. It may be better to have the mmap(2) call take
  a reference to the page (failing if not mapped) and the underlying
  vma so that future remappings or unmappings have no effect.

- I have not yet looked at interaction with transparent hugepages.

- We probably need to restrict which kinds of pages are supported as
  reference pages (probably only anonymous and file-backed pages). This
  is somewhat tied to the remapping semantics as we would need to
  decide what happens if a supported page is replaced with an
  unsupported page.

- Finally, the accounting issues as previously mentioned.

However, I am sending this first version of the patch in order to get
early feedback on the idea and whether it is suitable to be added to
the kernel.

[1] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/enhancing-memory-safety
[2] https://github.com/pcc/llvm-project/commit/a05f88aaebc7daf262d6885444d9845052026f4b

Signed-off-by: Peter Collingbourne <pcc@xxxxxxxxxx>
---
 arch/mips/kernel/vdso.c                |  2 +-
 include/linux/mm.h                     |  2 +-
 include/uapi/asm-generic/mman-common.h |  1 +
 mm/mmap.c                              | 46 +++++++++++++++++++++++---
 4 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 242dc5e83847..403c00cc1ac3 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -101,7 +101,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	/* Map delay slot emulation page */
 	base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
 			   VM_READ | VM_EXEC |
-			   VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
+			   VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC, 0,
 			   0, NULL);
 	if (IS_ERR_VALUE(base)) {
 		ret = base;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 256e1bc83460..3b3efa2e3283 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2576,7 +2576,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-	struct list_head *uf);
+	unsigned long refpage, struct list_head *uf);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..f57552dcf99a 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -29,6 +29,7 @@
 #define MAP_HUGETLB		0x040000	/* create a huge page mapping */
 #define MAP_SYNC		0x080000 /* perform synchronous page faults for the mapping */
 #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_REFPAGE		0x200000	/* use the offset argument as a pointer to a reference page */
 
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
diff --git a/mm/mmap.c b/mm/mmap.c
index d43cc3b0187c..d74d0963d460 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -47,6 +47,7 @@
 #include <linux/pkeys.h>
 #include <linux/oom.h>
 #include <linux/sched/mm.h>
+#include <linux/compat.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1371,6 +1372,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	struct mm_struct *mm = current->mm;
 	vm_flags_t vm_flags;
 	int pkey = 0;
+	unsigned long refpage = 0;
 
 	*populate = 0;
 
@@ -1441,6 +1443,16 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	if (mlock_future_check(mm, vm_flags, len))
 		return -EAGAIN;
 
+	if (flags & MAP_REFPAGE) {
+		refpage = pgoff << PAGE_SHIFT;
+		if (in_compat_syscall()) {
+			/* The offset argument may have been sign extended at some
+			 * point, so we need to mask out the high bits.
+			 */
+			refpage &= 0xffffffff;
+		}
+	}
+
 	if (file) {
 		struct inode *inode = file_inode(file);
 		unsigned long flags_mask;
@@ -1541,8 +1553,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		if (file && is_file_hugepages(file))
 			vm_flags |= VM_NORESERVE;
 	}
-
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	addr = mmap_region(file, addr, len, vm_flags, pgoff, refpage, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1557,7 +1568,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 	struct file *file = NULL;
 	unsigned long retval;
 
-	if (!(flags & MAP_ANONYMOUS)) {
+	if (!(flags & (MAP_ANONYMOUS | MAP_REFPAGE))) {
 		audit_mmap_fd(fd, flags);
 		file = fget(fd);
 		if (!file)
@@ -1684,9 +1695,33 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }
 
+static vm_fault_t refpage_fault(struct vm_fault *vmf)
+{
+	struct page *page;
+
+	if (get_user_pages((unsigned long)vmf->vma->vm_private_data, 1, 0,
+			   &page, 0) != 1)
+		return VM_FAULT_SIGSEGV;
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static void refpage_close(struct vm_area_struct *vma)
+{
+	/* This function exists only to prevent is_mergeable_vma from allowing a
+	 * reference page mapping to be merged with an anonymous mapping.
+	 */
+}
+
+const struct vm_operations_struct refpage_vmops = {
+	.fault = refpage_fault,
+	.close = refpage_close,
+};
+
 unsigned long mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
+		unsigned long refpage, struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1788,6 +1823,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		error = shmem_zero_setup(vma);
 		if (error)
 			goto free_vma;
+	} else if (refpage) {
+		vma->vm_ops = &refpage_vmops;
+		vma->vm_private_data = (void *)refpage;
 	} else {
 		vma_set_anonymous(vma);
 	}
-- 
2.28.0.163.g6104cc2f0b6-goog