[I started writing a reply before I saw what Matthew said, so I decided
to finish going through this patch...]

On Fri, Aug 14, 2020 at 11:33 PM Peter Collingbourne <pcc@xxxxxxxxxx> wrote:
> Introduce a new syscall, refpage_create, which returns a file
> descriptor which may be mapped using mmap. Such a mapping is similar
> to an anonymous mapping, but instead of clean pages being backed by
> the zero page, they are instead backed by a so-called reference page,
> whose contents are specified using an argument to refpage_create.
> Loads from the mapping will load directly from the reference page,
> and initial stores to the mapping will copy-on-write from the
> reference page.
[...]
> - Pattern initialization for the heap. This is where malloc(3) gives
>   you memory whose contents are filled with a non-zero pattern byte,
>   in order to help detect and mitigate bugs involving use of
>   uninitialized memory. Typically this is implemented by having the
>   allocator memset the allocation with the pattern byte before
>   returning it to the user, but for large allocations this can result
>   in a significant increase in RSS, especially for allocations that
>   are used sparsely. Even for dense allocations there is a needless
>   impact on startup performance when it may be better to amortize it
>   throughout the program. By creating allocations using a reference
>   page filled with the pattern byte, we can avoid these costs.
>
> - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
>   feature which allows for memory to be tagged in order to detect
>   certain kinds of memory errors with low overhead. In order to set
>   up an allocation to allow memory errors to be detected, the entire
>   allocation needs to have the same tag. The issue here is similar to
>   pattern initialization in the sense that large tagged allocations
>   will be expensive if the tagging is done up front.
> The idea is that
> the allocator would create reference pages with each of the possible
> memory tags, and use those reference pages for the large allocations.

This means that you'll end up with one VMA per large heap object,
instead of being able to put them all into one big VMA, right?

> In order to measure the performance and RSS impact of reference pages,
> a version of this patch backported to kernel version 4.14 was tested
> on a Pixel 4 together with a modified [2] version of the Scudo
> allocator that uses reference pages to implement pattern
> initialization. A PDFium test program was used to collect the
> measurements like so:
>
> $ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
> $ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf
>
> and the median of 100 runs was taken with three variants of the
> allocator:
>
> - "anon" is the baseline (no pattern init)
> - "memset" is with pattern init of allocator pages implemented by
>   initializing anonymous pages with memset

For the memory tagging use case, this would use something like the
STZ2G instruction, which is specialized for zeroing and re-tagging
memory at high speed, right? Would STZ2G be expected to be faster than
a current memset() implementation? I don't know much about how the
hardware for this stuff works, but I'm guessing that STZ2G _miiiiight_
be optimized to reduce the amount of data transmitted over the memory
bus, or something like that?

Also, for that memset() test, did you do it on a fresh VMA (meaning
the memset() will constantly take page faults, AFAIK), or on memory
that had been written to before (which should, AFAIK, be a bit
faster)?

> - "refpage" is with pattern init of allocator pages implemented by
>   creating reference pages

[...]

> As an alternative to introducing this syscall, I considered using
> userfaultfd to implement reference pages.
> However, after having taken
> a detailed look at the interface, it does not seem suitable to be
> used in the context of a general purpose allocator. For example,
> UFFD_FEATURE_FORK support would be required in order to correctly
> support fork(2) in a process that uses the allocator (although POSIX
> does not guarantee support for allocating after fork, many allocators
> including Scudo support it, and nothing stops the forked process from
> page faulting pre-existing allocations after forking anyway), but
> UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
> ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> making it unsuitable for use in an allocator.

That part should be fairly easy to fix by hooking an ioctl command up
to the ->read handler.

[...]

> @@ -3347,11 +3348,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	if (unlikely(pmd_trans_unstable(vmf->pmd)))
>  		return 0;
>
> -	/* Use the zero-page for reads */
> +	/* Use the zero-page, or reference page if set, for reads */
>  	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
>  			!mm_forbids_zeropage(vma->vm_mm)) {
> -		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
> -						vma->vm_page_prot));
> +		unsigned long pfn;
> +
> +		if (unlikely(refpage))
> +			pfn = page_to_pfn(refpage);
> +		else
> +			pfn = my_zero_pfn(vmf->address);
> +		entry = pte_mkspecial(pfn_pte(pfn, vma->vm_page_prot));

If someone maps this thing with MAP_SHARED and PROT_READ|PROT_WRITE,
will this create a writable special PTE, or am I missing something?

[...]
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 5053439be6ab..6e9246d09e95 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2841,8 +2841,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>  	pmd_t *pmdp;
>  	pte_t *ptep;
>
> -	/* Only allow populating anonymous memory */
> -	if (!vma_is_anonymous(vma))
> +	/* Only allow populating anonymous memory without a reference page */
> +	if (!vma_is_anonymous(vma) || vma->vm_private_data)
>  		goto abort;
>
>  	pgdp = pgd_offset(mm, addr);
> diff --git a/mm/refpage.c b/mm/refpage.c

[...]

> +static int refpage_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	vma_set_anonymous(vma);

I wonder whether it would make more sense to have your own
vm_operations_struct and handle faults through its hooks instead of
messing around in the generic code for this.

> +	vma->vm_private_data = vma->vm_file->private_data;
> +	return 0;
> +}

[...]

> +SYSCALL_DEFINE2(refpage_create, const void __user *, content,
> +		unsigned long, flags)
> +{
> +	unsigned long content_addr = (unsigned long)content;
> +	struct page *userpage, *refpage;
> +	int fd;
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	refpage = alloc_page(GFP_KERNEL);

GFP_USER, maybe?

> +	if (!refpage)
> +		return -ENOMEM;
> +	if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
> +	    get_user_pages(content_addr, 1, 0, &userpage, 0) != 1) {
> +		put_page(refpage);
> +		return -EFAULT;
> +	}
> +
> +	copy_highpage(refpage, userpage);
> +	put_page(userpage);

Why not this instead?

	if (copy_from_user(page_address(refpage), content, PAGE_SIZE))
		goto out_put_page;

If that is because copy_highpage() is going to include some magic
memory-tag-copying thing or so, this needs a comment.

> +	fd = anon_inode_getfd("[refpage]", &refpage_file_operations, refpage,
> +			      O_RDONLY | O_CLOEXEC);
> +	if (fd < 0)
> +		put_page(refpage);
> +
> +	return fd;
> +}