On Mon, Aug 17, 2020 at 8:43 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> [I started writing a reply before I saw what Matthew said, so I
> decided to finish going through this patch...]
>
> On Fri, Aug 14, 2020 at 11:33 PM Peter Collingbourne <pcc@xxxxxxxxxx> wrote:
> > Introduce a new syscall, refpage_create, which returns a file
> > descriptor which may be mapped using mmap. Such a mapping is similar
> > to an anonymous mapping, but instead of clean pages being backed by the
> > zero page, they are instead backed by a so-called reference page, whose
> > contents are specified using an argument to refpage_create. Loads from
> > the mapping will load directly from the reference page, and initial
> > stores to the mapping will copy-on-write from the reference page.
> [...]
> > - Pattern initialization for the heap. This is where malloc(3) gives
> >   you memory whose contents are filled with a non-zero pattern
> >   byte, in order to help detect and mitigate bugs involving use
> >   of uninitialized memory. Typically this is implemented by having
> >   the allocator memset the allocation with the pattern byte before
> >   returning it to the user, but for large allocations this can result
> >   in a significant increase in RSS, especially for allocations that
> >   are used sparsely. Even for dense allocations there is a needless
> >   impact to startup performance when it may be better to amortize it
> >   throughout the program. By creating allocations using a reference
> >   page filled with the pattern byte, we can avoid these costs.
> >
> > - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
> >   feature which allows for memory to be tagged in order to detect
> >   certain kinds of memory errors with low overhead. In order to set
> >   up an allocation to allow memory errors to be detected, the entire
> >   allocation needs to have the same tag. The issue here is similar to
> >   pattern initialization in the sense that large tagged allocations
> >   will be expensive if the tagging is done up front. The idea is that
> >   the allocator would create reference pages with each of the possible
> >   memory tags, and use those reference pages for the large allocations.
>
> This means that you'll end up with one VMA per large heap object,
> instead of being able to put them all into one big VMA, right?

Yes, although in Scudo we create guard pages around each large
allocation in order to catch OOB accesses and these correspond to their
own VMAs, so we already unavoidably have 2-3 VMAs per allocation and a
switch to reference pages wouldn't change anything.
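For concreteness, here is a rough sketch (not the actual Scudo code) of
how an allocator could use the proposed syscall for pattern
initialization. It assumes 4 KiB pages, assumes the patched headers
provide __NR_refpage_create, and uses a made-up 0xaa pattern byte:

#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PATTERN_BYTE 0xaa /* made-up pattern byte */

/* One reference page, created lazily and reused for every large
 * allocation that wants pattern initialization. */
static int pattern_refpage_fd(void)
{
        static int fd = -1;

        if (fd == -1) {
                /* refpage_create wants a page-aligned source buffer */
                static unsigned char page[4096]
                        __attribute__((aligned(4096)));

                memset(page, PATTERN_BYTE, sizeof(page));
                fd = syscall(__NR_refpage_create, page, 0);
        }
        return fd;
}

/* Clean pages of the returned mapping read back as PATTERN_BYTE without
 * consuming memory; only pages that are actually written get copied
 * (CoW from the reference page) and start contributing to RSS. */
static void *pattern_alloc(size_t size)
{
        int fd = pattern_refpage_fd();
        void *p;

        if (fd < 0)
                return NULL;
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        return p == MAP_FAILED ? NULL : p;
}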
> > In order to measure the performance and RSS impact of reference pages,
> > a version of this patch backported to kernel version 4.14 was tested on
> > a Pixel 4 together with a modified [2] version of the Scudo allocator
> > that uses reference pages to implement pattern initialization. A
> > PDFium test program was used to collect the measurements like so:
> >
> > $ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
> > $ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf
> >
> > and the median of 100 runs measurement was taken with three variants
> > of the allocator:
> >
> > - "anon" is the baseline (no pattern init)
> > - "memset" is with pattern init of allocator pages implemented by
> >   initializing anonymous pages with memset
>
> For the memory tagging usecase, this would use something like the
> STZ2G instruction, which is specialized for zeroing and re-tagging
> memory at high speed, right? Would STZ2G be expected to be faster than
> a current memset() implementation? I don't know much about how the
> hardware for this stuff works, but I'm guessing that STZ2G _miiiiight_
> be optimized to reduce the amount of data transmitted over the memory
> bus, or something like that?

It's actually the DC GZVA instruction that is expected to be fastest
for this use case, since it operates on a cache line at a time. We've
switched to using that to clear PROT_MTE pages in the kernel. In v4 of
this patch, I have developed optimizations that check whether the
reference page is a uniformly tagged zero page and, if so, use DC GZVA
to reset it. The same technique is also used for pattern initialization
(i.e., a memset of the pattern byte if the page is uniform), which also
helps with performance.

> Also, for that memset() test, did you do that on a fresh VMA (meaning
> the memset() will constantly take page faults AFAIK), or did you do it
> on memory that had been written to before (which should AFAIK be a bit
> faster)?

I believe that it was using the default allocator settings, which for
Scudo imply aggressive use of munmap/MADV_DONTNEED and a fresh VMA for
any new allocation. This is the best tradeoff for memory usage but not
necessarily for performance; however, that doesn't mean we don't care
about performance in this case.

> > - "refpage" is with pattern init of allocator pages implemented
> >   by creating reference pages
> [...]
> > As an alternative to introducing this syscall, I considered using
> > userfaultfd to implement reference pages. However, after having taken
> > a detailed look at the interface, it does not seem suitable to be
> > used in the context of a general purpose allocator. For example,
> > UFFD_FEATURE_FORK support would be required in order to correctly
> > support fork(2) in a process that uses the allocator (although POSIX
> > does not guarantee support for allocating after fork, many allocators
> > including Scudo support it, and nothing stops the forked process from
> > page faulting pre-existing allocations after forking anyway), but
> > UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
> > ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> > making it unsuitable for use in an allocator.
>
> That part should be fairly easy to fix by hooking an ioctl command up
> to the ->read handler.
>
> [...]
> > @@ -3347,11 +3348,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >          if (unlikely(pmd_trans_unstable(vmf->pmd)))
> >                  return 0;
> >
> > -        /* Use the zero-page for reads */
> > +        /* Use the zero-page, or reference page if set, for reads */
> >          if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> >                          !mm_forbids_zeropage(vma->vm_mm)) {
> > -                entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
> > -                                                vma->vm_page_prot));
> > +                unsigned long pfn;
> > +
> > +                if (unlikely(refpage))
> > +                        pfn = page_to_pfn(refpage);
> > +                else
> > +                        pfn = my_zero_pfn(vmf->address);
> > +                entry = pte_mkspecial(pfn_pte(pfn, vma->vm_page_prot));
>
> If someone maps this thing with MAP_SHARED and PROT_READ|PROT_WRITE,
> will this create a writable special PTE, or am I missing something?

It looks like we will return early here:

        /* File mapping without ->vm_ops ? */
        if (vma->vm_flags & VM_SHARED)
                return VM_FAULT_SIGBUS;

This also seems like it would be necessary to avoid letting the zero
page be mapped read-write.
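As a quick sanity check (an untested sketch, again assuming the
hypothetical __NR_refpage_create from the sketch above), a MAP_SHARED
mapping of a reference page fd should therefore raise SIGBUS on its
first access rather than expose the reference page through a writable
mapping:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        /* page-aligned source buffer; its contents don't matter here */
        static unsigned char buf[4096] __attribute__((aligned(4096)));
        volatile unsigned char *p;
        int fd;

        fd = syscall(__NR_refpage_create, buf, 0);
        if (fd < 0) {
                perror("refpage_create");
                return 1;
        }

        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Expected: the first fault takes the VM_SHARED early return in
         * do_anonymous_page() and we get SIGBUS, rather than having a
         * mapping of the reference page installed. */
        return p[0];
}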
> [...]
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 5053439be6ab..6e9246d09e95 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2841,8 +2841,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> >          pmd_t *pmdp;
> >          pte_t *ptep;
> >
> > -        /* Only allow populating anonymous memory */
> > -        if (!vma_is_anonymous(vma))
> > +        /* Only allow populating anonymous memory without a reference page */
> > +        if (!vma_is_anonymous(vma) || vma->private_data)
> >                  goto abort;
> >
> >          pgdp = pgd_offset(mm, addr);
> > diff --git a/mm/refpage.c b/mm/refpage.c
> [...]
> > +static int refpage_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +        vma_set_anonymous(vma);
>
> I wonder whether it would make more sense to have your own
> vm_operations_struct and handle faults through its hooks instead of
> messing around in the generic code for this.

I considered it, but this wouldn't be compatible with the vm_ops
interface as currently exposed. As I mentioned in an earlier email:

> I considered having reference page mappings continue to provide a
> custom vm_ops, but this would require changes to the interface to
> preserve the specialness of the reference page. For example,
> vm_normal_page() would need to know to return null for the reference
> page in order to prevent _mapcount from overflowing, which could
> probably be done by adding a new interface to vm_ops, but that seemed
> more complicated than changing the anonymous page code path.

Of course, if we were to refactor anonymous mappings in the future so
that they are implemented via vm_ops, it seems like it would be more
appropriate to implement this in the same way.

> > +        vma->vm_private_data = vma->vm_file->private_data;
> > +        return 0;
> > +}
> [...]
> > +SYSCALL_DEFINE2(refpage_create, const void *__user, content, unsigned long,
> > +                flags)
> > +{
> > +        unsigned long content_addr = (unsigned long)content;
> > +        struct page *userpage, *refpage;
> > +        int fd;
> > +
> > +        if (flags != 0)
> > +                return -EINVAL;
> > +
> > +        refpage = alloc_page(GFP_KERNEL);
>
> GFP_USER, maybe?

I would say that the page we're allocating here is owned by the kernel,
even though it's directly accessible (read-only) to userspace. In this
regard, it's maybe similar to the zero page.

> > +        if (!refpage)
> > +                return -ENOMEM;
> > +
> > +        if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
> > +            get_user_pages(content_addr, 1, 0, &userpage, 0) != 1) {
> > +                put_page(refpage);
> > +                return -EFAULT;
> > +        }
> > +
> > +        copy_highpage(refpage, userpage);
> > +        put_page(userpage);
>
> Why not this instead?
>
> if (copy_from_user(page_address(refpage), content, PAGE_SIZE))
>         goto out_put_page;
>
> If that is because copy_highpage() is going to include some magic
> memory-tag-copying thing or so, this needs a comment.

Yes, with MTE we will need to zero the tags on the page so that it can
be used in PROT_MTE mappings even if the original page was not
PROT_MTE. In v4, I ended up making this an arch hook so that all of the
arch-specific stuff can go there.

Peter
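P.S. In case it helps to see the shape of it, below is a rough sketch
(not the actual v4 code) of what such an arch hook could look like: a
generic default that just does the copy, plus an arm64 override that
also gives the new page well-defined (zero) tags, e.g. via
mte_clear_page_tags() from the MTE series. The hook name is made up.

/* In a generic header (sketch): default to a plain page copy. */
#ifndef arch_refpage_copy
static inline void arch_refpage_copy(struct page *refpage,
                                     struct page *userpage)
{
        copy_highpage(refpage, userpage);
}
#endif

/* In an arm64 header (sketch), overriding the default: also clear the
 * MTE tags so that the reference page can later be used in PROT_MTE
 * mappings even if the source page was not PROT_MTE. */
#define arch_refpage_copy arch_refpage_copy
static inline void arch_refpage_copy(struct page *refpage,
                                     struct page *userpage)
{
        copy_highpage(refpage, userpage);
        mte_clear_page_tags(page_address(refpage));
}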