Re: [PATCH v3] mm: introduce reference pages

John Hubbard <jhubbard@xxxxxxxxxx> · Mon, 17 Aug 2020 19:31:39 -0700

On 8/14/20 2:33 PM, Peter Collingbourne wrote:
Introduce a new syscall, refpage_create, which returns a file
descriptor which may be mapped using mmap. Such a mapping is similar

Hi,

For new syscalls, I think we need to put linux-api on CC, at the very
least. Adding them now. This would likely need man page support as well.
I'll put linux-doc on Cc, too.

to an anonymous mapping, but instead of clean pages being backed by the
zero page, they are instead backed by a so-called reference page, whose
contents are specified using an argument to refpage_create. Loads from
the mapping will load directly from the reference page, and initial
stores to the mapping will copy-on-write from the reference page.

Reference pages are useful in circumstances where anonymous mappings
combined with manual stores to memory would impose undesirable costs,
either in terms of performance or RSS. Use cases are focused on heap
allocators and include:

- Pattern initialization for the heap. This is where malloc(3) gives
   you memory whose contents are filled with a non-zero pattern
   byte, in order to help detect and mitigate bugs involving use
   of uninitialized memory. Typically this is implemented by having
   the allocator memset the allocation with the pattern byte before
   returning it to the user, but for large allocations this can result
   in a significant increase in RSS, especially for allocations that
   are used sparsely. Even for dense allocations there is a needless
   impact to startup performance when it may be better to amortize it
   throughout the program. By creating allocations using a reference
   page filled with the pattern byte, we can avoid these costs.

- Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
   feature which allows for memory to be tagged in order to detect
   certain kinds of memory errors with low overhead. In order to set
   up an allocation to allow memory errors to be detected, the entire
   allocation needs to have the same tag. The issue here is similar to
   pattern initialization in the sense that large tagged allocations
   will be expensive if the tagging is done up front. The idea is that
   the allocator would create reference pages with each of the possible
   memory tags, and use those reference pages for the large allocations.

That is good information, and it belongs in a man page, and/or Documentation/.


In order to measure the performance and RSS impact of reference pages,
a version of this patch backported to kernel version 4.14 was tested on
a Pixel 4 together with a modified [2] version of the Scudo allocator
that uses reference pages to implement pattern initialization. A
PDFium test program was used to collect the measurements like so:

$ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
$ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf

and the median of 100 runs measurement was taken with three variants
of the allocator:

- "anon" is the baseline (no pattern init)
- "memset" is with pattern init of allocator pages implemented by
   initializing anonymous pages with memset
- "refpage" is with pattern init of allocator pages implemented
   by creating reference pages

All three variants are measured using the patch that I linked. "anon"
is without the patch, "refpage" is with the patch and "memset" is
with a previous version of the patch [3] with "#if 0" in place of
"#if 1" in linux.cpp. The measurements are as follows:

           Real time (s)    Max RSS (KiB)
anon        2.237081         107088
memset      2.252241         112180
refpage     2.243786         107128

We can see that RSS for refpage is almost the same as anon, and real
time overhead is 44% that of memset.


Are some of the numbers stale, maybe? Try as I might, I cannot combine
anything above to come up with 44%. :)


As an alternative to introducing this syscall, I considered using
userfaultfd to implement reference pages. However, after having taken
a detailed look at the interface, it does not seem suitable to be
used in the context of a general purpose allocator. For example,
UFFD_FEATURE_FORK support would be required in order to correctly
support fork(2) in a process that uses the allocator (although POSIX
does not guarantee support for allocating after fork, many allocators
including Scudo support it, and nothing stops the forked process from
page faulting pre-existing allocations after forking anyway), but
UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
making it unsuitable for use in an allocator. Furthermore, even if
the interface issues are resolved, I suspect (but have not measured)
that the cost of the multiple context switches between kernel and
userspace would be too high to be used in an allocator anyway.


That whole blurb is good for a cover letter, and perhaps an "alternatives
considered" section in Documentation/. However, it should be omitted from
the patch commit description, IMHO.

...

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 467302056e17..a1dc07ff914a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -175,6 +175,13 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
  
  	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
  		return false;
+
+	/*
+	 * Transparent hugepages not currently supported for anonymous VMAs with
+	 * reference pages
+	 */
+	if (unlikely(vma->vm_private_data))


This should use a helper function, such as is_reference_page_vma(). Because the
assumption that "vma->vm_private_data means a reference page vma" is much too
fragile. More below.


+		return false;
  	return true;
  }
  
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e7602a3bcef1..ac375e398690 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3122,5 +3122,15 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
  
  extern int sysctl_nr_trim_pages;
  
+static inline int is_zero_or_refpage_pfn(struct vm_area_struct *vma,
+					 unsigned long pfn)
+{
+	if (is_zero_pfn(pfn))
+		return true;
+	if (unlikely(!vma->vm_ops && vma->vm_private_data))
+		return pfn == page_to_pfn((struct page *)vma->vm_private_data);

As foreshadowed above, this needs a helper function. And the criteria for
deciding that it's a reference page needs to be more robust than just "no vm_ops,
vm_private_data is set, and it matches my page". Needs some more decisive
information.

Maybe setting vm_ops to some new "refpage" ops would be the way to go, for that.

...
diff --git a/mm/migrate.c b/mm/migrate.c
index 5053439be6ab..6e9246d09e95 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2841,8 +2841,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
  	pmd_t *pmdp;
  	pte_t *ptep;
  
-	/* Only allow populating anonymous memory */
-	if (!vma_is_anonymous(vma))
+	/* Only allow populating anonymous memory without a reference page */
+	if (!vma_is_anonymous(vma) || vma->private_data)

Same thing here: helper function, instead of open-coding the assumption about
what makes a refpage vma.

...

+SYSCALL_DEFINE2(refpage_create, const void *__user, content, unsigned long,
+		flags)
+{
+	unsigned long content_addr = (unsigned long)content;
+	struct page *userpage, *refpage;
+	int fd;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	refpage = alloc_page(GFP_KERNEL);
+	if (!refpage)
+		return -ENOMEM;
+
+	if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
+	    get_user_pages(content_addr, 1, 0, &userpage, 0) != 1) {
+		put_page(refpage);
+		return -EFAULT;
+	}
+
+	copy_highpage(refpage, userpage);
+	put_page(userpage);
+
+	fd = anon_inode_getfd("[refpage]", &refpage_file_operations, refpage,
+			      O_RDONLY | O_CLOEXEC);

Seems like the flags argument should have an influence on these flags, rather
than hard-coding O_CLOEXEC, right?


thanks,
--
John Hubbard
NVIDIA