Re: [PATCH v5] mm: introduce reference pages

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Jul 16, 2021 at 07:57:57PM -0700, Peter Collingbourne wrote:
> Introduce a new syscall, refpage_create, which returns a file
> descriptor which may be mapped using mmap. Such a mapping is similar
> to an anonymous mapping, but instead of clean pages being backed by the
> zero page, they are instead backed by a so-called reference page, whose
> contents are specified using an argument to refpage_create. Loads from
> the mapping will load directly from the reference page, and initial
> stores to the mapping will copy-on-write from the reference page.

I think you mentioned in an earlier version, we can't use memfd_create()
as we want a single reference page to back arbitrarily sized mmap()
regions. Random API thought: could we instead add an MFD_MMAP_WRAP flag
or something to memfd_create() which pretends on mmap(MAP_PRIVATE) that
it's an infinite file and just replicates the data it already has in
that fd on read? On write, it would do a CoW.

Not sure it's any better than your proposal, just wondering whether we
can contain it to shmem_fault() mostly and maybe a new
shmem_map_pages(). Anyway, more API design comments below.

> Reference pages are useful in circumstances where anonymous mappings
> combined with manual stores to memory would impose undesirable costs,
> either in terms of performance or RSS. Use cases are focused on heap
> allocators and include:
> 
> - Pattern initialization for the heap. This is where malloc(3) gives
>   you memory whose contents are filled with a non-zero pattern
>   byte, in order to help detect and mitigate bugs involving use
>   of uninitialized memory. Typically this is implemented by having
>   the allocator memset the allocation with the pattern byte before
>   returning it to the user, but for large allocations this can result
>   in a significant increase in RSS, especially for allocations that
>   are used sparsely. Even for dense allocations there is a needless
>   impact to startup performance when it may be better to amortize it
>   throughout the program. By creating allocations using a reference
>   page filled with the pattern byte, we can avoid these costs.
> 
> - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
>   feature which allows for memory to be tagged in order to detect
>   certain kinds of memory errors with low overhead. In order to set
>   up an allocation to allow memory errors to be detected, the entire
>   allocation needs to have the same tag. The issue here is similar to
>   pattern initialization in the sense that large tagged allocations
>   will be expensive if the tagging is done up front. The idea is that
>   the allocator would create reference pages with each of the possible
>   memory tags, and use those reference pages for the large allocations.

So for MTE you'd need the same tag other than tag 0 (which is what
mmap() gives you). An alternative here is to allow passing a tag hint to
mmap() with the kernel allocating and maintaining a reference page (we
had some non-optimal internal attempts at this). But if the first case
(data pattern initialisation) is equally important, a reference pattern
may be a better option.

> This patch includes specific optimizations for these use cases in
> order to reduce memory traffic. If the reference page consists of a
> single repeating byte, the page is initialized using memset, and on
> arm64 if the reference page consists of a uniformly tagged zero page,
> the DC GZVA instruction is used to initialize the page.

I don't particularly like such heuristics. If the memset/single-tag is
your main use-case, just bake it into the API from the start. As David
said, how important is the read prior to a write of such page? I suspect
it's not relevant at all, so we could have the page allocated and
populated on fault (even a read fault). The pattern and its size could
be stored in the kernel as I suspect even if you need something other
than a single byte, it would still be a repetitive pattern.

Also please split this into multiple patches as it's hard to follow.
Something like (1) refpage_create, (2) basic arm64 tagging support, (3)
optimisation for specific cases etc.

> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index 36f51b0e438a..876997703d7b 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -450,3 +450,27 @@ int mte_ptrace_copy_tags(struct task_struct *child, long request,
>  
>  	return ret;
>  }
> +
> +u8 mte_check_tag_zero_page(struct page *userpage)
> +{
> +	char *userpage_addr = page_address(userpage);
> +	u8 tag;
> +	int i;
> +
> +	if (!test_bit(PG_mte_tagged, &userpage->flags))
> +		return 0;
> +
> +	tag = mte_get_mem_tag(userpage_addr) & 0xF;
> +	if (tag == 0)
> +		return 0;
> +
> +	for (i = 0; i != PAGE_SIZE; ++i)
> +		if (userpage_addr[i] != 0)
> +			return 0;
> +
> +	for (i = MTE_GRANULE_SIZE; i != PAGE_SIZE; i += MTE_GRANULE_SIZE)
> +		if ((mte_get_mem_tag(userpage_addr + i) & 0xF) != tag)
> +			return 0;
> +
> +	return tag;
> +}

What's this function for? I couldn't see a user in this patch (maybe I
didn't look harder).

> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 349c488765ca..73fe728787f0 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -25,6 +25,7 @@
>  #include <linux/perf_event.h>
>  #include <linux/preempt.h>
>  #include <linux/hugetlb.h>
> +#include <linux/mman.h>
>  
>  #include <asm/acpi.h>
>  #include <asm/bug.h>
> @@ -939,9 +940,52 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
>  	return alloc_page_vma(flags, vma, vaddr);
>  }
>  
> -void tag_clear_highpage(struct page *page)
> +void tag_set_highpage(struct page *page, unsigned long tag)
>  {
> -	mte_zero_clear_page_tags(page_address(page));
> +	unsigned long addr = (unsigned long)page_address(page);
> +
> +	addr &= ~MTE_TAG_MASK;
> +	addr |= tag << MTE_TAG_SHIFT;
> +	mte_zero_set_page_tags((void *)addr);
>  	page_kasan_tag_reset(page);
>  	set_bit(PG_mte_tagged, &page->flags);
>  }
> +
> +void arch_prep_refpage_private_data(struct refpage_private_data *priv,
> +				    char *content_kaddr, unsigned long size)
> +{
> +	if (system_supports_mte()) {
> +		u64 *content_kaddr_u64 = (u64 *)content_kaddr;
> +
> +		if (size == MTE_GRANULE_SIZE && content_kaddr_u64[0] == 0 &&
> +		    content_kaddr_u64[1] == 0) {
> +			priv->content_type = REFPAGE_CONTENT_TYPE_MTE_TAGGED;
> +			priv->content_info = mte_get_mem_tag(content_kaddr);
> +		}
> +
> +		set_bit(PG_mte_tagged, &priv->refpage->flags);

As per your example, the reference page may be in the data section. We
explicitly avoid tagged pages in there (could be a file mmap, not
necessarily bss) but you just force it here. I'd rather skip the tagging
altogether if the page was not mapped with PROT_MTE.

> +		if (size >= MTE_GRANULE_SIZE) {
> +			char *refpage_kaddr = page_address(priv->refpage);
> +			unsigned long i;
> +
> +			for (i = 0; i != PAGE_SIZE; i += size)
> +				mte_memcpy(refpage_kaddr + i, content_kaddr,
> +					   size);
> +			return;
> +		}
> +		mte_clear_page_tags(page_address(priv->refpage));
> +	}
> +
> +	prep_refpage_private_data(priv, content_kaddr, size);
> +}

IIUC this function writes to the reference page provided by the user but
get_user_pages() did not use a FOLL_WRITE. Even worse, the
refpage_create() syscall does not seem to reject size != PAGE_SIZE
values, so the user may think it's fine to have a shorter buffer with
the kernel overriding whatever is around it. If the API states that the
buffer must be PAGE_SIZE, why not get the user to populate it and avoid
this arch_prep_refpage_private_data() altogether. I think it
overcomplicates the kernel for something the user could do.

> +
> +void arch_copy_refpage(struct page *page, unsigned long addr,
> +		       struct vm_area_struct *vma)
> +{
> +	struct refpage_private_data *priv = vma->vm_private_data;
> +
> +	if (priv->content_type == REFPAGE_CONTENT_TYPE_MTE_TAGGED)
> +		tag_set_highpage(page, priv->content_info);
> +	else
> +		copy_refpage(page, addr, vma);
> +}

Another heuristics targeted at your specific allocator. Just leave it as
a later optimisation or at least a separate patch.

> diff --git a/mm/refpage.c b/mm/refpage.c
> new file mode 100644
> index 000000000000..13c0930dc926
> --- /dev/null
> +++ b/mm/refpage.c
> @@ -0,0 +1,151 @@
[...]
> +SYSCALL_DEFINE3(refpage_create, const void *__user, content, unsigned int,
> +		size, unsigned long, flags)
> +{
> +	unsigned long content_addr = (unsigned long)content;
> +	char *content_kaddr;
> +	struct page *userpage;
> +	struct refpage_private_data *private_data;
> +	long retval;
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	if (!is_power_of_2(size) || size > PAGE_SIZE ||
> +	    (content_addr & (size - 1)) != 0)
> +		return -EINVAL;

Alternative: copy the pattern from the user content to a kernel buffer
and in the file ops .mmap, instead of vma_set_anonymous(), just
implement your own vma ops with a .fault callback that takes care of
copying the pattern to a newly allocated page on any fault.

> +
> +	if (get_user_pages(content_addr & ~(PAGE_SIZE - 1), 1, 0, &userpage,
> +			   0) != 1)
> +		return -EFAULT;

FOLL_WRITE to the gup flags?

-- 
Catalin



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux