Re: [PATCH v2] perf: map pages in advance

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Nov 29, 2024 at 03:31:34PM +0000, Lorenzo Stoakes wrote:
> We are current refactoring struct page to make it smaller, removing
> unneeded fields that correctly belong to struct folio.
> 
> Two of those fields are page->index and page->mapping. Perf is currently
> making use of both of these, so this patch removes this usage as it turns
> out it is unnecessary.
> 
> Perf establishes its own internally controlled memory-mapped pages using
> vm_ops hooks. The first page in the mapping is the read/write user control
> page, and the rest of the mapping consists of read-only pages.
> 
> The VMA is backed by kernel memory either from the buddy allocator or
> vmalloc depending on configuration. It is intended to be mapped read/write,
> but because it has a page_mkwrite() hook, vma_wants_writenotify() indicaets
> that it should be mapped read-only.
> 
> When a write fault occurs, the provided page_mkwrite() hook,
> perf_mmap_fault() (doing double duty handing faults as well) uses the
> vmf->pgoff field to determine if this is the first page, allowing for the
> desired read/write first page, read-only rest mapping.
> 
> For this to work the implementation has to carefully work around faulting
> logic. When a page is write-faulted, the fault() hook is called first, then
> its page_mkwrite() hook is called (to allow for dirty tracking in file
> systems).
> 
> On fault we set the folio's mapping in perf_mmap_fault(), this is because
> when do_page_mkwrite() is subsequently invoked, it treats a missing mapping
> as an indicator that the fault should be retried.
> 
> We also set the folio's index so, given the folio is being treated as faux
> user memory, it correctly references its offset within the VMA.
> 
> This explains why the mapping and index fields are used - but it's not
> necessary.
> 
> We preallocate pages when perf_mmap() is called for the first time via
> rb_alloc(), and further allocate auxiliary pages via rb_aux_alloc() as
> needed if the mapping requires it.
> 
> This allocation is done in the f_ops->mmap() hook provided in perf_mmap(),
> and so we can instead simply map all the memory right away here - there's
> no point in handling (read) page faults when we don't demand page nor need
> to be notified about them (perf does not).
> 
> This patch therefore changes this logic to map everything when the mmap()
> hook is called, establishing a PFN map. It implements vm_ops->pfn_mkwrite()
> to provide the required read/write vs. read-only behaviour, which does not
> require the previously implemented workarounds.
> 
> While it is not ideal to use a VM_PFNMAP here, doing anything else will
> result in the page_mkwrite() hook need to be provided, which requires the
> same page->mapping hack this patch seeks to undo.
> 
> It will also result in the pages being treated as folios and placed on the
> rmap, which really does not make sense for these mappings.
> 
> Semantically it makes sense to establish this as some kind of special
> mapping, as the pages are managed by perf and are not strictly user pages,
> but currently the only means by which we can do so functionally while
> maintaining the required R/W and R/O bheaviour is a PFN map.
> 
> There should be no change to actual functionality as a result of this
> change.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> ---
> v2:
> * nommu fixup.
> * Add comment explaining why we are using a VM_PFNMAP as suggested by
>   David H.
> 
> v1:
> https://lore.kernel.org/all/20241128113714.492474-1-lorenzo.stoakes@xxxxxxxxxx/
> 
>  kernel/events/core.c        | 116 ++++++++++++++++++++++++------------
>  kernel/events/ring_buffer.c |  19 +-----
>  2 files changed, 80 insertions(+), 55 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 5d4a54f50826..1bb5999d9d81 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6284,41 +6284,6 @@ void perf_event_update_userpage(struct perf_event *event)
>  }
>  EXPORT_SYMBOL_GPL(perf_event_update_userpage);
> 
> -static vm_fault_t perf_mmap_fault(struct vm_fault *vmf)
> -{
> -	struct perf_event *event = vmf->vma->vm_file->private_data;
> -	struct perf_buffer *rb;
> -	vm_fault_t ret = VM_FAULT_SIGBUS;
> -
> -	if (vmf->flags & FAULT_FLAG_MKWRITE) {
> -		if (vmf->pgoff == 0)
> -			ret = 0;
> -		return ret;
> -	}
> -
> -	rcu_read_lock();
> -	rb = rcu_dereference(event->rb);
> -	if (!rb)
> -		goto unlock;
> -
> -	if (vmf->pgoff && (vmf->flags & FAULT_FLAG_WRITE))
> -		goto unlock;
> -
> -	vmf->page = perf_mmap_to_page(rb, vmf->pgoff);
> -	if (!vmf->page)
> -		goto unlock;
> -
> -	get_page(vmf->page);
> -	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
> -	vmf->page->index   = vmf->pgoff;
> -
> -	ret = 0;
> -unlock:
> -	rcu_read_unlock();
> -
> -	return ret;
> -}
> -
>  static void ring_buffer_attach(struct perf_event *event,
>  			       struct perf_buffer *rb)
>  {
> @@ -6558,13 +6523,87 @@ static void perf_mmap_close(struct vm_area_struct *vma)
>  	ring_buffer_put(rb); /* could be last */
>  }
> 
> +static vm_fault_t perf_mmap_pfn_mkwrite(struct vm_fault *vmf)
> +{
> +	/* The first page is the user control page, others are read-only. */
> +	return vmf->pgoff == 0 ? 0 : VM_FAULT_SIGBUS;
> +}
> +
>  static const struct vm_operations_struct perf_mmap_vmops = {
>  	.open		= perf_mmap_open,
>  	.close		= perf_mmap_close, /* non mergeable */
> -	.fault		= perf_mmap_fault,
> -	.page_mkwrite	= perf_mmap_fault,
> +	.pfn_mkwrite	= perf_mmap_pfn_mkwrite,
>  };
> 
> +static int map_range(struct perf_buffer *rb, struct vm_area_struct *vma)
> +{
> +	unsigned long nr_pages = vma_pages(vma);
> +	int err = 0;
> +	unsigned long pgoff;
> +
> +	/*
> +	 * We map this as a VM_PFNMAP VMA.
> +	 *
> +	 * This is not ideal as this is designed broadly for mappings of PFNs
> +	 * referencing memory-mapped I/O ranges or non-system RAM i.e. for which
> +	 * !pfn_valid(pfn).
> +	 *
> +	 * We are mapping kernel-allocated memory (memory we manage ourselves)
> +	 * which would more ideally be mapped using vm_insert_page() or a
> +	 * similar mechanism, that is as a VM_MIXEDMAP mapping.
> +	 *
> +	 * However this won't work here, because:
> +	 *
> +	 * 1. It uses vma->vm_page_prot, but this field has not been completely
> +	 *    setup at the point of the f_op->mmp() hook, so we are unable to
> +	 *    indicate that this should be mapped CoW in order that the
> +	 *    mkwrite() hook can be invoked to make the first page R/W and the
> +	 *    rest R/O as desired.
> +	 *
> +	 * 2. Anything other than a VM_PFNMAP of valid PFNs will result in
> +	 *    vm_normal_page() returning a struct page * pointer, which means
> +	 *    vm_ops->page_mkwrite() will be invoked rather than
> +	 *    vm_ops->pfn_mkwrite(), and this means we have to set page->mapping
> +	 *    to work around retry logic in the fault handler, however this
> +	 *    field is no longer allowed to be used within struct page.
> +	 *
> +	 * 3. Having a struct page * made available in the fault logic also
> +	 *    means that the page gets put on the rmap and becomes
> +	 *    inappropriately accessible and subject to map and ref counting.
> +	 *
> +	 * Ideally we would have a mechanism that could explicitly express our
> +	 * desires, but this is not currently the case, so we instead use
> +	 * VM_PFNMAP.
> +	 *
> +	 * We manage the lifetime of these mappings with internal refcounts (see
> +	 * perf_mmap_open() and perf_mmap_close()) so we ensure the lifetime of
> +	 * this mapping is maintained correctly.
> +	 */
> +	for (pgoff = 0; pgoff < nr_pages; pgoff++) {
> +		unsigned long va = vma->vm_start + PAGE_SIZE * pgoff;
> +		struct page *page = perf_mmap_to_page(rb, pgoff);
> +
> +		if (page == NULL) {
> +			err = -EINVAL;
> +			break;
> +		}
> +
> +		/* Map readonly, perf_mmap_pfn_mkwrite() called on write fault. */
> +		err = remap_pfn_range(vma, va, page_to_pfn(page), PAGE_SIZE,
> +				      vm_get_page_prot(vma->vm_flags & ~VM_SHARED));
> +		if (err)
> +			break;
> +	}
> +
> +#ifdef CONFIG_MMU
> +	/* Clear any partial mappings on error. */
> +	if (err)
> +		zap_page_range_single(vma, vma->vm_start, nr_pages * PAGE_SIZE, NULL);
> +#endif
> +
> +	return err;
> +}
> +
>  static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>  	struct perf_event *event = file->private_data;
> @@ -6783,6 +6822,9 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  	vm_flags_set(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP);
>  	vma->vm_ops = &perf_mmap_vmops;
> 
> +	if (!ret)
> +		ret = map_range(rb, vma);
> +
>  	if (event->pmu->event_mapped)
>  		event->pmu->event_mapped(event, vma->vm_mm);
> 
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 4f46f688d0d4..180509132d4b 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -643,7 +643,6 @@ static void rb_free_aux_page(struct perf_buffer *rb, int idx)
>  	struct page *page = virt_to_page(rb->aux_pages[idx]);
> 
>  	ClearPagePrivate(page);
> -	page->mapping = NULL;
>  	__free_page(page);
>  }
> 
> @@ -819,7 +818,6 @@ static void perf_mmap_free_page(void *addr)
>  {
>  	struct page *page = virt_to_page(addr);
> 
> -	page->mapping = NULL;
>  	__free_page(page);
>  }
> 
> @@ -890,28 +888,13 @@ __perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff)
>  	return vmalloc_to_page((void *)rb->user_page + pgoff * PAGE_SIZE);
>  }
> 
> -static void perf_mmap_unmark_page(void *addr)
> -{
> -	struct page *page = vmalloc_to_page(addr);
> -
> -	page->mapping = NULL;
> -}
> -
>  static void rb_free_work(struct work_struct *work)
>  {
>  	struct perf_buffer *rb;
> -	void *base;
> -	int i, nr;
> 
>  	rb = container_of(work, struct perf_buffer, work);
> -	nr = data_page_nr(rb);
> -
> -	base = rb->user_page;
> -	/* The '<=' counts in the user page. */
> -	for (i = 0; i <= nr; i++)
> -		perf_mmap_unmark_page(base + (i * PAGE_SIZE));
> 
> -	vfree(base);
> +	vfree(rb->user_page);
>  	kfree(rb);
>  }
> 
> --
> 2.47.1

Hi Lorenzo Stoakes,

Greetings!

I used Syzkaller and found that there is general protection fault in perf_mmap_to_page in linux-next next-20241203.

After bisection and the first bad commit is:
"
eca51ce01d49 perf: Map pages in advance
"

All detailed into can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/241204_084442_perf_mmap_to_page/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/241204_084442_perf_mmap_to_page/bzImage_c245a7a79602ccbee780c004c1e4abcda66aec32
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/241204_084442_perf_mmap_to_page/c245a7a79602ccbee780c004c1e4abcda66aec32_dmesg.log

"
[   22.133358] KASAN: null-ptr-deref in range [0x0000000000000178-0x000000000000017f]
[   22.133907] CPU: 0 UID: 0 PID: 727 Comm: repro Not tainted 6.13.0-rc1-next-20241203-c245a7a79602 #1
[   22.134557] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   22.135371] RIP: 0010:perf_mmap_to_page+0x39/0x500
[   22.135763] Code: 41 56 41 55 41 54 49 89 f4 53 48 89 fb e8 3f 5f c2 ff 48 8d bb 78 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e e9 03 00 00 4c 63 ab 78 01 00
[   22.137075] RSP: 0018:ffff888020f0f798 EFLAGS: 00010202
[   22.137465] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
[   22.137980] RDX: 000000000000002f RSI: ffffffff81a5ccf1 RDI: 0000000000000178
[   22.138495] RBP: ffff888020f0f7c0 R08: 0000000000000001 R09: ffffed10025fbdb0
[   22.139012] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[   22.139530] R13: 0000000000000000 R14: 0000000020002000 R15: ffff888011cce3c0
[   22.140047] FS:  00007f7f57f30600(0000) GS:ffff88806c400000(0000) knlGS:0000000000000000
[   22.140630] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   22.141052] CR2: 00000000200000c0 CR3: 0000000014e10004 CR4: 0000000000770ef0
[   22.141570] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   22.142088] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[   22.142606] PKRU: 55555554
[   22.142815] Call Trace:
[   22.143005]  <TASK>
[   22.143173]  ? show_regs+0x6d/0x80
[   22.143455]  ? die_addr+0x45/0xb0
[   22.143720]  ? exc_general_protection+0x1ae/0x340
[   22.144102]  ? asm_exc_general_protection+0x2b/0x30
[   22.144486]  ? perf_mmap_to_page+0x21/0x500
[   22.144810]  ? perf_mmap_to_page+0x39/0x500
[   22.145130]  ? perf_mmap_to_page+0x21/0x500
[   22.145448]  perf_mmap+0xbd9/0x1ce0
[   22.145729]  __mmap_region+0x10e7/0x25a0
[   22.146038]  ? __pfx___mmap_region+0x10/0x10
[   22.146376]  ? mark_lock.part.0+0xf3/0x17b0
[   22.146712]  ? __pfx_mark_lock.part.0+0x10/0x10
[   22.147071]  ? __kasan_check_read+0x15/0x20
[   22.147403]  ? mark_lock.part.0+0xf3/0x17b0
[   22.147744]  ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[   22.148162]  ? trace_cap_capable+0x78/0x1e0
[   22.148500]  ? cap_capable+0xa4/0x250
[   22.148792]  mmap_region+0x248/0x2f0
[   22.149086]  do_mmap+0xb29/0x12a0
[   22.149355]  ? __pfx_do_mmap+0x10/0x10
[   22.149651]  ? __pfx_down_write_killable+0x10/0x10
[   22.150027]  ? __this_cpu_preempt_check+0x21/0x30
[   22.150393]  vm_mmap_pgoff+0x235/0x3e0
[   22.150699]  ? __pfx_vm_mmap_pgoff+0x10/0x10
[   22.151037]  ? __fget_files+0x1fb/0x3a0
[   22.151352]  ksys_mmap_pgoff+0x3dc/0x520
[   22.151664]  __x64_sys_mmap+0x139/0x1d0
[   22.151975]  x64_sys_call+0x2001/0x2140
[   22.152283]  do_syscall_64+0x6d/0x140
[   22.152572]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   22.152960] RIP: 0033:0x7f7f57c3ee5d
[   22.153251] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
[   22.154593] RSP: 002b:00007ffd805489f8 EFLAGS: 00000212 ORIG_RAX: 0000000000000009
[   22.155156] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7f57c3ee5d
[   22.155683] RDX: 0000000000000000 RSI: 0000000000001000 RDI: 0000000020002000
[   22.156210] RBP: 00007ffd80548a20 R08: 0000000000000003 R09: 0000000000000000
[   22.156739] R10: 0000000000006053 R11: 0000000000000212 R12: 00007ffd80548b38
[   22.157263] R13: 0000000000401126 R14: 0000000000403e08 R15: 00007f7f57f77000
[   22.157799]  </TASK>
[   22.157975] Modules linked in:
[   22.158322] ---[ end trace 0000000000000000 ]---
[   22.158694] RIP: 0010:perf_mmap_to_page+0x39/0x500
[   22.159061] Code: 41 56 41 55 41 54 49 89 f4 53 48 89 fb e8 3f 5f c2 ff 48 8d bb 78 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e e9 03 00 00 4c 63 ab 78 01 00
[   22.160388] RSP: 0018:ffff888020f0f798 EFLAGS: 00010202
[   22.160782] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
[   22.161304] RDX: 000000000000002f RSI: ffffffff81a5ccf1 RDI: 0000000000000178
[   22.161824] RBP: ffff888020f0f7c0 R08: 0000000000000001 R09: ffffed10025fbdb0
[   22.162344] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[   22.162877] R13: 0000000000000000 R14: 0000000020002000 R15: ffff888011cce3c0
[   22.163403] FS:  00007f7f57f30600(0000) GS:ffff88806c400000(0000) knlGS:0000000000000000
[   22.163988] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   22.164417] CR2: 00000000200000c0 CR3: 0000000014e10004 CR4: 0000000000770ef0
[   22.165409] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   22.165956] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[   22.166918] PKRU: 55555554
"

I hope you find it useful.

Regards,
Yi Lai

---

If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You could change the bzImage_xxx as you want
  // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost

After login vm(virtual machine) successfully, you could transfer reproduced
binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           //x should equal or less than cpu num your pc has

Fill the bzImage file into above start3.sh to load the target kernel in vm.


Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux