Re: Warning on mremapped uffd-wp memory

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 07/08/2024 19:59, Peter Xu wrote:
> On Wed, Aug 07, 2024 at 12:18:18PM +0200, David Hildenbrand wrote:
>> On 07.08.24 10:58, David Hildenbrand wrote:
>>> On 06.08.24 22:29, Peter Xu wrote:
>>>> On Tue, Aug 06, 2024 at 06:37:55PM +0200, David Hildenbrand wrote:
>>>>> On 06.08.24 17:15, Ryan Roberts wrote:
>>>>>> Hi Peter, David,
>>>>
>>>> Hi, Ryan,
>>>>
>>>>>>
>>>>>> syzkaller has found an issue (at least on arm64, but I suspect it will be
>>>>>> visible on x86_64 too) that triggers the following warning:
>>>>
>>>> This is true.  I can easily reproduce..
>>>>
>>>>>>
>>>>>> [ 2291.836518] ------------[ cut here ]------------
>>>>>> [ 2291.836528] WARNING: CPU: 3 PID: 9056 at mm/page_table_check.c:207 __page_table_check_ptes_set+0x22c/0x248
>>>>>> [ 2291.836541] Modules linked in:
>>>>>> [ 2291.836549] CPU: 3 UID: 1000 PID: 9056 Comm: bug Tainted: G        W          6.11.0-rc2-dirty #2
>>>>>> [ 2291.836554] Tainted: [W]=WARN
>>>>>> [ 2291.836557] Hardware name: linux,dummy-virt (DT)
>>>>>> [ 2291.836559] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>> [ 2291.836564] pc : __page_table_check_ptes_set+0x22c/0x248
>>>>>> [ 2291.836568] lr : ptep_modify_prot_commit+0x24c/0x2b0
>>>>>> [ 2291.836573] sp : ffff80008ca6ba20
>>>>>> [ 2291.836575] x29: ffff80008ca6ba20 x28: ffff186392d1eb00 x27: 0000000020ffd000
>>>>>> [ 2291.836598] x26: 0010000000000001 x25: 0000000000000001 x24: 0000000000000000
>>>>>> [ 2291.836605] x23: 04e800018c738f43 x22: 0000000000000001 x21: ffff1863824163c0
>>>>>> [ 2291.836612] x20: 04e800018c738f43 x19: 04e800018c738f43 x18: 0000fffff7f87fff
>>>>>> [ 2291.836619] x17: 0000000000000000 x16: 1fffe30c748d22a1 x15: 0060000000000fc3
>>>>>> [ 2291.836625] x14: 0000000000000000 x13: 0000000020ffd000 x12: 0000fffff7f87fff
>>>>>> [ 2291.836631] x11: 0000000020ffd000 x10: 0000000000000000 x9 : ffffbcab99e3ab84
>>>>>> [ 2291.836638] x8 : ffff186382b8f000 x7 : 0000000020ffe000 x6 : 0000000020ffd000
>>>>>> [ 2291.836644] x5 : ffff186392d1eb00 x4 : 04e800018c738f43 x3 : 0000000000000001
>>>>>> [ 2291.836650] x2 : 04e800018c738f43 x1 : ffff18639fe01fe8 x0 : ffffbcab9ce56780
>>>>>> [ 2291.836657] Call trace:
>>>>>> [ 2291.836659]  __page_table_check_ptes_set+0x22c/0x248
>>>>>> [ 2291.836664]  ptep_modify_prot_commit+0x24c/0x2b0
>>>>>> [ 2291.836667]  change_protection+0x8a0/0x1100
>>>>>> [ 2291.836672]  mprotect_fixup+0x124/0x2d0
>>>>>> [ 2291.836675]  do_mprotect_pkey.constprop.0+0x29c/0x460
>>>>>> [ 2291.836679]  __arm64_sys_mprotect+0x24/0xf8
>>>>>> [ 2291.836682]  invoke_syscall+0x50/0x120
>>>>>> [ 2291.836690]  el0_svc_common.constprop.0+0x48/0xf0
>>>>>> [ 2291.836694]  do_el0_svc+0x24/0x38
>>>>>> [ 2291.836699]  el0_svc+0x34/0xe0
>>>>>> [ 2291.836705]  el0t_64_sync_handler+0x100/0x130
>>>>>> [ 2291.836709]  el0t_64_sync+0x190/0x198
>>>>>> [ 2291.836713] ---[ end trace 0000000000000000 ]---
>>>>>>
>>>>>> The generated program (see below) mmaps a 16M region (RWX). It then mlocks all
>>>>>> current and future memory.
>>>>>>
>>>>>> Next, it registers 12K (3 pages) for use with UFFD-WP, and marks 4 pages
>>>>>> UFFD-WP'ed. This returns ENOENT because we only registered 3 pages, but those 3
>>>>>> pages are still UFFD-WP'ed in their PTE, so this error is not relavent to the
>>>>>> bug. At this point, there is a single VMA covering the 12K, with VM_UFFD_WP set,
>>>>>> amongst other flags:
>>>>>>
>>>>>>      20ffb000-20ffe000 rwxp 00000000 00:00 0
>>>>>>      Size:                 12 kB
>>>>>>      KernelPageSize:        4 kB
>>>>>>      MMUPageSize:           4 kB
>>>>>>      Rss:                  12 kB
>>>>>>      Pss:                  12 kB
>>>>>>      Pss_Dirty:            12 kB
>>>>>>      Shared_Clean:          0 kB
>>>>>>      Shared_Dirty:          0 kB
>>>>>>      Private_Clean:         0 kB
>>>>>>      Private_Dirty:        12 kB
>>>>>>      Referenced:           12 kB
>>>>>>      Anonymous:            12 kB
>>>>>>      KSM:                   0 kB
>>>>>>      LazyFree:              0 kB
>>>>>>      AnonHugePages:         0 kB
>>>>>>      ShmemPmdMapped:        0 kB
>>>>>>      FilePmdMapped:         0 kB
>>>>>>      Shared_Hugetlb:        0 kB
>>>>>>      Private_Hugetlb:       0 kB
>>>>>>      Swap:                  0 kB
>>>>>>      SwapPss:               0 kB
>>>>>>      Locked:               12 kB
>>>>>>      THPeligible:           0
>>>>>>      VmFlags: rd wr ex mr mw me uw lo ac
>>>>>>
>>>>>> Next we mremap the first page to the address where the last page was previously
>>>>>> mapped, with MREMAP_DONTUNMAP. This leads to 2 VMAs, but the new one doesn't
>>>>>> have VM_UFFD_WP set (Note also that the original VMA no longer has VM_LOCKED
>>>>>> which seems wrong to me, but I'll ignore that for now):
>>>>>>
>>>>>>      20ffb000-20ffd000 rwxp 00000000 00:00 0
>>>>>>      Size:                  8 kB
>>>>>>      KernelPageSize:        4 kB
>>>>>>      MMUPageSize:           4 kB
>>>>>>      Rss:                   4 kB
>>>>>>      Pss:                   4 kB
>>>>>>      Pss_Dirty:             4 kB
>>>>>>      Shared_Clean:          0 kB
>>>>>>      Shared_Dirty:          0 kB
>>>>>>      Private_Clean:         0 kB
>>>>>>      Private_Dirty:         4 kB
>>>>>>      Referenced:            4 kB
>>>>>>      Anonymous:             4 kB
>>>>>>      KSM:                   0 kB
>>>>>>      LazyFree:              0 kB
>>>>>>      AnonHugePages:         0 kB
>>>>>>      ShmemPmdMapped:        0 kB
>>>>>>      FilePmdMapped:         0 kB
>>>>>>      Shared_Hugetlb:        0 kB
>>>>>>      Private_Hugetlb:       0 kB
>>>>>>      Swap:                  0 kB
>>>>>>      SwapPss:               0 kB
>>>>>>      Locked:                0 kB
>>>>>>      THPeligible:           0
>>>>>>      VmFlags: rd wr ex mr mw me uw ac
>>>>>>      20ffd000-20ffe000 rwxp 00000000 00:00 0
>>>>>>      Size:                  4 kB
>>>>>>      KernelPageSize:        4 kB
>>>>>>      MMUPageSize:           4 kB
>>>>>>      Rss:                   4 kB
>>>>>>      Pss:                   4 kB
>>>>>>      Pss_Dirty:             4 kB
>>>>>>      Shared_Clean:          0 kB
>>>>>>      Shared_Dirty:          0 kB
>>>>>>      Private_Clean:         0 kB
>>>>>>      Private_Dirty:         4 kB
>>>>>>      Referenced:            4 kB
>>>>>>      Anonymous:             4 kB
>>>>>>      KSM:                   0 kB
>>>>>>      LazyFree:              0 kB
>>>>>>      AnonHugePages:         0 kB
>>>>>>      ShmemPmdMapped:        0 kB
>>>>>>      FilePmdMapped:         0 kB
>>>>>>      Shared_Hugetlb:        0 kB
>>>>>>      Private_Hugetlb:       0 kB
>>>>>>      Swap:                  0 kB
>>>>>>      SwapPss:               0 kB
>>>>>>      Locked:                4 kB
>>>>>>      THPeligible:           0
>>>>>>      VmFlags: rd wr ex mr mw me lo ac
>>>>>>
>>>>>> Finally we try to mprotect that last 4K region to remove X, and we get the
>>>>>> warning saying the PTE has both the UFFD-WP and WRITE bits set.
>>>>>>
>>>>>> I'm guessing this is because the VM_UFFD_WP flag got spuriously dropped when
>>>>>> creating the final 4K VMA and so mprotect's can_change_pte_writable() check
>>>>>> incorrectly allowed the pte to be marked writable. But the mremap man page is
>>>>>> not very clear on the semantics when interacting with uffd regions; perhaps
>>>>>> uffd-wp bit should have been cleared when mremapping the ptes?
>>>>>>
>>>>>> I'm hoping you can advice on the expected semantics and we can figure out how to
>>>>>> solve this?
>>>>>>
>>>>>>
>>>>>> The reproducer is as follows (with a few annotations added by me):
>>>>>>
>>>>>> """
>>>>>> // autogenerated by syzkaller (https://github.com/google/syzkaller)
>>>>>>
>>>>>> #define _GNU_SOURCE
>>>>>>
>>>>>> #include <endian.h>
>>>>>> #include <stdint.h>
>>>>>> #include <stdio.h>
>>>>>> #include <stdlib.h>
>>>>>> #include <string.h>
>>>>>> #include <sys/syscall.h>
>>>>>> #include <sys/types.h>
>>>>>> #include <unistd.h>
>>>>>>
>>>>>> #ifndef __NR_ioctl
>>>>>> #define __NR_ioctl 29
>>>>>> #endif
>>>>>> #ifndef __NR_mlockall
>>>>>> #define __NR_mlockall 230
>>>>>> #endif
>>>>>> #ifndef __NR_mmap
>>>>>> #define __NR_mmap 222
>>>>>> #endif
>>>>>> #ifndef __NR_mprotect
>>>>>> #define __NR_mprotect 226
>>>>>> #endif
>>>>>> #ifndef __NR_mremap
>>>>>> #define __NR_mremap 216
>>>>>> #endif
>>>>>> #ifndef __NR_userfaultfd
>>>>>> #define __NR_userfaultfd 282
>>>>>> #endif
>>>>>>
>>>>>> uint64_t r[1] = {0xffffffffffffffff};
>>>>>>
>>>>>> int main(void)
>>>>>> {
>>>>>> 	intptr_t res = 0;
>>>>>>
>>>>>> 	syscall(__NR_mmap, /*addr=*/0x1ffff000ul, /*len=*/0x1000ul, /*prot=*/0ul, /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/0x32ul, /*fd=*/-1, /*offset=*/0ul);
>>>>>> 	syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0x1000000ul, /*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/7ul, /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/0x32ul, /*fd=*/-1, /*offset=*/0ul);
>>>>>> 	syscall(__NR_mmap, /*addr=*/0x21000000ul, /*len=*/0x1000ul, /*prot=*/0ul, /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/0x32ul, /*fd=*/-1, /*offset=*/0ul);
>>>>>>
>>>>>> 	write(1, "executing program\n", sizeof("executing program\n") - 1);
>>>>>>
>>>>>> 	// userfaultfd(UFFD_USER_MODE_ONLY)        = 3
>>>>>> 	res = syscall(__NR_userfaultfd, /*flags=UFFD_USER_MODE_ONLY*/1ul);
>>>>>> 	if (res != -1)
>>>>>> 		r[0] = res;
>>>>>>
>>>>>> 	// ioctl(3, UFFDIO_API, {api=0xaa, features=0 => features=UFFD_FEATURE_PAGEFAULT_FLAG_WP|UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP|UFFD_FEATURE_EVENT_REMOVE|UFFD_FEATURE_MISSING_HUGETLBFS|UFFD_FEATURE_MISSING_SHMEM|UFFD_FEATURE_EVENT_UNMAP|UFFD_FEATURE_SIGBUS|UFFD_FEATURE_THREAD_ID|UFFD_FEATURE_MINOR_HUGETLBFS|UFFD_FEATURE_MINOR_SHMEM|0x1f800, ioctls=1<<_UFFDIO_REGISTER|1<<_UFFDIO_UNREGISTER|1<<_UFFDIO_API}) = 0
>>>>>> 	*(uint64_t*)0x20000000 = 0xaa;
>>>>>> 	*(uint64_t*)0x20000008 = 0;
>>>>>> 	*(uint64_t*)0x20000010 = 0;
>>>>>> 	syscall(__NR_ioctl, /*fd=*/r[0], /*cmd=*/0xc018aa3f, /*arg=*/0x20000000ul);
>>>>>>
>>>>>> 	syscall(__NR_mlockall, /*flags=MCL_FUTURE|MCL_CURRENT*/3ul);
>>>>>>
>>>>>> 	// ioctl(3, UFFDIO_REGISTER, {range={start=0x20ffb000, len=0x3000}, mode=UFFDIO_REGISTER_MODE_WP, ioctls=1<<_UFFDIO_WAKE|1<<_UFFDIO_COPY|1<<_UFFDIO_ZEROPAGE|1<<_UFFDIO_WRITEPROTECT|0x120}) = 0
>>>>>> 	*(uint64_t*)0x20000180 = 0x20ffb000;
>>>>>> 	*(uint64_t*)0x20000188 = 0x3000;
>>>>>> 	*(uint64_t*)0x20000190 = 2;
>>>>>> 	*(uint64_t*)0x20000198 = 0;
>>>>>> 	syscall(__NR_ioctl, /*fd=*/r[0], /*cmd=*/0xc020aa00, /*arg=*/0x20000180ul);
>>>>>>
>>>>>> 	// ioctl(3, UFFDIO_WRITEPROTECT, 0x20000080) = -1 ENOENT (No such file or directory)
>>>>>> 	*(uint64_t*)0x20000080 = 0x20ffb000;
>>>>>> 	*(uint64_t*)0x20000088 = 0x4000;
>>>>>> 	*(uint64_t*)0x20000090 = 1;
>>>>>> 	syscall(__NR_ioctl, /*fd=*/r[0], /*cmd=*/0xc018aa06, /*arg=*/0x20000080ul);
>>>>>>
>>>>>> 	syscall(__NR_mremap, /*addr=*/0x20ffb000ul, /*len=*/0x1000ul, /*newlen=*/0x1000ul, /*flags=MREMAP_DONTUNMAP|MREMAP_FIXED|MREMAP_MAYMOVE*/7ul, /*newaddr=*/0x20ffd000ul);
>>>>>> 	syscall(__NR_mprotect, /*addr=*/0x20ffd000ul, /*len=*/0x1000ul, /*prot=PROT_WRITE|PROT_READ*/3ul);
>>>>>>
>>>>>> 	return 0;
>>>>>> }
>>>>>> """
>>>>>>
>>>>>> I'd appreciate any thoughts you may have!
>>>>>
>>>>> Interesting. Either the vma flag shouldn't get dropped or we should un-mark
>>>>> the PTEs.
>>>>>
>>>>> Is the vma flag maybe getting dropped because of some weird interaction with
>>>>> UFFD_EVENT_REMAP?
>>>>
>>>> Right, I think we should do the latter.
>>>>
>>>> We need to drop the vma flag by default, as you quoted in the other patch
>>>> in 2018, as the monitor process may not be able to process this otherwise,
>>>> seeing unknown address reported when read(). So instead we should drop the
>>>> uffd-wp bit here..
>>>
>>> For the records: I don't think the patch from 2018 made the right call.
>>>
>>> Was there a particular reason for the VMA flag changes (BUG report?). I
>>> can see why the wrongly sent event was problematic and had to be fixed.
>>>
>>> When dropping these VMA flags, especially for the missing mode, the app
>>> will suddenly get *wrong* data. Instead of the uffd monitor being in
>>> charge what to place, we will give it zeroed pages.
>>>
>>> To me that translates to a silent memory corruption.
>>>
>>> In contrast, if the monitor the pagefault information and let him
>>> realize that most likely he should be using UFFD_EVENT_REMAP.
>>>
>>> I would reconsider that change in 2018. To me it would make more sense
>>> to not drop flags during mremap.
>>
>> ... looking into the implementation some more, I think there might be more
>> issues lurking, and I'm not 100% sure what the right semantics are.
>>
>> Assume we use mremap() to grow a VMA that has uffd registered. Looks like
>> (did not try reproducing) vma_expandable() would just work and we would end
>> up growing the uffd range by doing vma_merge_extend(). Not sure if that's
>> intended or what the expected semantics are at all with all different corner
>> cases of mremap.
> 
> It may make sense for extension-only case indeed. Considering that the
> monitor will not have a source page to reference when filling in those
> holes anyway, so that if ZEROPAGE would be a fallback to all unknown
> mappings then it could make sense.
> 
> But yeah let's always copy Mike for all cooperative mode userfaults.
> 
> When I'm looking at this specific issue again, it's more than ptes that
> should need to remove the uffd-wp bit.  We have:
> 
>   - pmd/pud/hugetlb in other paths that will need similar care..
> 
>   - move_page_tables() smartness on HAVE_MOVE_PUD.. where we may need to
>     walk the pmd page removing the bits when necessary..
> 
>   - more importantly, mremap_userfaultfd_prep() might be too late if it's
>     after moving pgtables..
> 
>   - [not yet started looking] the mlock issue Ryan mentioned..
> 
> Looks like we'll need more things to fix and test..
> 
> I wished if I can simply disable UFFD_WP + EVENT_REMAP, but I think even
> with that, by default when mremap() we should still logically tear down all
> those uffd-wp bits which is the same as !EVENT_REMAP now..
> 
> Let me know if anyone would like to beat me to it on fixing the whole
> thing, I'd be more than happy..  

Afraid I won't be able to sign up to doing that work.

Otherwise, I'll probably need to postpone
> the fix of this issue for 1-2 weeks but finish some other things first..

There is no hurry from my perspective. I guess its been this way for quite some
time and it was only noticed via fuzzing.

Thanks,
Ryan

> 
> Thanks,
> 





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux