On Wed, Aug 07, 2024 at 10:58:09AM +0200, David Hildenbrand wrote: > On 06.08.24 22:29, Peter Xu wrote: > > On Tue, Aug 06, 2024 at 06:37:55PM +0200, David Hildenbrand wrote: > > > On 06.08.24 17:15, Ryan Roberts wrote: > > > > Hi Peter, David, > > > > Hi, Ryan, > > > > > > > > > > syzkaller has found an issue (at least on arm64, but I suspect it will be > > > > visible on x86_64 too) that triggers the following warning: > > > > This is true. I can easily reproduce.. > > > > > > > > > > [ 2291.836518] ------------[ cut here ]------------ > > > > [ 2291.836528] WARNING: CPU: 3 PID: 9056 at mm/page_table_check.c:207 __page_table_check_ptes_set+0x22c/0x248 > > > > [ 2291.836541] Modules linked in: > > > > [ 2291.836549] CPU: 3 UID: 1000 PID: 9056 Comm: bug Tainted: G W 6.11.0-rc2-dirty #2 > > > > [ 2291.836554] Tainted: [W]=WARN > > > > [ 2291.836557] Hardware name: linux,dummy-virt (DT) > > > > [ 2291.836559] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > > > > [ 2291.836564] pc : __page_table_check_ptes_set+0x22c/0x248 > > > > [ 2291.836568] lr : ptep_modify_prot_commit+0x24c/0x2b0 > > > > [ 2291.836573] sp : ffff80008ca6ba20 > > > > [ 2291.836575] x29: ffff80008ca6ba20 x28: ffff186392d1eb00 x27: 0000000020ffd000 > > > > [ 2291.836598] x26: 0010000000000001 x25: 0000000000000001 x24: 0000000000000000 > > > > [ 2291.836605] x23: 04e800018c738f43 x22: 0000000000000001 x21: ffff1863824163c0 > > > > [ 2291.836612] x20: 04e800018c738f43 x19: 04e800018c738f43 x18: 0000fffff7f87fff > > > > [ 2291.836619] x17: 0000000000000000 x16: 1fffe30c748d22a1 x15: 0060000000000fc3 > > > > [ 2291.836625] x14: 0000000000000000 x13: 0000000020ffd000 x12: 0000fffff7f87fff > > > > [ 2291.836631] x11: 0000000020ffd000 x10: 0000000000000000 x9 : ffffbcab99e3ab84 > > > > [ 2291.836638] x8 : ffff186382b8f000 x7 : 0000000020ffe000 x6 : 0000000020ffd000 > > > > [ 2291.836644] x5 : ffff186392d1eb00 x4 : 04e800018c738f43 x3 : 0000000000000001 > > > > [ 2291.836650] x2 : 04e800018c738f43 x1 : ffff18639fe01fe8 x0 : ffffbcab9ce56780 > > > > [ 2291.836657] Call trace: > > > > [ 2291.836659] __page_table_check_ptes_set+0x22c/0x248 > > > > [ 2291.836664] ptep_modify_prot_commit+0x24c/0x2b0 > > > > [ 2291.836667] change_protection+0x8a0/0x1100 > > > > [ 2291.836672] mprotect_fixup+0x124/0x2d0 > > > > [ 2291.836675] do_mprotect_pkey.constprop.0+0x29c/0x460 > > > > [ 2291.836679] __arm64_sys_mprotect+0x24/0xf8 > > > > [ 2291.836682] invoke_syscall+0x50/0x120 > > > > [ 2291.836690] el0_svc_common.constprop.0+0x48/0xf0 > > > > [ 2291.836694] do_el0_svc+0x24/0x38 > > > > [ 2291.836699] el0_svc+0x34/0xe0 > > > > [ 2291.836705] el0t_64_sync_handler+0x100/0x130 > > > > [ 2291.836709] el0t_64_sync+0x190/0x198 > > > > [ 2291.836713] ---[ end trace 0000000000000000 ]--- > > > > > > > > The generated program (see below) mmaps a 16M region (RWX). It then mlocks all > > > > current and future memory. > > > > > > > > Next, it registers 12K (3 pages) for use with UFFD-WP, and marks 4 pages > > > > UFFD-WP'ed. This returns ENOENT because we only registered 3 pages, but those 3 > > > > pages are still UFFD-WP'ed in their PTE, so this error is not relavent to the > > > > bug. At this point, there is a single VMA covering the 12K, with VM_UFFD_WP set, > > > > amongst other flags: > > > > > > > > 20ffb000-20ffe000 rwxp 00000000 00:00 0 > > > > Size: 12 kB > > > > KernelPageSize: 4 kB > > > > MMUPageSize: 4 kB > > > > Rss: 12 kB > > > > Pss: 12 kB > > > > Pss_Dirty: 12 kB > > > > Shared_Clean: 0 kB > > > > Shared_Dirty: 0 kB > > > > Private_Clean: 0 kB > > > > Private_Dirty: 12 kB > > > > Referenced: 12 kB > > > > Anonymous: 12 kB > > > > KSM: 0 kB > > > > LazyFree: 0 kB > > > > AnonHugePages: 0 kB > > > > ShmemPmdMapped: 0 kB > > > > FilePmdMapped: 0 kB > > > > Shared_Hugetlb: 0 kB > > > > Private_Hugetlb: 0 kB > > > > Swap: 0 kB > > > > SwapPss: 0 kB > > > > Locked: 12 kB > > > > THPeligible: 0 > > > > VmFlags: rd wr ex mr mw me uw lo ac > > > > > > > > Next we mremap the first page to the address where the last page was previously > > > > mapped, with MREMAP_DONTUNMAP. This leads to 2 VMAs, but the new one doesn't > > > > have VM_UFFD_WP set (Note also that the original VMA no longer has VM_LOCKED > > > > which seems wrong to me, but I'll ignore that for now): > > > > > > > > 20ffb000-20ffd000 rwxp 00000000 00:00 0 > > > > Size: 8 kB > > > > KernelPageSize: 4 kB > > > > MMUPageSize: 4 kB > > > > Rss: 4 kB > > > > Pss: 4 kB > > > > Pss_Dirty: 4 kB > > > > Shared_Clean: 0 kB > > > > Shared_Dirty: 0 kB > > > > Private_Clean: 0 kB > > > > Private_Dirty: 4 kB > > > > Referenced: 4 kB > > > > Anonymous: 4 kB > > > > KSM: 0 kB > > > > LazyFree: 0 kB > > > > AnonHugePages: 0 kB > > > > ShmemPmdMapped: 0 kB > > > > FilePmdMapped: 0 kB > > > > Shared_Hugetlb: 0 kB > > > > Private_Hugetlb: 0 kB > > > > Swap: 0 kB > > > > SwapPss: 0 kB > > > > Locked: 0 kB > > > > THPeligible: 0 > > > > VmFlags: rd wr ex mr mw me uw ac > > > > 20ffd000-20ffe000 rwxp 00000000 00:00 0 > > > > Size: 4 kB > > > > KernelPageSize: 4 kB > > > > MMUPageSize: 4 kB > > > > Rss: 4 kB > > > > Pss: 4 kB > > > > Pss_Dirty: 4 kB > > > > Shared_Clean: 0 kB > > > > Shared_Dirty: 0 kB > > > > Private_Clean: 0 kB > > > > Private_Dirty: 4 kB > > > > Referenced: 4 kB > > > > Anonymous: 4 kB > > > > KSM: 0 kB > > > > LazyFree: 0 kB > > > > AnonHugePages: 0 kB > > > > ShmemPmdMapped: 0 kB > > > > FilePmdMapped: 0 kB > > > > Shared_Hugetlb: 0 kB > > > > Private_Hugetlb: 0 kB > > > > Swap: 0 kB > > > > SwapPss: 0 kB > > > > Locked: 4 kB > > > > THPeligible: 0 > > > > VmFlags: rd wr ex mr mw me lo ac > > > > > > > > Finally we try to mprotect that last 4K region to remove X, and we get the > > > > warning saying the PTE has both the UFFD-WP and WRITE bits set. > > > > > > > > I'm guessing this is because the VM_UFFD_WP flag got spuriously dropped when > > > > creating the final 4K VMA and so mprotect's can_change_pte_writable() check > > > > incorrectly allowed the pte to be marked writable. But the mremap man page is > > > > not very clear on the semantics when interacting with uffd regions; perhaps > > > > uffd-wp bit should have been cleared when mremapping the ptes? > > > > > > > > I'm hoping you can advice on the expected semantics and we can figure out how to > > > > solve this? > > > > > > > > > > > > The reproducer is as follows (with a few annotations added by me): > > > > > > > > """ > > > > // autogenerated by syzkaller (https://github.com/google/syzkaller) > > > > > > > > #define _GNU_SOURCE > > > > > > > > #include <endian.h> > > > > #include <stdint.h> > > > > #include <stdio.h> > > > > #include <stdlib.h> > > > > #include <string.h> > > > > #include <sys/syscall.h> > > > > #include <sys/types.h> > > > > #include <unistd.h> > > > > > > > > #ifndef __NR_ioctl > > > > #define __NR_ioctl 29 > > > > #endif > > > > #ifndef __NR_mlockall > > > > #define __NR_mlockall 230 > > > > #endif > > > > #ifndef __NR_mmap > > > > #define __NR_mmap 222 > > > > #endif > > > > #ifndef __NR_mprotect > > > > #define __NR_mprotect 226 > > > > #endif > > > > #ifndef __NR_mremap > > > > #define __NR_mremap 216 > > > > #endif > > > > #ifndef __NR_userfaultfd > > > > #define __NR_userfaultfd 282 > > > > #endif > > > > > > > > uint64_t r[1] = {0xffffffffffffffff}; > > > > > > > > int main(void) > > > > { > > > > intptr_t res = 0; > > > > > > > > syscall(__NR_mmap, /*addr=*/0x1ffff000ul, /*len=*/0x1000ul, /*prot=*/0ul, /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/0x32ul, /*fd=*/-1, /*offset=*/0ul); > > > > syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0x1000000ul, /*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/7ul, /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/0x32ul, /*fd=*/-1, /*offset=*/0ul); > > > > syscall(__NR_mmap, /*addr=*/0x21000000ul, /*len=*/0x1000ul, /*prot=*/0ul, /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/0x32ul, /*fd=*/-1, /*offset=*/0ul); > > > > > > > > write(1, "executing program\n", sizeof("executing program\n") - 1); > > > > > > > > // userfaultfd(UFFD_USER_MODE_ONLY) = 3 > > > > res = syscall(__NR_userfaultfd, /*flags=UFFD_USER_MODE_ONLY*/1ul); > > > > if (res != -1) > > > > r[0] = res; > > > > > > > > // ioctl(3, UFFDIO_API, {api=0xaa, features=0 => features=UFFD_FEATURE_PAGEFAULT_FLAG_WP|UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP|UFFD_FEATURE_EVENT_REMOVE|UFFD_FEATURE_MISSING_HUGETLBFS|UFFD_FEATURE_MISSING_SHMEM|UFFD_FEATURE_EVENT_UNMAP|UFFD_FEATURE_SIGBUS|UFFD_FEATURE_THREAD_ID|UFFD_FEATURE_MINOR_HUGETLBFS|UFFD_FEATURE_MINOR_SHMEM|0x1f800, ioctls=1<<_UFFDIO_REGISTER|1<<_UFFDIO_UNREGISTER|1<<_UFFDIO_API}) = 0 > > > > *(uint64_t*)0x20000000 = 0xaa; > > > > *(uint64_t*)0x20000008 = 0; > > > > *(uint64_t*)0x20000010 = 0; > > > > syscall(__NR_ioctl, /*fd=*/r[0], /*cmd=*/0xc018aa3f, /*arg=*/0x20000000ul); > > > > > > > > syscall(__NR_mlockall, /*flags=MCL_FUTURE|MCL_CURRENT*/3ul); > > > > > > > > // ioctl(3, UFFDIO_REGISTER, {range={start=0x20ffb000, len=0x3000}, mode=UFFDIO_REGISTER_MODE_WP, ioctls=1<<_UFFDIO_WAKE|1<<_UFFDIO_COPY|1<<_UFFDIO_ZEROPAGE|1<<_UFFDIO_WRITEPROTECT|0x120}) = 0 > > > > *(uint64_t*)0x20000180 = 0x20ffb000; > > > > *(uint64_t*)0x20000188 = 0x3000; > > > > *(uint64_t*)0x20000190 = 2; > > > > *(uint64_t*)0x20000198 = 0; > > > > syscall(__NR_ioctl, /*fd=*/r[0], /*cmd=*/0xc020aa00, /*arg=*/0x20000180ul); > > > > > > > > // ioctl(3, UFFDIO_WRITEPROTECT, 0x20000080) = -1 ENOENT (No such file or directory) > > > > *(uint64_t*)0x20000080 = 0x20ffb000; > > > > *(uint64_t*)0x20000088 = 0x4000; > > > > *(uint64_t*)0x20000090 = 1; > > > > syscall(__NR_ioctl, /*fd=*/r[0], /*cmd=*/0xc018aa06, /*arg=*/0x20000080ul); > > > > > > > > syscall(__NR_mremap, /*addr=*/0x20ffb000ul, /*len=*/0x1000ul, /*newlen=*/0x1000ul, /*flags=MREMAP_DONTUNMAP|MREMAP_FIXED|MREMAP_MAYMOVE*/7ul, /*newaddr=*/0x20ffd000ul); > > > > syscall(__NR_mprotect, /*addr=*/0x20ffd000ul, /*len=*/0x1000ul, /*prot=PROT_WRITE|PROT_READ*/3ul); > > > > > > > > return 0; > > > > } > > > > """ > > > > > > > > I'd appreciate any thoughts you may have! > > > > > > Interesting. Either the vma flag shouldn't get dropped or we should un-mark > > > the PTEs. > > > > > > Is the vma flag maybe getting dropped because of some weird interaction with > > > UFFD_EVENT_REMAP? > > > > Right, I think we should do the latter. > > > > We need to drop the vma flag by default, as you quoted in the other patch > > in 2018, as the monitor process may not be able to process this otherwise, > > seeing unknown address reported when read(). So instead we should drop the > > uffd-wp bit here.. > > For the records: I don't think the patch from 2018 made the right call. > > Was there a particular reason for the VMA flag changes (BUG report?). I can > see why the wrongly sent event was problematic and had to be fixed. No bug report. > > When dropping these VMA flags, especially for the missing mode, the app will > suddenly get *wrong* data. Instead of the uffd monitor being in charge what > to place, we will give it zeroed pages. > > To me that translates to a silent memory corruption. > > In contrast, if the monitor the pagefault information and let him realize > that most likely he should be using UFFD_EVENT_REMAP. > > I would reconsider that change in 2018. To me it would make more sense to > not drop flags during mremap. The question is if without the remap event, even if the fault will be trapped properly, I don't see a way that a monitor process can know what to fill in.. I think that's why a monitor must register with the remap event if mremap() can happen. But yeah that's my limited understanding. Let me copy Mike too. Thanks, -- Peter Xu