Hi all, When running a fuzzer workload to test an unrelated patch[1], I've been reproducing the VM_BUG_ON() splat below[2] on s390x. I've narrowed the problem down to a deterministic reproducer. The code for that is also appended below[3]. The splat occurs because during fork() we end up down in copy_present_pte() -> page_try_dup_anon_rmap() -> page_needs_cow_for_dma() for a !is_cow mapping, so copy_page_range() didn't acquire the ->write_protect_seq seqlock as expected. After digging into this a bit, I _think_ this boils down to a bug in the s390 arch fault code dealing with a write fault to a !VM_WRITE mapping.. The sequence of events implemented by the reproducer that leads to this: 1. Create a shmem segment and attach it SHM_RDONLY. This causes do_mmap() to set up a !VM_WRITE mapping, but also clear (VM_MAYWRITE|VM_SHARED) on the mapping because the backing shmem file is read-only. 2. Take a write fault on the mapping in kernel mode (via getrlimit() in this case). The write fault ultimately causes getrlimit() to fail with a bad access error, but not before the generic fault code creates an anon_vma mapping for the page. This occurs because first do_dat_exception() calls handle_mm_fault() with FAULT_FLAG_WRITE via the following logic: access = VM_ACCESS_FLAGS; ... if (access == VM_WRITE || is_write) flags |= FAULT_FLAG_WRITE; ... if (unlikely(!(vma->vm_flags & access))) goto out_up; ... fault = handle_mm_fault(...); So the FAULT_FLAG_WRITE fault proceeds because is_write is true and ->vm_flags has read or exec permission (but not VM_WRITE). This eventually gets down into do_cow_fault() -> finish_fault() -> do_set_pte(), the latter of which calls page_add_new_anon_rmap() because this is a write fault to a !shared mapping. Note this is immediately followed by a do_protection_exception() that uses access = VM_WRITE and thus fails the above check and returns with VM_FAULT_BADACCESS. So I think this ultimately DTRT wrt to failing the syscall to userspace, but the do_dat_exception() handling sets up an unexpected situation for fork().. 3. fork() -> dup_mm() comes across this mapping with ->anon_vma set (so vma_needs_copy() returns true), but is_cow_mapping() returns false because VM_MAYWRITE is cleared. From there we fall down into the page table copying path described by the BUG splat. This problem doesn't occur on x86 seemingly because we don't call into handle_mm_fault() for a write fault to a !VM_WRITE mapping, which is specifically checked in access_error(). Therefore, something like the patch below[4] seems to prevent the problem on s390. However, the access checking logic looks wonky enough to me that I wonder whether it warrants a closer look from s390 experts. For example, does this code need to care about any flags/context beyond write or read faults vs. write or !write mappings? I don't have enough context to reason about it. Could somebody more familiar with these two s390 exception variants chime in? Finally, note that so far I've only really tested the patch against the reproducer. I'm happy to try and form it into a proper patch and test further after some feedback... thanks. Brian [1] https://lore.kernel.org/linux-s390/20220816155407.537372-1-bfoster@xxxxxxxxxx [2] BUG splat: kernel BUG at include/linux/mm.h:1529! monitor event: 0040 ilc:2 [#1] SMP Modules linked in: rfkill sunrpc ghash_s390 prng xts aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 vfio_ccw sha512_s390 mdev vfio_iommu_type1 vfio xfs libcrc32c virtio_blk virtio_net net_failover failover dm_mirror dm_region_hash dm_log dm_mod pkey zcrypt CPU: 1 PID: 1401 Comm: shmem-fork-test Not tainted 6.0.0-rc2+ #20 Hardware name: IBM 8561 LT1 400 (KVM/Linux) Krnl PSW : 0704c00180000000 0000000014928240 (copy_pte_range+0xa40/0xe58) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3 Krnl GPRS: 000003ff85b80000 000000000000000c 0000000000000000 000003ff85b80000 0000000091c5f31f 0000000087d70640 000000008160e800 00000372024717c0 000003ff85b80000 0000000000000000 00000000831c9c00 0000000091c5f31f 00000000823ada00 0000000087d70640 00000000149279c2 0000038000773880 Krnl Code: 0000000014928232: c0e5fffff48f brasl %r14,0000000014926b50 0000000014928238: a7f4fd43 brc 15,0000000014927cbe #000000001492823c: af000000 mc 0,0 >0000000014928240: b904005b lgr %r5,%r11 0000000014928244: a7f4ffde brc 15,0000000014928200 0000000014928248: e310f0e80004 lg %r1,232(%r15) 000000001492824e: a7f4ff17 brc 15,000000001492807c 0000000014928252: ec3800091c7c cgij %r3,28,8,0000000014928264 Call Trace: [<0000000014928240>] copy_pte_range+0xa40/0xe58 ([<00000000149279c2>] copy_pte_range+0x1c2/0xe58) [<000000001492e258>] copy_page_range+0x510/0x770 [<00000000146f3896>] dup_mmap+0x47e/0x6c0 [<00000000146f3b52>] dup_mm+0x7a/0x278 [<00000000146f5a48>] copy_process+0x1298/0x1a48 [<00000000146f62fe>] kernel_clone+0x5e/0x3c0 [<00000000146f6742>] __do_sys_clone+0x5a/0x68 [<00000000146f67e0>] __s390x_sys_clone+0x40/0x50 [<0000000014f68dac>] __do_syscall+0x1d4/0x200 [<0000000014f78c22>] system_call+0x82/0xb0 Last Breaking-Event-Address: [<0000000014927a46>] copy_pte_range+0x246/0xe58 Kernel panic - not syncing: Fatal exception: panic_on_oops [3] minimal reproducer: #include <unistd.h> #include <sys/shm.h> #include <sys/resource.h> int main() { int id; void *p; id = shmget(IPC_PRIVATE, 4096, IPC_CREAT); p = shmat(id, NULL, SHM_RDONLY); getrlimit(RLIMIT_AS, p); fork(); return 0; } [4] RFC patch: diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index 13449941516c..c12722da1558 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -418,6 +418,8 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access) fault = VM_FAULT_BADACCESS; if (unlikely(!(vma->vm_flags & access))) goto out_up; + if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_WRITE)) + goto out_up; if (is_vm_hugetlb_page(vma)) address &= HPAGE_MASK;