Re: [PATCH v2 2/2] mm/hugetlb: support write-faults in shared mappings

Gerald Schaefer <gerald.schaefer@xxxxxxxxxxxxx> · Tue, 16 Aug 2022 11:33:59 +0200

On Mon, 15 Aug 2022 14:43:16 -0700
Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:

> On 08/15/22 20:38, Gerald Schaefer wrote:
> > On Mon, 15 Aug 2022 20:03:20 +0200
> > David Hildenbrand <david@xxxxxxxxxx> wrote:
> > > On Mon, Aug 15, 2022 at 5:59 PM Gerald Schaefer
> > > <gerald.schaefer@xxxxxxxxxxxxx> wrote:
> > > > On Mon, 15 Aug 2022 17:07:32 +0200
> > > > David Hildenbrand <david@xxxxxxxxxx> wrote:
> > > > > On Mon, Aug 15, 2022 at 3:36 PM Gerald Schaefer
> > > > > <gerald.schaefer@xxxxxxxxxxxxx> wrote:
> > > > > > On Thu, 11 Aug 2022 11:59:09 -0700
> > > > > > Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
> > > > > >
> > > > Sure, forgot to send it with initial reply...
> > > >
> > > > [   82.574749] ------------[ cut here ]------------
> > > > [   82.574751] WARNING: CPU: 9 PID: 1674 at mm/hugetlb.c:5264 hugetlb_wp+0x3be/0x818
> > > > [   82.574759] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc uvdevice s390_trng vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
> > > > [   82.574785] CPU: 9 PID: 1674 Comm: linkhuge_rw Kdump: loaded Not tainted 5.19.0-next-20220815 #36
> > > > [   82.574787] Hardware name: IBM 3931 A01 704 (LPAR)
> > > > [   82.574788] Krnl PSW : 0704c00180000000 00000006c9d4bc6a (hugetlb_wp+0x3c2/0x818)
> > > > [   82.574791]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> > > > [   82.574794] Krnl GPRS: 000000000227c000 0000000008640071 0000000000000000 0000000001200000
> > > > [   82.574796]            0000000001200000 00000000b5a98090 0000000000000255 00000000adb2c898
> > > > [   82.574797]            0000000000000000 00000000adb2c898 0000000001200000 00000000b5a98090
> > > > [   82.574799]            000000008c408000 0000000092fd7300 000003800339bc10 000003800339baf8
> > > > [   82.574803] Krnl Code: 00000006c9d4bc5c: f160000407fe        mvo     4(7,%r0),2046(1,%r0)
> > > >            00000006c9d4bc62: 47000700           bc      0,1792
> > > >           #00000006c9d4bc66: af000000           mc      0,0
> > > >           >00000006c9d4bc6a: a7a80040           lhi     %r10,64
> > > >            00000006c9d4bc6e: b916002a           llgfr   %r2,%r10
> > > >            00000006c9d4bc72: eb6ff1600004       lmg     %r6,%r15,352(%r15)
> > > >            00000006c9d4bc78: 07fe               bcr     15,%r14
> > > >            00000006c9d4bc7a: 47000700           bc      0,1792
> > > > [   82.574814] Call Trace:
> > > > [   82.574842]  [<00000006c9d4bc6a>] hugetlb_wp+0x3c2/0x818
> > > > [   82.574846]  [<00000006c9d4c62e>] hugetlb_no_page+0x56e/0x5a8
> > > > [   82.574848]  [<00000006c9d4cac2>] hugetlb_fault+0x45a/0x590
> > > > [   82.574850]  [<00000006c9d06d4a>] handle_mm_fault+0x182/0x220
> > > > [   82.574855]  [<00000006c9a9d70e>] do_exception+0x19e/0x470
> > > > [   82.574858]  [<00000006c9a9dff2>] do_dat_exception+0x2a/0x50
> > > > [   82.574861]  [<00000006ca668a18>] __do_pgm_check+0xf0/0x1b0
> > > > [   82.574866]  [<00000006ca677b3c>] pgm_check_handler+0x11c/0x170
> > > > [   82.574870] Last Breaking-Event-Address:
> > > > [   82.574871]  [<00000006c9d4b926>] hugetlb_wp+0x7e/0x818
> > > > [   82.574873] Kernel panic - not syncing: panic_on_warn set ...
> > > > [   82.574875] CPU: 9 PID: 1674 Comm: linkhuge_rw Kdump: loaded Not tainted 5.19.0-next-20220815 #36
> > > > [   82.574877] Hardware name: IBM 3931 A01 704 (LPAR)
> > > > [   82.574878] Call Trace:
> > > > [   82.574879]  [<00000006ca664f22>] dump_stack_lvl+0x62/0x80
> > > > [   82.574881]  [<00000006ca657af8>] panic+0x118/0x300
> > > > [   82.574884]  [<00000006c9ac3da6>] __warn+0xb6/0x160
> > > > [   82.574887]  [<00000006ca29b1ea>] report_bug+0xba/0x140
> > > > [   82.574890]  [<00000006c9a75194>] monitor_event_exception+0x44/0x80
> > > > [   82.574892]  [<00000006ca668a18>] __do_pgm_check+0xf0/0x1b0
> > > > [   82.574894]  [<00000006ca677b3c>] pgm_check_handler+0x11c/0x170
> > > > [   82.574897]  [<00000006c9d4bc6a>] hugetlb_wp+0x3c2/0x818
> > > > [   82.574899]  [<00000006c9d4c62e>] hugetlb_no_page+0x56e/0x5a8
> > > > [   82.574901]  [<00000006c9d4cac2>] hugetlb_fault+0x45a/0x590
> > > > [   82.574903]  [<00000006c9d06d4a>] handle_mm_fault+0x182/0x220
> > > > [   82.574906]  [<00000006c9a9d70e>] do_exception+0x19e/0x470
> > > > [   82.574907]  [<00000006c9a9dff2>] do_dat_exception+0x2a/0x50
> > > > [   82.574909]  [<00000006ca668a18>] __do_pgm_check+0xf0/0x1b0
> > > > [   82.574912]  [<00000006ca677b3c>] pgm_check_handler+0x11c/0x170
> > > 
> > > 
> > > do_dat_exception() sets
> > >   access = VM_ACCESS_FLAGS;
> > > 
> > > do_exception() sets
> > >   is_write = (trans_exc_code & store_indication) == 0x400;
> > > 
> > > and FAULT_FLAG_WRITE
> > >    if (access == VM_WRITE || is_write)
> > >           flags |= FAULT_FLAG_WRITE;
> > > 
> > > however, for VMA permission checks it only checks
> > >   if (unlikely(!(vma->vm_flags & access)))
> > >           goto out_up;
> > > 
> > > as VM_ACCESS_FLAGS includes VM_WRITE | VM_READ ...
> > > 
> > > We end up triggering a write fault (FAULT_FLAG_WRITE), even though the
> > > VMA does not allow for writes.
> > > 
> > > I assume that's what happens and that it's a bug in s390x code.
> > > 
> > 
> > Hmm, that looks weird, but that doesn't mean it has to be broken.
> > We are talking about a pte_none() fault, not a protection exception
> > (do_dat_exception vs. do_protection_exception). Not sure if we get
> > any proper store indication in that case, but yes, this looks weird,
> > will have a closer look. Thanks for pointing out!
> > 
> > FWIW, meanwhile, I added a check to hugetlb_wp() in v5.19, for
> > (!unshare && !(vma->vm_flags & VM_WRITE)). This did not trigger,
> > however, it did trigger already before your commit. So something
> > already changed before your commit, and after v5.19.
> > 
> > Further bisecting showed that the check started to trigger
> > after commit bcd51a3c679d ("hugetlb: lazy page table copies in fork()"),
> > and after that the "HUGETLB_ELFMAP=R linkhuge_rw" testcase also
> > started segfaulting (not sure why we did not notice earlier...).
> > 
> > Anyway, I guess this means that your commit only made that change
> > in behavior more obvious, by adding the WARN_ON_ONCE, but it really
> > was introduced by that other commit.
> > 
> > Not sure if this gives any more insight to anyone, still confused
> > by your comments on do_exception(), which also sound like a possible
> > root cause for ending up in hugetlb_wp() w/o VM_WRITE (but why only
> > after commit bcd51a3c679d?).
> 
> I know it doesn't mean much, but I did not/do not see these issues on x86.

Thanks, we were also trying to reproduce on x86, w/o success so far. But
I guess that matches David latest observations wrt to our exception handling
code on s390.

Good news is that the problem goes away when I add this simple patch, which
should result in proper VM_WRITE check for vma flags, before triggering a
FAULT_FLAG_WRITE fault:

--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -379,7 +379,9 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
        flags = FAULT_FLAG_DEFAULT;
        if (user_mode(regs))
                flags |= FAULT_FLAG_USER;
-       if (access == VM_WRITE || is_write)
+       if (is_write)
+               access = VM_WRITE;
+       if (access == VM_WRITE)
                flags |= FAULT_FLAG_WRITE;
        mmap_read_lock(mm);
 
Still find it a bit hard to believe that this > 10 years old logic really
is/was broken all the time. I guess it simply did not matter for normal
PTE faults, probably because the common fault handling code later would
check itself via maybe_mkwrite(). And for hugetlb PTEs, it might not have
mattered before commit bcd51a3c679d.

> 
> bcd51a3c679d eliminates the copying of page tables at fork for non-anon
> hugetlb vmas.  So, in these tests you would likely see more pte_none()
> faults.

Yes, makes sense, assuming now that it actually is related to s390
exception handling code, not checking for VM_WRITE before triggering a
write fault for pte_none().

Thanks for checking! And Thanks a lot to David for finding that issue
in s390 exception handling code!