Re: [syzbot] [mm?] BUG: unable to handle kernel paging request in vma_merge_existing_range

Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> · Thu, 20 Mar 2025 20:53:13 +0000

On this - urgency is less as vmg_nomem() really can only occur with fault
injection afaict as it is a 'too small to fail' scenario, so we won't see
this IRL.

So we can relax a bit given LSF timing etc, and do something for 6.15rc1
I'd say :) then we can backport as needed obviously.

Also if this did somehow happen under such extreme memory pressure the
process would soon be torn down along with the system...

Sorry Jann, no security flaw here I would say :P

On Thu, Mar 20, 2025 at 09:11:33PM +0100, Jann Horn wrote:
> On Thu, Mar 20, 2025 at 9:02 PM Pedro Falcato <pfalcato@xxxxxxx> wrote:
> > On Thu, Mar 20, 2025 at 12:09:36PM -0700, syzbot wrote:
> > > Hello,
> > >
> > > syzbot found the following issue on:
> > >
> > > HEAD commit:    eb88e6bfbc0a Merge tag 'fsnotify_for_v6.14-rc7' of git://g..
> > > git tree:       upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=11e6c83f980000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=77423669c2b8fa9
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=20ed41006cf9d842c2b5
> > > compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> > > userspace arch: i386
> > >
> > > Unfortunately, I don't have any reproducer for this issue yet.
> > >
> > > Downloadable assets:
> > > disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7feb34a89c2a/non_bootable_disk-eb88e6bf.raw.xz
> > > vmlinux: https://storage.googleapis.com/syzbot-assets/ded0ce69669f/vmlinux-eb88e6bf.xz
> > > kernel image: https://storage.googleapis.com/syzbot-assets/6e6fa3c719e7/bzImage-eb88e6bf.xz
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > Reported-by: syzbot+20ed41006cf9d842c2b5@xxxxxxxxxxxxxxxxxxxxxxxxx
> > >
> > > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> > > R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
> > > R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > >  </TASK>
> > > BUG: unable to handle page fault for address: fffffffffffffff4
> > > #PF: supervisor read access in kernel mode
> > > #PF: error_code(0x0000) - not-present page
> > > PGD df84067 P4D df84067 PUD df86067 PMD 0
> > > Oops: Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI
> > > CPU: 1 UID: 0 PID: 17805 Comm: syz.8.3237 Not tainted 6.14.0-rc6-syzkaller-00212-geb88e6bfbc0a #0
> > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> > > RIP: 0010:vma_merge_existing_range+0x266/0x2070 mm/vma.c:734
> > > Code: e8 5f 25 ad ff 48 8b 14 24 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 1c 19 00 00 48 8b 04 24 48 8b 74 24 08 <4c> 8b 38 4c 89 ff e8 9f 1f ad ff 48 8b 44 24 08 49 39 c7 0f 83 db
> > > RSP: 0000:ffffc9000319f988 EFLAGS: 00010246
> > > RAX: fffffffffffffff4 RBX: ffffc9000319fae8 RCX: ffffffff820cd3e5
> > > RDX: 1ffffffffffffffe RSI: 0000000080c2a000 RDI: 0000000000000005
> > > RBP: 0000000080ce2000 R08: 0000000000000005 R09: 0000000000000000
> > > R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
> > > R13: ffffc9000319fb08 R14: ffff888025eddc98 R15: ffff88804eec0a00
> > > FS:  0000000000000000(0000) GS:ffff88802b500000(0063) knlGS:00000000f5106b40
> > > CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> > > CR2: fffffffffffffff4 CR3: 00000000614d6000 CR4: 0000000000352ef0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > Call Trace:
> > >  <TASK>
> > >  vma_modify.constprop.0+0x87/0x410 mm/vma.c:1517
> > >  vma_modify_flags_uffd+0x241/0x2e0 mm/vma.c:1598
> > >  userfaultfd_clear_vma+0x91/0x130 mm/userfaultfd.c:1906
> > >  userfaultfd_release_all+0x2ae/0x4c0 mm/userfaultfd.c:2024
> > >  userfaultfd_release+0xf4/0x1c0 fs/userfaultfd.c:865
> > >  __fput+0x3ff/0xb70 fs/file_table.c:464
> > >  task_work_run+0x14e/0x250 kernel/task_work.c:227
> > >  resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
> > >  exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
> > >  exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
> > >  __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
> > >  syscall_exit_to_user_mode+0x27b/0x2a0 kernel/entry/common.c:218
> > >  __do_fast_syscall_32+0x80/0x120 arch/x86/entry/common.c:390
> > >  do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:412
> > >  entry_SYSENTER_compat_after_hwframe+0x84/0x8e
> > > RIP: 0023:0xf7fe6579
> > > Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
> > > RSP: 002b:00000000f510655c EFLAGS: 00000296 ORIG_RAX: 0000000000000135
> > > RAX: 0000000000000001 RBX: 0000000080000180 RCX: 0000000000000001
> > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> > > R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
> > > R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > >  </TASK>
> > > Modules linked in:
> > > CR2: fffffffffffffff4
> > > ---[ end trace 0000000000000000 ]---
> > > RIP: 0010:vma_merge_existing_range+0x266/0x2070 mm/vma.c:734
> > > Code: e8 5f 25 ad ff 48 8b 14 24 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 1c 19 00 00 48 8b 04 24 48 8b 74 24 08 <4c> 8b 38 4c 89 ff e8 9f 1f ad ff 48 8b 44 24 08 49 39 c7 0f 83 db
> > > RSP: 0000:ffffc9000319f988 EFLAGS: 00010246
> > > RAX: fffffffffffffff4 RBX: ffffc9000319fae8 RCX: ffffffff820cd3e5
> > > RDX: 1ffffffffffffffe RSI: 0000000080c2a000 RDI: 0000000000000005
> > > RBP: 0000000080ce2000 R08: 0000000000000005 R09: 0000000000000000
> > > R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
> > > R13: ffffc9000319fb08 R14: ffff888025eddc98 R15: ffff88804eec0a00
> > > FS:  0000000000000000(0000) GS:ffff88802b500000(0063) knlGS:00000000f5106b40
> > > CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> > > CR2: fffffffffffffff4 CR3: 00000000614d6000 CR4: 0000000000352ef0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > ----------------
> > > Code disassembly (best guess):
> > >    0: e8 5f 25 ad ff          call   0xffad2564
> > >    5: 48 8b 14 24             mov    (%rsp),%rdx
> > >    9: 48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
> > >   10: fc ff df
> > >   13: 48 c1 ea 03             shr    $0x3,%rdx
> > >   17: 80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)
> > >   1b: 0f 85 1c 19 00 00       jne    0x193d
> > >   21: 48 8b 04 24             mov    (%rsp),%rax
> > >   25: 48 8b 74 24 08          mov    0x8(%rsp),%rsi
> > > * 2a: 4c 8b 38                mov    (%rax),%r15 <-- trapping instruction
> > >   2d: 4c 89 ff                mov    %r15,%rdi
> > >   30: e8 9f 1f ad ff          call   0xffad1fd4
> > >   35: 48 8b 44 24 08          mov    0x8(%rsp),%rax
> > >   3a: 49 39 c7                cmp    %rax,%r15
> > >   3d: 0f                      .byte 0xf
> > >   3e: 83                      .byte 0x83
> > >   3f: db                      .byte 0xdb
> >
> > Ahh, fun bug. This *seems* to be the bug:
> >
> > First, in vma_modify:
> >
> >         merged = vma_merge_existing_range(vmg);
> >         if (merged)
> >                 return merged;
> >         if (vmg_nomem(vmg))
> >                 return ERR_PTR(-ENOMEM);
> >
> > then, all the way up to userfaultfd_release_all (the return value propagates
> > vma_modify -> vma_modify_flags_uffd -> userfaultfd_clear_vma):

Oh lord... thanks for your analysis Pedro! But also, oh lord.

Yeah this 'pointer actually is an error value' thing is a thing I've seen
(and *ahem* caused) before.

It's funny because userfaultfd_clear_vma() -already has handling_ for errors...:

	if (!IS_ERR(ret))
		userfaultfd_reset_ctx(ret);

And yet...

> >
> >         prev = NULL;
> >         for_each_vma(vmi, vma) {
> >                 cond_resched();
> >                 BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
> >                        !!(vma->vm_flags & __VM_UFFD_FLAGS));
> >                 if (vma->vm_userfaultfd_ctx.ctx != ctx) {
> >                         prev = vma;
> >                         continue;
> >                 }
> >
> >                 vma = userfaultfd_clear_vma(&vmi, prev, vma,
> >                                             vma->vm_start, vma->vm_end);
> >                 prev = vma;
> >         }
> >
> > So, if uffd gets an IS_ERR(vma), it keeps going and takes that vma as the prev value,
> > which leads to that ERR_PTR(-ENOMEM) deref crash (-12 = -ENOMEM = 0xffffff4).
> > This situation is kind of awkward because ->release() errors don't mean a thing.
> > So, I have another idea (pasting for syzbot) which might just be cromulent.
> > Untested, but thoughts?

Yeah that seems to match the values, I did wonder about this 'underflowed
u64' value and whether it was an error one and yeah.. sigh.

> >
> > #syz test

> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index d06453fa8aba..fb835d82eb84 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -2023,6 +2023,8 @@ void userfaultfd_release_all(struct mm_struct *mm,
> >
> >                 vma = userfaultfd_clear_vma(&vmi, prev, vma,
> >                                             vma->vm_start, vma->vm_end);
> > +               if (WARN_ON(IS_ERR(vma)))
> > +                       break;
>
> If this WARN_ON() was ever actually hit, I think we'd leave dangling
> pointers in VMAs? As much as Linus hates BUG_ON(), I personally think
> that would be a situation warranting BUG_ON(), or at least
> CHECK_DATA_CORRUPTION(). That said:

Yeah indeed, agreed.

>
> >                 prev = vma;
> >         }
> >         mmap_write_unlock(mm);
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 71ca012c616c..b2167b7dc27d 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -1517,8 +1517,16 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg)
> >         merged = vma_merge_existing_range(vmg);
> >         if (merged)
> >                 return merged;
> > -       if (vmg_nomem(vmg))
> > +       if (vmg_nomem(vmg)) {
> > +               /* If we can avoid failing the whole modification
> > +                * due to a merge OOM and validly keep going
> > +                * (we're modifying the whole VMA), return vma intact.
> > +                * It won't get merged, but such is life - we're avoiding
> > +                * OOM conditions in other parts of mm/ this way */
> > +               if (start <= vma->vm_start && end >= vma->vm_end)
> > +                       return vma;

I do not like this solution at all, sorry.

I mean I get what you're doing and it's smart to try to find a means out of
this in general :) but let me explain my reasoning:

For one this is uffd's fault, and the fix clearly needs to be there.

But also, we _can't be sure_ vma is valid any more. The second it goes off
to vma_merge_existing_range() it might be removed, which is why it's
critical to only use 'merged'.

Now you might be able to prove that _right now_ it'd be ok, if you do this
check, because vma_complete() does the delete and only if either
vma_iter_prealloc() or dup_anon_vma() fails would we return -ENOMEM and
these happen _before_ vma_complete(), but that's an _implementation detail_
and now we've made an assumption that this is the case here.

An implicit effectively precondition on something unexpressed like that is
asking for trouble, really don't want to go that way.

> >                 return ERR_PTR(-ENOMEM);
> > +       }

>
> Along the lines of your idea, perhaps we could add a parameter "bool
> never_fail" to vma_modify() that is passed through to
> vma_merge_existing_range(), and guarantee that it never fails when
> that parameter is set? Then we could also check that never_fail is
> only used in cases where no split is necessary. That somewhat avoids
> having this kind of check that only ever runs in error conditions...

Yeah hmmm, again this is _not where the problem is_. And we're doing it for
_one case only_, who _must_ succeed. Right?

Buuuut then again, we could add a _feature_ (it'd be something in VMG not a
bool, hey what are helper structs for right? :P)

We coould add a special mode that says we __GFP_NOFAIL, we do that in
vms_abort_munmap_vmas() and man that was under similar circumstances (hey
remember that fun Liam? :)

But at the same time, it feels icky, and we probably don't want to
proliferate this pattern too much.

So I'd kind of rather prefer a _general_ no-fail unmap that the uffd
release code could invoke.

Perhaps we could genericise the vms_abort_munmap_vmas():

	mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL);

And make that available or some form of it, to do the 'simplest' thing in
this scenario.

If we called that say vma_emergency_clear() then we could do something
like:

void userfaultfd_release_all(struct mm_struct *mm,
			     struct userfaultfd_ctx *ctx)
{
	...

	for_each_vma(vmi, vma) {
		unsigned long start = vma->vm_start;
		unsigned long end = vma->vm_end;

		...

		vma = userfaultfd_clear_vma(&vmi, prev, vma,
					    vma->vm_start, vma->vm_end);
		if (IS_ERR(vma)) {
			/*
			* We can no longer rely on VMA state, we must
			* unconditionally leave a gap.
			*/
			vma_emergency_clear(mm, start, end);
			vma_iter_reset(&vmi); /* Probably? */
			prev = NULL; /* Probably? */
			continue;
		}
	}
}

I mean this isn't fun either. I wonder if we (that probably mean sme)
should go audit cases like this.

Another possible solution is to add a flag that _explicitly asserts and
documents_ that you require that no VMA be removed before attempting to
allocate.

Or we could make that an _explicit_ assumption?

And then the uffd code itself could cache vma and take Pedro's solution on
that basis?

void userfaultfd_release_all(struct mm_struct *mm,
			     struct userfaultfd_ctx *ctx)
{
	...

		for_each_vma(vmi, vma) {
			struct vm_area_struct *old = vma;

			...

			vma = userfaultfd_clear_vma(&vmi, prev, vma,
				    vma->vm_start, vma->vm_end);
			if (IS_ERR(vma)) {
				BUG_ON(vma != -ENOMEM); /* Sorry Linus! */

				/*
				* OK we assert above that vma must remain intact if we fail to allocate,
				* We are in an extreme memory pressure state, we simply cannot clear this VMA. Move on.
				*/
				prev = old;
			}

			...
		}
}

I mean it's going to be dirty whichever way round we do this.

Thoughts guys?

But key point being - this is not quite as urgent as it might seem :)