Re: [syzbot] [mm?] BUG: unable to handle kernel paging request in vma_merge_existing_range

Jann Horn <jannh@xxxxxxxxxx> · Fri, 21 Mar 2025 00:04:42 +0100

On Thu, Mar 20, 2025 at 9:53 PM Lorenzo Stoakes
<lorenzo.stoakes@xxxxxxxxxx> wrote:
> On Thu, Mar 20, 2025 at 09:11:33PM +0100, Jann Horn wrote:
> > On Thu, Mar 20, 2025 at 9:02 PM Pedro Falcato <pfalcato@xxxxxxx> wrote:
> > > On Thu, Mar 20, 2025 at 12:09:36PM -0700, syzbot wrote:
[...]
> > > Ahh, fun bug. This *seems* to be the bug:
> > >
> > > First, in vma_modify:
> > >
> > >         merged = vma_merge_existing_range(vmg);
> > >         if (merged)
> > >                 return merged;
> > >         if (vmg_nomem(vmg))
> > >                 return ERR_PTR(-ENOMEM);
> > >
> > > then, all the way up to userfaultfd_release_all (the return value propagates
> > > vma_modify -> vma_modify_flags_uffd -> userfaultfd_clear_vma):
[...]
> > > diff --git a/mm/vma.c b/mm/vma.c
> > > index 71ca012c616c..b2167b7dc27d 100644
> > > --- a/mm/vma.c
> > > +++ b/mm/vma.c
> > > @@ -1517,8 +1517,16 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg)
> > >         merged = vma_merge_existing_range(vmg);
> > >         if (merged)
> > >                 return merged;
> > > -       if (vmg_nomem(vmg))
> > > +       if (vmg_nomem(vmg)) {
> > > +               /* If we can avoid failing the whole modification
> > > +                * due to a merge OOM and validly keep going
> > > +                * (we're modifying the whole VMA), return vma intact.
> > > +                * It won't get merged, but such is life - we're avoiding
> > > +                * OOM conditions in other parts of mm/ this way */
> > > +               if (start <= vma->vm_start && end >= vma->vm_end)
> > > +                       return vma;
>
> I do not like this solution at all, sorry.
>
> I mean I get what you're doing and it's smart to try to find a means out of
> this in general :) but let me explain my reasoning:
>
> For one this is uffd's fault, and the fix clearly needs to be there.

I mean... this worked fine back in, for example, 5.4 -
userfaultfd_release() would loop over the VMAs, change some stuff in
some of them, and merge them where possible, and there was no way
anything could fail. It's the VMA subsystem that changed its API...

> But also, we _can't be sure_ vma is valid any more. The second it goes off
> to vma_merge_existing_range() it might be removed, which is why it's
> critical to only use 'merged'.
>
> Now you might be able to prove that _right now_ it'd be ok, if you do this
> check, because vma_complete() does the delete and only if either
> vma_iter_prealloc() or dup_anon_vma() fails would we return -ENOMEM and
> these happen _before_ vma_complete(), but that's an _implementation detail_
> and now we've made an assumption that this is the case here.
>
> An implicit effectively precondition on something unexpressed like that is
> asking for trouble, really don't want to go that way.
>
>
> > >                 return ERR_PTR(-ENOMEM);
> > > +       }
>
> >
> > Along the lines of your idea, perhaps we could add a parameter "bool
> > never_fail" to vma_modify() that is passed through to
> > vma_merge_existing_range(), and guarantee that it never fails when
> > that parameter is set? Then we could also check that never_fail is
> > only used in cases where no split is necessary. That somewhat avoids
> > having this kind of check that only ever runs in error conditions...
>
> Yeah hmmm, again this is _not where the problem is_. And we're doing it for
> _one case only_, who _must_ succeed. Right?

It seems to me like it is... theoretically very reasonable of
userfaultfd to expect to have a "reliably change the flags of an
entire VMA" operation, and if the normal VMA code doesn't provide that
because of maple tree internals in the merging case, then it would be
reasonable for the VMA code to provide an alternative that does
provide this?

> Buuuut then again, we could add a _feature_ (it'd be something in VMG not a
> bool, hey what are helper structs for right? :P)
>
> We coould add a special mode that says we __GFP_NOFAIL, we do that in
> vms_abort_munmap_vmas() and man that was under similar circumstances (hey
> remember that fun Liam? :)
>
> But at the same time, it feels icky, and we probably don't want to
> proliferate this pattern too much.
>
> So I'd kind of rather prefer a _general_ no-fail unmap that the uffd
> release code could invoke.
>
> Perhaps we could genericise the vms_abort_munmap_vmas():
>
>         mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL);
>
> And make that available or some form of it, to do the 'simplest' thing in
> this scenario.

The userfaultfd release code doesn't want an "unmap" operation. It
just wants to remove the __VM_UFFD_FLAGS flags and set the
vma->vm_userfaultfd_ctx pointer to NULL.
The VMA code then sometimes sees an opportunity to merge with adjacent
VMAs; and this merging is what's failing.
So if we're willing to tolerate having adjacent VMAs that are
mergeable but aren't merged after an allocation failure, then instead
of using __GFP_NOFAIL for the merge, we could also just ignore merge
failures, at least when some "never fail" flag is set?

[...]
> Another possible solution is to add a flag that _explicitly asserts and
> documents_ that you require that no VMA be removed before attempting to
> allocate.
>
> Or we could make that an _explicit_ assumption?

I don't think I understand this part. What VMA removal and allocation
are you talking about?

> And then the uffd code itself could cache vma and take Pedro's solution on
> that basis?
>
> void userfaultfd_release_all(struct mm_struct *mm,
>                              struct userfaultfd_ctx *ctx)
> {
>         ...
>
>                 for_each_vma(vmi, vma) {
>                         struct vm_area_struct *old = vma;
>
>                         ...
>
>                         vma = userfaultfd_clear_vma(&vmi, prev, vma,
>                                     vma->vm_start, vma->vm_end);
>                         if (IS_ERR(vma)) {
>                                 BUG_ON(vma != -ENOMEM); /* Sorry Linus! */
>
>                                 /*
>                                 * OK we assert above that vma must remain intact if we fail to allocate,
>                                 * We are in an extreme memory pressure state, we simply cannot clear this VMA. Move on.
>                                 */
>                                 prev = old;
>                         }
>
>                         ...
>                 }
> }
>
> I mean it's going to be dirty whichever way round we do this.
>
> Thoughts guys?

I guess my main thought on this is: I would prefer it if we keep any
code that runs only in near-impossible cases as simple as possible,
because issues in those codepaths will take longer to find.