Re: [syzbot] [mm?] BUG: unable to handle kernel paging request in vma_merge_existing_range

Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> · Fri, 21 Mar 2025 09:10:45 +0000

TL;DR: After actually sleeping :)

Original clear all implementation only tried to merge, so the split bits of
vma_modify() for it are no-ops (we iterate VMAs explicitly and try to clear the
whole of each VMA).

So we can simply go with a mix of Jann's and Pedro's suggestions and pass in a
flag that says 'I'm fine with this OOM'ing on the attempt, if that happens just
give up on the merge and let's reset the VMA in question.

I will write a patch for this shortly.

Thanks guys!

On Fri, Mar 21, 2025 at 12:04:42AM +0100, Jann Horn wrote:
> On Thu, Mar 20, 2025 at 9:53 PM Lorenzo Stoakes
> <lorenzo.stoakes@xxxxxxxxxx> wrote:
> > On Thu, Mar 20, 2025 at 09:11:33PM +0100, Jann Horn wrote:
> > > On Thu, Mar 20, 2025 at 9:02 PM Pedro Falcato <pfalcato@xxxxxxx> wrote:
> > > > On Thu, Mar 20, 2025 at 12:09:36PM -0700, syzbot wrote:
> [...]
> > > > Ahh, fun bug. This *seems* to be the bug:
> > > >
> > > > First, in vma_modify:
> > > >
> > > >         merged = vma_merge_existing_range(vmg);
> > > >         if (merged)
> > > >                 return merged;
> > > >         if (vmg_nomem(vmg))
> > > >                 return ERR_PTR(-ENOMEM);
> > > >
> > > > then, all the way up to userfaultfd_release_all (the return value propagates
> > > > vma_modify -> vma_modify_flags_uffd -> userfaultfd_clear_vma):
> [...]
> > > > diff --git a/mm/vma.c b/mm/vma.c
> > > > index 71ca012c616c..b2167b7dc27d 100644
> > > > --- a/mm/vma.c
> > > > +++ b/mm/vma.c
> > > > @@ -1517,8 +1517,16 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg)
> > > >         merged = vma_merge_existing_range(vmg);
> > > >         if (merged)
> > > >                 return merged;
> > > > -       if (vmg_nomem(vmg))
> > > > +       if (vmg_nomem(vmg)) {
> > > > +               /* If we can avoid failing the whole modification
> > > > +                * due to a merge OOM and validly keep going
> > > > +                * (we're modifying the whole VMA), return vma intact.
> > > > +                * It won't get merged, but such is life - we're avoiding
> > > > +                * OOM conditions in other parts of mm/ this way */
> > > > +               if (start <= vma->vm_start && end >= vma->vm_end)
> > > > +                       return vma;
> >
> > I do not like this solution at all, sorry.
> >
> > I mean I get what you're doing and it's smart to try to find a means out of
> > this in general :) but let me explain my reasoning:
> >
> > For one this is uffd's fault, and the fix clearly needs to be there.
>
> I mean... this worked fine back in, for example, 5.4 -
> userfaultfd_release() would loop over the VMAs, change some stuff in
> some of them, and merge them where possible, and there was no way
> anything could fail. It's the VMA subsystem that changed its API...

OK you make a fair point, actually (after some sleep :P sorry was running on 3
hours yesterday).

On reflection adding an option to the merge that says 'if we can't allocate,
give up on merge and return NULL' is sensible, and we can add a parameter to
vma_modify() saying the same thing, should split fail, which is another error
case that must be considered here.

So this code path is going vma-by-vma so will never split, we need not worry
about this.

The original implementation just tried to merge, I made it so we shared code
instead of duplicating all over the place as we do the same operation only we
might be within a VMA in userfaultfd_unregister().

So this is really simple actaully we just need a 'ok best effort try to merge,
we don't want to know if OOM, if you go OOM there, give up'.

But for aforementioned reasons about implicit assumptions this _has_ to be an
explicit flag.

Which I will add :)

>
> > But also, we _can't be sure_ vma is valid any more. The second it goes off
> > to vma_merge_existing_range() it might be removed, which is why it's
> > critical to only use 'merged'.
> >
> > Now you might be able to prove that _right now_ it'd be ok, if you do this
> > check, because vma_complete() does the delete and only if either
> > vma_iter_prealloc() or dup_anon_vma() fails would we return -ENOMEM and
> > these happen _before_ vma_complete(), but that's an _implementation detail_
> > and now we've made an assumption that this is the case here.
> >
> > An implicit effectively precondition on something unexpressed like that is
> > asking for trouble, really don't want to go that way.
> >
> >
> > > >                 return ERR_PTR(-ENOMEM);
> > > > +       }
> >
> > >
> > > Along the lines of your idea, perhaps we could add a parameter "bool
> > > never_fail" to vma_modify() that is passed through to
> > > vma_merge_existing_range(), and guarantee that it never fails when
> > > that parameter is set? Then we could also check that never_fail is
> > > only used in cases where no split is necessary. That somewhat avoids
> > > having this kind of check that only ever runs in error conditions...
> >
> > Yeah hmmm, again this is _not where the problem is_. And we're doing it for
> > _one case only_, who _must_ succeed. Right?
>
> It seems to me like it is... theoretically very reasonable of
> userfaultfd to expect to have a "reliably change the flags of an
> entire VMA" operation, and if the normal VMA code doesn't provide that
> because of maple tree internals in the merging case, then it would be
> reasonable for the VMA code to provide an alternative that does
> provide this?

Yes.

>
> > Buuuut then again, we could add a _feature_ (it'd be something in VMG not a
> > bool, hey what are helper structs for right? :P)
> >
> > We coould add a special mode that says we __GFP_NOFAIL, we do that in
> > vms_abort_munmap_vmas() and man that was under similar circumstances (hey
> > remember that fun Liam? :)
> >
> > But at the same time, it feels icky, and we probably don't want to
> > proliferate this pattern too much.
> >
> > So I'd kind of rather prefer a _general_ no-fail unmap that the uffd
> > release code could invoke.
> >
> > Perhaps we could genericise the vms_abort_munmap_vmas():
> >
> >         mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL);
> >
> > And make that available or some form of it, to do the 'simplest' thing in
> > this scenario.
>
> The userfaultfd release code doesn't want an "unmap" operation. It
> just wants to remove the __VM_UFFD_FLAGS flags and set the
> vma->vm_userfaultfd_ctx pointer to NULL.
> The VMA code then sometimes sees an opportunity to merge with adjacent
> VMAs; and this merging is what's failing.
> So if we're willing to tolerate having adjacent VMAs that are
> mergeable but aren't merged after an allocation failure, then instead
> of using __GFP_NOFAIL for the merge, we could also just ignore merge
> failures, at least when some "never fail" flag is set?

Yes agreed.

>
> [...]
> > Another possible solution is to add a flag that _explicitly asserts and
> > documents_ that you require that no VMA be removed before attempting to
> > allocate.
> >
> > Or we could make that an _explicit_ assumption?
>
> I don't think I understand this part. What VMA removal and allocation
> are you talking about?
>
> > And then the uffd code itself could cache vma and take Pedro's solution on
> > that basis?
> >
> > void userfaultfd_release_all(struct mm_struct *mm,
> >                              struct userfaultfd_ctx *ctx)
> > {
> >         ...
> >
> >                 for_each_vma(vmi, vma) {
> >                         struct vm_area_struct *old = vma;
> >
> >                         ...
> >
> >                         vma = userfaultfd_clear_vma(&vmi, prev, vma,
> >                                     vma->vm_start, vma->vm_end);
> >                         if (IS_ERR(vma)) {
> >                                 BUG_ON(vma != -ENOMEM); /* Sorry Linus! */
> >
> >                                 /*
> >                                 * OK we assert above that vma must remain intact if we fail to allocate,
> >                                 * We are in an extreme memory pressure state, we simply cannot clear this VMA. Move on.
> >                                 */
> >                                 prev = old;
> >                         }
> >
> >                         ...
> >                 }
> > }
> >
> > I mean it's going to be dirty whichever way round we do this.
> >
> > Thoughts guys?
>
> I guess my main thought on this is: I would prefer it if we keep any
> code that runs only in near-impossible cases as simple as possible,
> because issues in those codepaths will take longer to find.

Agreed.