TL;DR: After actually sleeping :) Original clear all implementation only tried to merge, so the split bits of vma_modify() for it are no-ops (we iterate VMAs explicitly and try to clear the whole of each VMA). So we can simply go with a mix of Jann's and Pedro's suggestions and pass in a flag that says 'I'm fine with this OOM'ing on the attempt, if that happens just give up on the merge and let's reset the VMA in question. I will write a patch for this shortly. Thanks guys! On Fri, Mar 21, 2025 at 12:04:42AM +0100, Jann Horn wrote: > On Thu, Mar 20, 2025 at 9:53 PM Lorenzo Stoakes > <lorenzo.stoakes@xxxxxxxxxx> wrote: > > On Thu, Mar 20, 2025 at 09:11:33PM +0100, Jann Horn wrote: > > > On Thu, Mar 20, 2025 at 9:02 PM Pedro Falcato <pfalcato@xxxxxxx> wrote: > > > > On Thu, Mar 20, 2025 at 12:09:36PM -0700, syzbot wrote: > [...] > > > > Ahh, fun bug. This *seems* to be the bug: > > > > > > > > First, in vma_modify: > > > > > > > > merged = vma_merge_existing_range(vmg); > > > > if (merged) > > > > return merged; > > > > if (vmg_nomem(vmg)) > > > > return ERR_PTR(-ENOMEM); > > > > > > > > then, all the way up to userfaultfd_release_all (the return value propagates > > > > vma_modify -> vma_modify_flags_uffd -> userfaultfd_clear_vma): > [...] > > > > diff --git a/mm/vma.c b/mm/vma.c > > > > index 71ca012c616c..b2167b7dc27d 100644 > > > > --- a/mm/vma.c > > > > +++ b/mm/vma.c > > > > @@ -1517,8 +1517,16 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg) > > > > merged = vma_merge_existing_range(vmg); > > > > if (merged) > > > > return merged; > > > > - if (vmg_nomem(vmg)) > > > > + if (vmg_nomem(vmg)) { > > > > + /* If we can avoid failing the whole modification > > > > + * due to a merge OOM and validly keep going > > > > + * (we're modifying the whole VMA), return vma intact. > > > > + * It won't get merged, but such is life - we're avoiding > > > > + * OOM conditions in other parts of mm/ this way */ > > > > + if (start <= vma->vm_start && end >= vma->vm_end) > > > > + return vma; > > > > I do not like this solution at all, sorry. > > > > I mean I get what you're doing and it's smart to try to find a means out of > > this in general :) but let me explain my reasoning: > > > > For one this is uffd's fault, and the fix clearly needs to be there. > > I mean... this worked fine back in, for example, 5.4 - > userfaultfd_release() would loop over the VMAs, change some stuff in > some of them, and merge them where possible, and there was no way > anything could fail. It's the VMA subsystem that changed its API... OK you make a fair point, actually (after some sleep :P sorry was running on 3 hours yesterday). On reflection adding an option to the merge that says 'if we can't allocate, give up on merge and return NULL' is sensible, and we can add a parameter to vma_modify() saying the same thing, should split fail, which is another error case that must be considered here. So this code path is going vma-by-vma so will never split, we need not worry about this. The original implementation just tried to merge, I made it so we shared code instead of duplicating all over the place as we do the same operation only we might be within a VMA in userfaultfd_unregister(). So this is really simple actaully we just need a 'ok best effort try to merge, we don't want to know if OOM, if you go OOM there, give up'. But for aforementioned reasons about implicit assumptions this _has_ to be an explicit flag. Which I will add :) > > > But also, we _can't be sure_ vma is valid any more. The second it goes off > > to vma_merge_existing_range() it might be removed, which is why it's > > critical to only use 'merged'. > > > > Now you might be able to prove that _right now_ it'd be ok, if you do this > > check, because vma_complete() does the delete and only if either > > vma_iter_prealloc() or dup_anon_vma() fails would we return -ENOMEM and > > these happen _before_ vma_complete(), but that's an _implementation detail_ > > and now we've made an assumption that this is the case here. > > > > An implicit effectively precondition on something unexpressed like that is > > asking for trouble, really don't want to go that way. > > > > > > > > return ERR_PTR(-ENOMEM); > > > > + } > > > > > > > > Along the lines of your idea, perhaps we could add a parameter "bool > > > never_fail" to vma_modify() that is passed through to > > > vma_merge_existing_range(), and guarantee that it never fails when > > > that parameter is set? Then we could also check that never_fail is > > > only used in cases where no split is necessary. That somewhat avoids > > > having this kind of check that only ever runs in error conditions... > > > > Yeah hmmm, again this is _not where the problem is_. And we're doing it for > > _one case only_, who _must_ succeed. Right? > > It seems to me like it is... theoretically very reasonable of > userfaultfd to expect to have a "reliably change the flags of an > entire VMA" operation, and if the normal VMA code doesn't provide that > because of maple tree internals in the merging case, then it would be > reasonable for the VMA code to provide an alternative that does > provide this? Yes. > > > Buuuut then again, we could add a _feature_ (it'd be something in VMG not a > > bool, hey what are helper structs for right? :P) > > > > We coould add a special mode that says we __GFP_NOFAIL, we do that in > > vms_abort_munmap_vmas() and man that was under similar circumstances (hey > > remember that fun Liam? :) > > > > But at the same time, it feels icky, and we probably don't want to > > proliferate this pattern too much. > > > > So I'd kind of rather prefer a _general_ no-fail unmap that the uffd > > release code could invoke. > > > > Perhaps we could genericise the vms_abort_munmap_vmas(): > > > > mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL); > > > > And make that available or some form of it, to do the 'simplest' thing in > > this scenario. > > The userfaultfd release code doesn't want an "unmap" operation. It > just wants to remove the __VM_UFFD_FLAGS flags and set the > vma->vm_userfaultfd_ctx pointer to NULL. > The VMA code then sometimes sees an opportunity to merge with adjacent > VMAs; and this merging is what's failing. > So if we're willing to tolerate having adjacent VMAs that are > mergeable but aren't merged after an allocation failure, then instead > of using __GFP_NOFAIL for the merge, we could also just ignore merge > failures, at least when some "never fail" flag is set? Yes agreed. > > [...] > > Another possible solution is to add a flag that _explicitly asserts and > > documents_ that you require that no VMA be removed before attempting to > > allocate. > > > > Or we could make that an _explicit_ assumption? > > I don't think I understand this part. What VMA removal and allocation > are you talking about? > > > And then the uffd code itself could cache vma and take Pedro's solution on > > that basis? > > > > void userfaultfd_release_all(struct mm_struct *mm, > > struct userfaultfd_ctx *ctx) > > { > > ... > > > > for_each_vma(vmi, vma) { > > struct vm_area_struct *old = vma; > > > > ... > > > > vma = userfaultfd_clear_vma(&vmi, prev, vma, > > vma->vm_start, vma->vm_end); > > if (IS_ERR(vma)) { > > BUG_ON(vma != -ENOMEM); /* Sorry Linus! */ > > > > /* > > * OK we assert above that vma must remain intact if we fail to allocate, > > * We are in an extreme memory pressure state, we simply cannot clear this VMA. Move on. > > */ > > prev = old; > > } > > > > ... > > } > > } > > > > I mean it's going to be dirty whichever way round we do this. > > > > Thoughts guys? > > I guess my main thought on this is: I would prefer it if we keep any > code that runs only in near-impossible cases as simple as possible, > because issues in those codepaths will take longer to find. Agreed.