* Suren Baghdasaryan <surenb@xxxxxxxxxx> [220310 18:31]:
> On Thu, Mar 10, 2022 at 2:22 PM Liam Howlett <liam.howlett@xxxxxxxxxx> wrote:
> >
> > * Suren Baghdasaryan <surenb@xxxxxxxxxx> [220310 11:28]:
> > > On Thu, Mar 10, 2022 at 7:55 AM Liam Howlett <liam.howlett@xxxxxxxxxx> wrote:
> > > >
> > > > * Suren Baghdasaryan <surenb@xxxxxxxxxx> [220225 00:51]:
> > > > > On Thu, Feb 24, 2022 at 8:23 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Thu, Feb 24, 2022 at 08:18:59PM -0800, Andrew Morton wrote:
> > > > > > > On Tue, 15 Feb 2022 12:19:22 -0800 Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset,
> > > > > > > > otherwise it points to a vma that was freed and when reused leads to
> > > > > > > > a use-after-free bug.
> > > > > > > >
> > > > > > > > ...
> > > > > > > >
> > > > > > > > --- a/mm/mmap.c
> > > > > > > > +++ b/mm/mmap.c
> > > > > > > > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm)
> > > > > > > >  		vma = remove_vma(vma);
> > > > > > > >  		cond_resched();
> > > > > > > >  	}
> > > > > > > > +	mm->mmap = NULL;
> > > > > > > >  	mmap_write_unlock(mm);
> > > > > > > >  	vm_unacct_memory(nr_accounted);
> > > > > > > >  }
> > > > > > > >
> > > > > > > After the Maple tree patches, mm_struct.mmap doesn't exist. So I'll
> > > > > > > revert this fix as part of merging the maple-tree parts of linux-next.
> > > > > > > I'll be sending this fix to Linus this week.
> > > > > > >
> > > > > > > All of which means that the thusly-resolved Maple tree patches might
> > > > > > > reintroduce this use-after-free bug.
> > > > > >
> > > > > > I don't think so? The problem is that VMAs are (currently) part of
> > > > > > two data structures -- the rbtree and the linked list. remove_vma()
> > > > > > only removes VMAs from the rbtree; it doesn't set mm->mmap to NULL.
> > > > > >
> > > > > > With maple tree, the linked list goes away.
> > > > > > remove_vma() removes VMAs
> > > > > > from the maple tree. So anyone looking to iterate over all VMAs has to
> > > > > > go and look in the maple tree for them ... and there's nothing there.
> > > > >
> > > > > Yes, I think you are right. With maple trees we don't need this fix.
> > > >
> > > >
> > > > Yes, this is correct. The maple tree removes the entire linked list...
> > > > but since the mm is unstable in the exit_mmap(), I had added the
> > > > destruction of the maple tree there. Maybe this is the wrong place to
> > > > be destroying the tree tracking the VMAs (although this patch partially
> > > > destroys the VMA tracking linked list), but it brought my attention to
> > > > the race that this patch solves and the process_mrelease() function.
> > > > Couldn't this be avoided by using mmget_not_zero() instead of mmgrab()
> > > > in process_mrelease()?
> > >
> > > That's what we were doing before [1]. That unfortunately has a problem
> > > of process_mrelease possibly calling the last mmput and being blocked
> > > on IO completion in exit_aio.
> >
> > Oh, I see. Thanks.
> >
> >
> > > The race between exit_mmap and
> > > process_mrelease is solved by using mmap_lock.
> >
> > I think an important part of the race fix isn't just the lock holding
> > but the setting of the start of the linked list to NULL above. That
> > means the code in __oom_reap_task_mm() via process_mrelease() will
> > continue to execute but iterate for zero VMAs.
> >
> > > I think by destroying the maple tree in exit_mmap before the
> > > mmap_write_unlock call, you keep things working and functionality
> > > intact. Is there any reason this can't be done?
> >
> > Yes, unfortunately. If MMF_OOM_SKIP is not set, then process_mrelease()
> > will call __oom_reap_task_mm() which will get a null pointer dereference
> > or a use after free in the vma iterator as it tries to iterate the maple
> > tree.
> > I think the best plan is to set MMF_OOM_SKIP unconditionally
> > when the mmap_write_lock() is acquired. Doing so will ensure nothing
> > will try to gain memory by reaping a task that no longer has memory to
> > yield - or at least won't shortly. If we do use MMF_OOM_SKIP in such a
> > way, then I think it is safe to quickly drop the lock?
>
> That technically would work but it changes the semantics of
> MMF_OOM_SKIP flag from "mm is of no interest for the OOM killer" to
> something like "mm is empty" akin to mm->mmap == NULL.

Well, an empty mm is of no interest to the oom killer was my thought.

> So, there is no way for maple tree to indicate that it is empty?

On second look, the tree is part of the mm_struct. Destroying will clear
the flags and remove all VMAs, but that should be fine as long as nothing
tries to add anything back to the tree. I don't think there is a
dereference issue here and it will continue to run through the motions on
an empty set as it does right now.

>
> >
> > Also, should process_mrelease() be setting MMF_OOM_VICTIM on this mm?
> > It would enable the fast path on a race with exit_mmap() - though that
> > may not be desirable?
>
> Michal does not like that approach because again, process_mrelease is
> not oom-killer to set MMF_OOM_VICTIM flag. Besides, we want to get rid
> of that special mm_is_oom_victim(mm) branch inside exit_mmap. Which
> reminds me to look into it again.
>
> >
> > >
> > > [1] ba535c1caf3ee78a ("mm/oom_kill: allow process_mrelease to run
> > > under mmap_lock protection")
> > >
> > >
> > > > That would ensure we aren't stepping on an
> > > > exit_mmap() and potentially the locking change in exit_mmap() wouldn't
> > > > be needed either? Logically, I view this as process_mrelease() having
> > > > issue with the fact that the mmaps are no longer stable in tear down
> > > > regardless of the data structure that is used.
> > > >
> > > > Thanks,
> > > > Liam
> > > >
> > > > --
> > > > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@xxxxxxxxxxx.
> > > >
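
For anyone following along, the effect of resetting the list head can be
modelled in user space. The sketch below is only an illustration, not kernel
code. The names struct toy_vma, struct toy_mm, toy_populate(),
toy_exit_mmap() and toy_reap() are hypothetical stand-ins for
vm_area_struct, mm_struct, exit_mmap() and the VMA walk done by
__oom_reap_task_mm(). Locking (mmap_lock) is left out entirely, so this
shows only the "iterate for zero VMAs" part of the fix, not the race
protection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for vm_area_struct: only the list linkage. */
struct toy_vma {
	struct toy_vma *next;
};

/* Hypothetical stand-in for mm_struct: only the old mm->mmap list head. */
struct toy_mm {
	struct toy_vma *mmap;
};

/* Push n freshly allocated VMAs onto the list; 0 on success, -1 on failure. */
static int toy_populate(struct toy_mm *mm, size_t n)
{
	assert(mm != NULL);
	for (size_t i = 0; i < n; i++) {
		struct toy_vma *vma = malloc(sizeof(*vma));

		if (!vma)
			return -1;
		vma->next = mm->mmap;
		mm->mmap = vma;
	}
	return 0;
}

/* Free every VMA, then reset the head so no dangling pointer remains. */
static void toy_exit_mmap(struct toy_mm *mm)
{
	struct toy_vma *vma;

	assert(mm != NULL);
	vma = mm->mmap;
	while (vma) {
		struct toy_vma *next = vma->next;

		free(vma);
		vma = next;
	}
	mm->mmap = NULL;	/* models the one-line fix in the quoted patch */
}

/* Walk the list as a reaper would and count the VMAs visited. */
static size_t toy_reap(const struct toy_mm *mm)
{
	size_t visited = 0;

	assert(mm != NULL);
	for (const struct toy_vma *vma = mm->mmap; vma; vma = vma->next)
		visited++;
	return visited;
}
```

After toy_populate(&mm, 3), toy_reap(&mm) visits three VMAs. After
toy_exit_mmap(&mm), it visits none. Without the reset of mm->mmap, the same
walk would start from freed memory, which is the use-after-free the original
patch addresses.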