On Mon, Sep 27, 2021 at 03:11:20AM -0700, Nadav Amit wrote: > > > On Sep 27, 2021, at 2:08 AM, Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote: > > > > On Sun, Sep 26, 2021 at 09:12:52AM -0700, Nadav Amit wrote: > >> From: Nadav Amit <namit@xxxxxxxxxx> > >> > >> The comment in madvise_dontneed_free() says that vma splits that occur > >> while the mmap-lock is dropped, during userfaultfd_remove(), should be > >> handled correctly, but nothing in the code indicates that it is so: prev > >> is invalidated, and do_madvise() will therefore continue to update VMAs > >> from the "obsolete" end (i.e., the one before the split). > >> > >> Propagate the changes to end from madvise_dontneed_free() back to > >> do_madvise() and continue the updates from the new end accordingly. > > > > Could you describe in details a race that would lead to wrong behaviour? > > Thanks for the quick response. > > For instance, madvise(MADV_DONTNEED) can race with mprotect() and cause > the VMA to split. > > Something like: > > CPU0 CPU1 > ---- ---- > madvise(0x10000, 0x2000, MADV_DONTNEED) > -> userfaultfd_remove() > [ mmap-lock dropped ] > mprotect(0x11000, 0x1000, PROT_READ) > [splitting the VMA] > > read(uffd) > [unblocking userfaultfd_remove()] > > [ resuming ] > end = vma->vm_end > [end == 0x11000] > > madvise_dontneed_single_vma(vma, 0x10000, 0x11000) > > Following this operation, 0x11000-0x12000 would not be zapped. Okay, fair enough. Wouldn't something like this work too: diff --git a/mm/madvise.c b/mm/madvise.c index 0734db8d53a7..0898120c5c04 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -796,6 +796,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, */ return -ENOMEM; } + *prev = vma; if (!can_madv_lru_vma(vma)) return -EINVAL; if (end > vma->vm_end) { -- Kirill A. Shutemov