On Mon, Sep 27, 2021 at 05:33:39AM -0700, Nadav Amit wrote: > > > > On Sep 27, 2021, at 4:55 AM, Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote: > > > > On Mon, Sep 27, 2021 at 03:11:20AM -0700, Nadav Amit wrote: > >> > >>> On Sep 27, 2021, at 2:08 AM, Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote: > >>> > >>> On Sun, Sep 26, 2021 at 09:12:52AM -0700, Nadav Amit wrote: > >>>> From: Nadav Amit <namit@xxxxxxxxxx> > >>>> > >>>> The comment in madvise_dontneed_free() says that vma splits that occur > >>>> while the mmap-lock is dropped, during userfaultfd_remove(), should be > >>>> handled correctly, but nothing in the code indicates that it is so: prev > >>>> is invalidated, and do_madvise() will therefore continue to update VMAs > >>>> from the "obsolete" end (i.e., the one before the split). > >>>> > >>>> Propagate the changes to end from madvise_dontneed_free() back to > >>>> do_madvise() and continue the updates from the new end accordingly. > >>> > >>> Could you describe in details a race that would lead to wrong behaviour? > >> > >> Thanks for the quick response. > >> > >> For instance, madvise(MADV_DONTNEED) can race with mprotect() and cause > >> the VMA to split. > >> > >> Something like: > >> > >> CPU0 CPU1 > >> ---- ---- > >> madvise(0x10000, 0x2000, MADV_DONTNEED) > >> -> userfaultfd_remove() > >> [ mmap-lock dropped ] > >> mprotect(0x11000, 0x1000, PROT_READ) > >> [splitting the VMA] > >> > >> read(uffd) > >> [unblocking userfaultfd_remove()] > >> > >> [ resuming ] > >> end = vma->vm_end > >> [end == 0x11000] > >> > >> madvise_dontneed_single_vma(vma, 0x10000, 0x11000) > >> > >> Following this operation, 0x11000-0x12000 would not be zapped. > > > > Okay, fair enough. > > > > Wouldn't something like this work too: > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > index 0734db8d53a7..0898120c5c04 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -796,6 +796,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, > > */ > > return -ENOMEM; > > } > > + *prev = vma; > > if (!can_madv_lru_vma(vma)) > > return -EINVAL; > > if (end > vma->vm_end) { > > Admittedly (embarrassingly?) I didn’t even consider it since all the > comments say that once the lock is dropped prev should be invalidated. > > Let’s see, considering the aforementioned scenario and that there is > initially one VMA between 0x10000-0x12000. > > Looking at the code from do_madvise()): > > [ end == 0x12000 ] > > tmp = vma->vm_end; > > [ tmp == 0x12000 ] > > if (end < tmp) > tmp = end; > > /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */ > > error = madvise_vma(vma, &prev, start, tmp, behavior); > > [ prev->vm_end == 0x11000 after the split] > > if (error) > goto out; > start = tmp; > > [ start == 0x12000 ] > > if (prev && start < prev->vm_end) > start = prev->vm_end; > > [ The condition (start < prev->vm_end) is false, start not updated ] > > error = unmapped_error; > if (start >= end) > goto out; > > [ start >= end; so we end without updating the second part of the split ] > > So it does not work. > > Perhaps adding this one on top of yours? I can test it when I wake up. > It is cleaner, but I am not sure if I am missing something. It should work. BTW, shouldn't we bring madvise_willneed() and madvise_remove() to the same scheme? -- Kirill A. Shutemov