On Tue, Aug 15, 2023 at 9:29 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
>
> * Jann Horn <jannh@xxxxxxxxxx> [230814 17:22]:
> > On Mon, Aug 14, 2023 at 10:32 PM Liam R. Howlett
> > <Liam.Howlett@xxxxxxxxxx> wrote:
> > > * Jann Horn <jannh@xxxxxxxxxx> [230814 11:44]:
> > > > @akpm
> > > >
> > > > On Mon, Jul 24, 2023 at 8:31 PM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
> > > > > Since prev will be set later in the function, it is better to reverse
> > > > > the splitting direction of the start VMA (modify the new_below argument
> > > > > to __split_vma).
> > > >
> > > > It might be a good idea to reorder "mm: always lock new vma before
> > > > inserting into vma tree" before this patch.
> > > >
> > > > If you apply this patch without "mm: always lock new vma before
> > > > inserting into vma tree", I think move_vma(), when called with a start
> > > > address in the middle of a VMA, will behave like this:
> > > >
> > > > - vma_start_write() [lock the VMA to be moved]
> > > > - move_page_tables() [moves page table entries]
> > > > - do_vmi_munmap()
> > > >   - do_vmi_align_munmap()
> > > >     - __split_vma()
> > > >       - creates a new VMA **covering the moved range** that is **not locked**
> > > >       - stores the new VMA in the VMA tree **without locking it** [1]
> > > >     - new VMA is locked and removed again [2]
> > > > [...]
> > > >
> > > > So after the page tables in the region have already been moved, I
> > > > believe there will be a brief window (between [1] and [2]) where page
> > > > faults in the region can happen again, which could probably cause new
> > > > page tables and PTEs to be created in the region again in that window.
> > > > (This can't happen in Linus' current tree because the new VMA created
> > > > by __split_vma() only covers the range that is not being moved.)
> > >
> > > Ah, so reversing which VMA is kept by the first split call opens a
> > > window where the VMA being removed is not locked.  Good catch.
>
> Looking at this again, I think the window exists in Linus' tree and my
> change actually removes it:
>
> -		error = __split_vma(vmi, vma, start, 0);
> +		error = __split_vma(vmi, vma, start, 1);
> 		if (error)
> 			goto start_split_failed;
>
> The last argument is "new_below", which means the new VMA will be at the
> lower address.  I don't love the int argument or its name, and the
> documentation for the split function is lacking.
>
> So, once we split at "start", vm_end = "start" in the new VMA while
> "start" will be in the old VMA.  I then lock the old VMA to be removed
> (again) and add it to the detached maple tree.
>
> Before my patch, we split the VMA and took the new, unlocked VMA for
> removal, until I locked the new VMA to be removed and added it to the
> detached maple tree.  So there is a window where we write the new split
> VMA into the tree prior to locking it, but it is locked before removal.
>
> This change also aligns the splitting with the other callers, which use
> the split_vma() wrapper.

Oooh, you're right. Sorry, I misread that.

> > > >
> > > > Though I guess that's not going to lead to anything bad, since
> > > > do_vmi_munmap() cleans up PTEs and page tables in the region anyway?
> > > > So maybe it's not that important.
> > >
> > > do_vmi_munmap() will clean up PTEs from the end of the previous VMA to
> > > the start of the next.
> >
> > Alright, I guess no action is needed here then.
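To make the "new_below" direction concrete, here is a minimal userspace
model of the split behavior described above. This is only a sketch: struct
vma and split_vma_model() are simplified stand-ins of my own, not the
kernel's struct vm_area_struct or the real __split_vma(); it models just
the one fact the thread establishes, namely which half of the range the
new VMA ends up covering.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for struct vm_area_struct: only the range. */
struct vma {
	unsigned long vm_start;	/* inclusive */
	unsigned long vm_end;	/* exclusive */
};

/*
 * Model of the split direction controlled by new_below.  Returns the
 * newly created VMA; 'vma' is adjusted in place to keep the other half.
 */
static struct vma *split_vma_model(struct vma *vma, unsigned long addr,
				   int new_below)
{
	struct vma *new = malloc(sizeof(*new));

	assert(new && addr > vma->vm_start && addr < vma->vm_end);
	if (new_below) {
		/* New VMA is the lower piece: [vm_start, addr). */
		new->vm_start = vma->vm_start;
		new->vm_end = addr;
		vma->vm_start = addr;	/* old VMA keeps [addr, vm_end) */
	} else {
		/* New VMA is the upper piece: [addr, vm_end). */
		new->vm_start = addr;
		new->vm_end = vma->vm_end;
		vma->vm_end = addr;	/* old VMA keeps [vm_start, addr) */
	}
	return new;
}

int main(void)
{
	struct vma v = { 0x1000, 0x5000 };
	struct vma *n = split_vma_model(&v, 0x3000, 1);

	/* With new_below == 1, "start" lands in the old VMA, so the old
	 * VMA covers the range being removed. */
	printf("old: [%#lx, %#lx)  new: [%#lx, %#lx)\n",
	       v.vm_start, v.vm_end, n->vm_start, n->vm_end);
	free(n);
	return 0;
}

Under this model, flipping to new_below == 1 means the piece stored into
the tree as a new (initially unlocked) VMA is the lower piece that stays
mapped, while the piece being removed, [start, vm_end), remains the
pre-existing VMA, which move_vma() had already locked via
vma_start_write() and which gets locked again before removal; as Liam
notes above, that is why the flip removes the window rather than opening
one.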
>
> I don't see a difference between this and the race that exists after the
> page fault ends and a task unmaps the area prior to the first task using
> the faulted areas?

Yeah, you're right. I was thinking about it in terms of "this is a weird
situation and it would be dangerous if something relied on there not being
any non-empty PTEs in the range", but there's nothing that relies on it,
so that's fine.
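For illustration, here is a userspace sketch of the pre-existing race
Liam is comparing against; the program and its names are purely
illustrative, not taken from the thread. One task faults a page in,
and nothing stops another task from unmapping it before the first task
touches it again, so a completed fault never guarantees the PTEs persist.

#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static char *region;
static size_t len;

/* Second task: tear the mapping down at an arbitrary point. */
static void *unmapper(void *arg)
{
	munmap(region, len);
	return NULL;
}

int main(void)
{
	pthread_t t;

	len = (size_t)sysconf(_SC_PAGESIZE);
	region = mmap(NULL, len, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED)
		return 1;

	region[0] = 1;		/* first task: fault the page in */
	pthread_create(&t, NULL, unmapper, NULL);
	pthread_join(t, NULL);
	/* Touching region[0] here would fault again (and SIGSEGV, since
	 * the mapping is gone): the earlier fault's PTEs carry no
	 * guarantee once another task can unmap the range. */
	printf("mapping gone; the completed fault guaranteed nothing\n");
	return 0;
}

Build with -lpthread. The brief window in the munmap path is equivalent:
any PTEs created there are torn down again before the operation completes,
just as this program's fault results vanish at the munmap().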