The patch titled Subject: mm/vma: correctly position vma_iterator in __split_vma() has been added to the -mm mm-unstable branch. Its filename is mm-vma-correctly-position-vma_iterator-in-__split_vma.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-vma-correctly-position-vma_iterator-in-__split_vma.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx> Subject: mm/vma: correctly position vma_iterator in __split_vma() Date: Thu, 22 Aug 2024 15:25:23 -0400 It is now possible to walk the vma tree using the rcu read locks and is beneficial to do so to reduce lock contention. Doing so while a MAP_FIXED mapping is executing means that a reader may see a gap in the vma tree that should never logically exist - and does not when using the mmap lock in read mode. The temporal gap exists because mmap_region() calls munmap() prior to installing the new mapping. This patchset stops rcu readers from seeing the temporal gap by splitting up the munmap() function into two parts. The first part prepares the vma tree for modifications by doing the necessary splits and tracks the vmas marked for removal in a side tree. The second part completes the munmapping of the vmas after the vma tree has been overwritten (either by a MAP_FIXED replacement vma or by a NULL in the munmap() case). Please note that rcu walkers will still be able to see a temporary state of split vmas that may be in the process of being removed, but the temporal gap will not be exposed. vma_start_write() are called on both parts of the split vma, so this state is detectable. If existing vmas have a vm_ops->close(), then they will be called prior to mapping the new vmas (and ptes are cleared out). Without calling ->close(), hugetlbfs tests fail (hugemmap06 specifically) due to resources still being marked as 'busy'. Unfortunately, calling the corresponding ->open() may not restore the state of the vmas, so it is safer to keep the existing failure scenario where a gap is inserted and never replaced. The failure scenario is in its own patch (0015) for traceability. This patch (of 21): The vma iterator may be left pointing to the newly created vma. This happens when inserting the new vma at the end of the old vma (!new_below). The incorrect position in the vma iterator is not exposed currently since the vma iterator is repositioned in the munmap path and is not reused in any of the other paths. This has limited impact in the current code, but is required for future changes. Link: https://lkml.kernel.org/r/20240822192543.3359552-1-Liam.Howlett@xxxxxxxxxx Link: https://lkml.kernel.org/r/20240822192543.3359552-2-Liam.Howlett@xxxxxxxxxx Fixes: b2b3b886738f ("mm: don't use __vma_adjust() in __split_vma()") Signed-off-by: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> Reviewed-by: Suren Baghdasaryan <surenb@xxxxxxxxxx> Reviewed-by: Lorenzo Stoakes <lstoakes@xxxxxxxxx> Cc: Bert Karwatzki <spasswolf@xxxxxx> Cc: Jiri Olsa <olsajiri@xxxxxxxxx> Cc: Kees Cook <kees@xxxxxxxxxx> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx> Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxx> Cc: Sidhartha Kumar <sidhartha.kumar@xxxxxxxxxx> Cc: Vlastimil Babka <vbabka@xxxxxxx> Cc: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> Cc: Paul Moore <paul@xxxxxxxxxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> --- mm/vma.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) --- a/mm/vma.c~mm-vma-correctly-position-vma_iterator-in-__split_vma +++ a/mm/vma.c @@ -177,7 +177,7 @@ void unmap_region(struct mm_struct *mm, /* * __split_vma() bypasses sysctl_max_map_count checking. We use this where it * has already been checked or doesn't make sense to fail. - * VMA Iterator will point to the end VMA. + * VMA Iterator will point to the original VMA. */ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long addr, int new_below) @@ -246,6 +246,9 @@ static int __split_vma(struct vma_iterat /* Success. */ if (new_below) vma_next(vmi); + else + vma_prev(vmi); + return 0; out_free_mpol: _ Patches currently in -mm which might be from Liam.Howlett@xxxxxxxxxx are maple_tree-remove-rcu_read_lock-from-mt_validate.patch mm-vma-correctly-position-vma_iterator-in-__split_vma.patch mm-vma-introduce-abort_munmap_vmas.patch mm-vma-introduce-vmi_complete_munmap_vmas.patch mm-vma-extract-the-gathering-of-vmas-from-do_vmi_align_munmap.patch mm-vma-introduce-vma_munmap_struct-for-use-in-munmap-operations.patch mm-vma-change-munmap-to-use-vma_munmap_struct-for-accounting-and-surrounding-vmas.patch mm-vma-change-munmap-to-use-vma_munmap_struct-for-accounting-and-surrounding-vmas-fix.patch mm-vma-extract-validate_mm-from-vma_complete.patch mm-vma-inline-munmap-operation-in-mmap_region.patch mm-vma-expand-mmap_region-munmap-call.patch mm-vma-support-vma-==-null-in-init_vma_munmap.patch mm-mmap-reposition-vma-iterator-in-mmap_region.patch mm-vma-track-start-and-end-for-munmap-in-vma_munmap_struct.patch mm-clean-up-unmap_region-argument-list.patch mm-mmap-avoid-zeroing-vma-tree-in-mmap_region.patch mm-change-failure-of-map_fixed-to-restoring-the-gap-on-failure.patch mm-mmap-use-phys_pfn-in-mmap_region.patch mm-mmap-use-vms-accounted-pages-in-mmap_region.patch ipc-shm-mm-drop-do_vma_munmap.patch mm-move-may_expand_vm-check-in-mmap_region.patch mm-vma-drop-incorrect-comment-from-vms_gather_munmap_vmas.patch mm-vmah-optimise-vma_munmap_struct.patch