The patch titled
     Subject: mm/khugepaged: write-lock VMA while collapsing a huge page
has been added to the -mm mm-unstable branch.  Its filename is
     mm-khugepaged-write-lock-vma-while-collapsing-a-huge-page.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-khugepaged-write-lock-vma-while-collapsing-a-huge-page.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when
    testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Suren Baghdasaryan <surenb@xxxxxxxxxx>
Subject: mm/khugepaged: write-lock VMA while collapsing a huge page
Date: Mon, 27 Feb 2023 09:36:14 -0800

Protect the VMA from concurrent page fault handlers while collapsing a
huge page.  The page fault handler needs a stable PMD in order to use the
PTL, and relies on the per-VMA lock to prevent concurrent PMD changes.
pmdp_collapse_flush(), set_huge_pmd() and collapse_and_free_pmd() can
modify a PMD, and such modifications will not be detected by the page
fault handler without proper locking.

Before this patch, page tables could be walked under any one of the
mmap_lock, the mapping lock, and the anon_vma lock; so when khugepaged
unlinks and frees page tables, it must ensure that all of those either
are locked or don't exist.  This patch adds a fourth lock under which
page tables can be traversed, so khugepaged must also lock out that one.
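As a rough sketch of the ordering this patch establishes (illustrative
only, not part of the patch: vma_start_write() is the helper being
called here, lock_vma_under_rcu() and vma_end_read() come from other
patches in the per-VMA lock series, and handle_fault_under_vma_lock()
is a hypothetical stand-in for the arch fault-handling code):

/* khugepaged side: called with mmap_lock already held for write */
static void collapse_pmd_sketch(struct vm_area_struct *vma)
{
	/*
	 * Mark the VMA write-locked.  Until the mmap_lock is dropped,
	 * lock_vma_under_rcu() will refuse to hand this VMA to the page
	 * fault path, so the PMD cannot be walked concurrently.
	 */
	vma_start_write(vma);

	/* now safe: pmdp_collapse_flush(), collapse_and_free_pmd(), ... */
}

/* fault side: try the per-VMA lock before falling back to mmap_lock */
static bool per_vma_fault_sketch(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;

	vma = lock_vma_under_rcu(mm, addr);	/* NULL if write-locked */
	if (!vma)
		return false;		/* caller retries under mmap_lock */

	/*
	 * The VMA's page tables, including any PMD khugepaged might
	 * collapse, are stable until vma_end_read().
	 */
	handle_fault_under_vma_lock(vma, addr);	/* hypothetical helper */
	vma_end_read(vma);
	return true;
}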
Link: https://lkml.kernel.org/r/20230227173632.3292573-16-surenb@xxxxxxxxxx
Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

--- a/mm/khugepaged.c~mm-khugepaged-write-lock-vma-while-collapsing-a-huge-page
+++ a/mm/khugepaged.c
@@ -1147,6 +1147,7 @@ static int collapse_huge_page(struct mm_
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
+	vma_start_write(vma);
 	anon_vma_lock_write(vma->anon_vma);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
@@ -1614,6 +1615,9 @@ int collapse_pte_mapped_thp(struct mm_st
 		goto drop_hpage;
 	}
 
+	/* Lock the vma before taking i_mmap and page table locks */
+	vma_start_write(vma);
+
 	/*
 	 * We need to lock the mapping so that from here on, only GUP-fast and
 	 * hardware page walks can access the parts of the page tables that
@@ -1819,6 +1823,7 @@ static int retract_page_tables(struct ad
 			result = SCAN_PTE_UFFD_WP;
 			goto unlock_next;
 		}
+		vma_start_write(vma);
 		collapse_and_free_pmd(mm, vma, addr, pmd);
 		if (!cc->is_khugepaged && is_target)
 			result = set_huge_pmd(vma, addr, pmd, hpage);
--- a/mm/rmap.c~mm-khugepaged-write-lock-vma-while-collapsing-a-huge-page
+++ a/mm/rmap.c
@@ -25,21 +25,22 @@
  *     mapping->invalidate_lock (in filemap_fault)
  *       page->flags PG_locked (lock_page)
  *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
- *           mapping->i_mmap_rwsem
- *             anon_vma->rwsem
- *               mm->page_table_lock or pte_lock
- *                 swap_lock (in swap_duplicate, swap_info_get)
- *                   mmlist_lock (in mmput, drain_mmlist and others)
- *                   mapping->private_lock (in block_dirty_folio)
- *                     folio_lock_memcg move_lock (in block_dirty_folio)
- *                       i_pages lock (widely used)
- *                         lruvec->lru_lock (in folio_lruvec_lock_irq)
- *                   inode->i_lock (in set_page_dirty's __mark_inode_dirty)
- *                   bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
- *                     sb_lock (within inode_lock in fs/fs-writeback.c)
- *                     i_pages lock (widely used, in set_page_dirty,
- *                               in arch-dependent flush_dcache_mmap_lock,
- *                               within bdi.wb->list_lock in __sync_single_inode)
+ *           vma_start_write
+ *             mapping->i_mmap_rwsem
+ *               anon_vma->rwsem
+ *                 mm->page_table_lock or pte_lock
+ *                   swap_lock (in swap_duplicate, swap_info_get)
+ *                     mmlist_lock (in mmput, drain_mmlist and others)
+ *                     mapping->private_lock (in block_dirty_folio)
+ *                       folio_lock_memcg move_lock (in block_dirty_folio)
+ *                         i_pages lock (widely used)
+ *                           lruvec->lru_lock (in folio_lruvec_lock_irq)
+ *                     inode->i_lock (in set_page_dirty's __mark_inode_dirty)
+ *                     bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
+ *                       sb_lock (within inode_lock in fs/fs-writeback.c)
+ *                       i_pages lock (widely used, in set_page_dirty,
+ *                                 in arch-dependent flush_dcache_mmap_lock,
+ *                                 within bdi.wb->list_lock in __sync_single_inode)
  *
  * anon_vma->rwsem,mapping->i_mmap_rwsem (memory_failure, collect_procs_anon)
  *   ->tasklist_lock
_

Patches currently in -mm which might be from surenb@xxxxxxxxxx are

mm-introduce-config_per_vma_lock.patch
mm-move-mmap_lock-assert-function-definitions.patch
mm-add-per-vma-lock-and-helper-functions-to-control-it.patch
mm-mark-vma-as-being-written-when-changing-vm_flags.patch
mm-mmap-move-vma_prepare-before-vma_adjust_trans_huge.patch
mm-khugepaged-write-lock-vma-while-collapsing-a-huge-page.patch
mm-mmap-write-lock-vmas-in-vma_prepare-before-modifying-them.patch
mm-mremap-write-lock-vma-while-remapping-it-to-a-new-address-range.patch
mm-write-lock-vmas-before-removing-them-from-vma-tree.patch
mm-conditionally-write-lock-vma-in-free_pgtables.patch
kernel-fork-assert-no-vma-readers-during-its-destruction.patch
mm-mmap-prevent-pagefault-handler-from-racing-with-mmu_notifier-registration.patch
mm-introduce-vma-detached-flag.patch
mm-introduce-lock_vma_under_rcu-to-be-used-from-arch-specific-code.patch
mm-fall-back-to-mmap_lock-if-vma-anon_vma-is-not-yet-set.patch
mm-add-fault_flag_vma_lock-flag.patch
mm-prevent-do_swap_page-from-handling-page-faults-under-vma-lock.patch
mm-prevent-userfaults-to-be-handled-under-per-vma-lock.patch
mm-introduce-per-vma-lock-statistics.patch
x86-mm-try-vma-lock-based-page-fault-handling-first.patch
arm64-mm-try-vma-lock-based-page-fault-handling-first.patch
mm-mmap-free-vm_area_struct-without-call_rcu-in-exit_mmap.patch
mm-separate-vma-lock-from-vm_area_struct.patch
per-vma-locks.patch