On 2024/7/2 12:29, Leah Rumancik wrote:
> From: Miaohe Lin <linmiaohe@xxxxxxxxxx>
>
> commit 35e351780fa9d8240dd6f7e4f245f9ea37e96c19 upstream.
>
> Thorvald reported a WARNING [1]. And the root cause is below race:
>
>  CPU 1                                  CPU 2
>  fork                                   hugetlbfs_fallocate
>   dup_mmap                               hugetlbfs_punch_hole
>    i_mmap_lock_write(mapping);
>    vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>    i_mmap_unlock_write(mapping);
>    hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
>                                          i_mmap_lock_write(mapping);
>                                          hugetlb_vmdelete_list
>                                           vma_interval_tree_foreach
>                                            hugetlb_vma_trylock_write -- Vma_lock is cleared.
>    tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
>                                            hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
>                                          i_mmap_unlock_write(mapping);
>
> hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside
> i_mmap_rwsem lock while vma lock can be used at the same time. Fix this
> by deferring linking file vma until vma is fully initialized. Those vmas
> should be initialized first before they can be used.
>
> Backport notes:
>
> The first backport attempt (cec11fa2e) was reverted (dd782da4707). This is
> the new backport of the original fix (35e351780fa9).
>
> 35e351780f ("fork: defer linking file vma until vma is fully initialized")
> fixed a hugetlb locking race by moving a bunch of initialization code to
> earlier in the function. The call to open() was included in the move but the
> call to copy_page_range was not, effectively inverting their relative
> ordering. This created an issue for the vfio code, which assumes
> copy_page_range happens before the call to open() - vfio's open zaps the vma
> so that the fault handler is invoked later, but when we inverted the
> ordering, copy_page_range can set up mappings post-zap, which would prevent
> the fault handler from being invoked later.
> This patch moves the call to copy_page_range to earlier than the call to
> open() to restore the original ordering of the two functions while keeping
> the fix for hugetlb intact.

Thanks for your update!

> Commit aac6db75a9 made several changes to vfio_pci_core.c, including
> removing the vfio-pci custom open function. This resolves the issue on
> the main branch, so we only need to apply these changes when
> backporting to stable branches.
>
> 35e351780f ("fork: defer linking file vma until vma is fully initialized") -> v6.9-rc5
> aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") -> v6.10-rc4
>
> Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@xxxxxxxxxx
> Fixes: 8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
> Signed-off-by: Miaohe Lin <linmiaohe@xxxxxxxxxx>
> Reported-by: Thorvald Natvig <thorvald@xxxxxxxxxx>
> Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
> Reviewed-by: Jane Chu <jane.chu@xxxxxxxxxx>
> Cc: Christian Brauner <brauner@xxxxxxxxxx>
> Cc: Heiko Carstens <hca@xxxxxxxxxxxxx>
> Cc: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> Cc: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
> Cc: Mateusz Guzik <mjguzik@xxxxxxxxx>
> Cc: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> Cc: Miaohe Lin <linmiaohe@xxxxxxxxxx>
> Cc: Muchun Song <muchun.song@xxxxxxxxx>
> Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
> Cc: Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx>
> Cc: Tycho Andersen <tandersen@xxxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Miaohe Lin <linmiaohe@xxxxxxxxxx>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Leah Rumancik <leah.rumancik@xxxxxxxxx>
> ---
>  kernel/fork.c | 27 +++++++++++++--------------
>  1 file changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 177ce7438db6..122d2cd124d5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -727,6 +727,19 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		} else if (anon_vma_fork(tmp, mpnt))
>  			goto fail_nomem_anon_vma_fork;
>  		vm_flags_clear(tmp, VM_LOCKED_MASK);
> +		/*
> +		 * Copy/update hugetlb private vma information.
> +		 */
> +		if (is_vm_hugetlb_page(tmp))
> +			hugetlb_dup_vma_private(tmp);
> +
> +		if (!(tmp->vm_flags & VM_WIPEONFORK) &&
> +		    copy_page_range(tmp, mpnt))
> +			goto fail_nomem_vmi_store;
> +
> +		if (tmp->vm_ops && tmp->vm_ops->open)
> +			tmp->vm_ops->open(tmp);
> +
>  		file = tmp->vm_file;
>  		if (file) {
>  			struct address_space *mapping = file->f_mapping;
> @@ -743,25 +756,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  			i_mmap_unlock_write(mapping);
>  		}
>
> -		/*
> -		 * Copy/update hugetlb private vma information.
> -		 */
> -		if (is_vm_hugetlb_page(tmp))
> -			hugetlb_dup_vma_private(tmp);
> -
>  		/* Link the vma into the MT */
>  		if (vma_iter_bulk_store(&vmi, tmp))
>  			goto fail_nomem_vmi_store;
>
>  		mm->map_count++;
> -		if (!(tmp->vm_flags & VM_WIPEONFORK))
> -			retval = copy_page_range(tmp, mpnt);

I have a vague memory that copy_page_range should be called after the vma is
inserted into the i_mmap tree. Otherwise there might be a problem:

 dup_mmap                               remove_migration_ptes
  copy_page_range -- Child process copies migration entry from parent.
                                         rmap_walk
                                          rmap_walk_file
                                           i_mmap_lock_read(mapping);
                                           vma_interval_tree_foreach
                                            remove_migration_pte
                                             -- The vma of the child process is
                                                still invisible, so a migration
                                                entry is left in the child
                                                process's address space.
                                           i_mmap_unlock_read(mapping);
  i_mmap_lock_write(mapping);
  vma_interval_tree_insert_after -- Too late! Child process has a stale
                                    migration entry left while migration
                                    is already done!
  i_mmap_unlock_write(mapping);

But I'm not really sure. I might miss something.

Thanks.
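For reference, the three ordering constraints discussed in this thread can be collected into one rough sketch of the per-vma loop in dup_mmap(). This is pseudocode, not the actual kernel/fork.c source: the helper name link_into_i_mmap_tree() is invented shorthand for the vma_interval_tree_insert_after() step done under i_mmap_lock_write(), and error handling, VM_WIPEONFORK, and the file/vm_ops NULL checks are elided.

```
for_each_vma(parent, mpnt) {
	tmp = copy_vma(mpnt);

	/* (1) Hugetlb fix: init private vma state and call ->open()
	 *     before the vma becomes visible in the i_mmap tree.     */
	hugetlb_dup_vma_private(tmp);

	/* (2) Backport fix: copy_page_range() before ->open(), since
	 *     vfio's open zaps the vma and expects faults afterwards. */
	copy_page_range(tmp, mpnt);
	tmp->vm_ops->open(tmp);

	/* (3) The reply's concern: if copy_page_range() runs before this
	 *     insertion, remove_migration_ptes() walking the i_mmap tree
	 *     cannot see the child vma, possibly leaving stale migration
	 *     entries in the child.                                     */
	link_into_i_mmap_tree(tmp);         /* visible to rmap walks   */

	vma_iter_bulk_store(&vmi, tmp);     /* link into the maple tree */
}
```

Constraints (2) and (3) pull copy_page_range() in opposite directions relative to i_mmap visibility, which is the open question at the end of the reply.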