Re: [PATCH 6.6] fork: defer linking file vma until vma is fully initialized

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2024/7/2 12:29, Leah Rumancik wrote:
> From: Miaohe Lin <linmiaohe@xxxxxxxxxx>
> 
> commit 35e351780fa9d8240dd6f7e4f245f9ea37e96c19 upstream.
> 
> Thorvald reported a WARNING [1]. And the root cause is below race:
> 
>  CPU 1					CPU 2
>  fork					hugetlbfs_fallocate
>   dup_mmap				 hugetlbfs_punch_hole
>    i_mmap_lock_write(mapping);
>    vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>    i_mmap_unlock_write(mapping);
>    hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
> 					 i_mmap_lock_write(mapping);
>    					 hugetlb_vmdelete_list
> 					  vma_interval_tree_foreach
> 					   hugetlb_vma_trylock_write -- Vma_lock is cleared.
>    tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
> 					   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
> 					 i_mmap_unlock_write(mapping);
> 
> hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside
> i_mmap_rwsem lock while vma lock can be used in the same time.  Fix this
> by deferring linking file vma until vma is fully initialized.  Those vmas
> should be initialized first before they can be used.
> 
> Backport notes:
> 
> The first backport attempt (cec11fa2e) was reverted (dd782da4707). This is
> the new backport of the original fix (35e351780fa9).
> 
> 35e351780f ("fork: defer linking file vma until vma is fully initialized")
> fixed a hugetlb locking race by moving a bunch of intialization code to earlier
> in the function. The call to open() was included in the move but the call to
> copy_page_range was not, effectively inverting their relative ordering. This
> created an issue for the vfio code which assumes copy_page_range happens before
> the call to open() - vfio's open zaps the vma so that the fault handler is
> invoked later, but when we inverted the ordering, copy_page_range can set up
> mappings post-zap which would prevent the fault handler from being invoked
> later. This patch moves the call to copy_page_range to earlier than the call to
> open() to restore the original ordering of the two functions while keeping the
> fix for hugetlb intact.

Thanks for your update!

> 
> Commit aac6db75a9 made several changes to vfio_pci_core.c, including
> removing the vfio-pci custom open function. This resolves the issue on
> the main branch and so we only need to apply these changes when
> backporting to stable branches.
> 
> 35e351780f ("fork: defer linking file vma until vma is fully initialized")-> v6.9-rc5
> aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") -> v6.10-rc4
> 
> Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@xxxxxxxxxx
> Fixes: 8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
> Signed-off-by: Miaohe Lin <linmiaohe@xxxxxxxxxx>
> Reported-by: Thorvald Natvig <thorvald@xxxxxxxxxx>
> Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
> Reviewed-by: Jane Chu <jane.chu@xxxxxxxxxx>
> Cc: Christian Brauner <brauner@xxxxxxxxxx>
> Cc: Heiko Carstens <hca@xxxxxxxxxxxxx>
> Cc: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> Cc: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
> Cc: Mateusz Guzik <mjguzik@xxxxxxxxx>
> Cc: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> Cc: Miaohe Lin <linmiaohe@xxxxxxxxxx>
> Cc: Muchun Song <muchun.song@xxxxxxxxx>
> Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
> Cc: Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx>
> Cc: Tycho Andersen <tandersen@xxxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Miaohe Lin <linmiaohe@xxxxxxxxxx>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Leah Rumancik <leah.rumancik@xxxxxxxxx>
> ---
>  kernel/fork.c | 27 +++++++++++++--------------
>  1 file changed, 13 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 177ce7438db6..122d2cd124d5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -727,6 +727,19 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		} else if (anon_vma_fork(tmp, mpnt))
>  			goto fail_nomem_anon_vma_fork;
>  		vm_flags_clear(tmp, VM_LOCKED_MASK);
> +		/*
> +		 * Copy/update hugetlb private vma information.
> +		 */
> +		if (is_vm_hugetlb_page(tmp))
> +			hugetlb_dup_vma_private(tmp);
> +
> +		if (!(tmp->vm_flags & VM_WIPEONFORK) &&
> +				copy_page_range(tmp, mpnt))
> +			goto fail_nomem_vmi_store;
> +
> +		if (tmp->vm_ops && tmp->vm_ops->open)
> +			tmp->vm_ops->open(tmp);
> +
>  		file = tmp->vm_file;
>  		if (file) {
>  			struct address_space *mapping = file->f_mapping;
> @@ -743,25 +756,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  			i_mmap_unlock_write(mapping);
>  		}
>  
> -		/*
> -		 * Copy/update hugetlb private vma information.
> -		 */
> -		if (is_vm_hugetlb_page(tmp))
> -			hugetlb_dup_vma_private(tmp);
> -
>  		/* Link the vma into the MT */
>  		if (vma_iter_bulk_store(&vmi, tmp))
>  			goto fail_nomem_vmi_store;
>  
>  		mm->map_count++;
> -		if (!(tmp->vm_flags & VM_WIPEONFORK))
> -			retval = copy_page_range(tmp, mpnt);

I have a vague memory that copy_page_range should be called after vma is inserted into the i_mmap tree.
Or there might be a problem:

dup_mmap			remove_migration_ptes
 copy_page_range -- Child process copys migration entry from parent
				 rmap_walk
				  rmap_walk_file
				   i_mmap_lock_read(mapping);
				   vma_interval_tree_foreach
				    remove_migration_pte -- The vma of child process is still invisible
				    So migration entry lefts in the child process's address space.
				   i_mmap_unlock_read(mapping);
 i_mmap_lock_write(mapping);
 vma_interval_tree_insert_after
  -- To late! Child process has stale migration entry left while migration is already done!
 i_mmap_unlock_write(mapping);
				
But I'm not really sure. I might miss something.
Thanks.
.





[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux