Re: [PATCH] mm/mremap: Fix move_normal_pmd/retract_page_tables race

Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> · Wed, 9 Oct 2024 15:44:12 +0100

On Mon, Oct 07, 2024 at 11:42:04PM +0200, Jann Horn wrote:
> In mremap(), move_page_tables() looks at the type of the PMD entry and the
> specified address range to figure out by which method the next chunk of
> page table entries should be moved.
> At that point, the mmap_lock is held in write mode, but no rmap locks are
> held yet. For PMD entries that point to page tables and are fully covered
> by the source address range, move_pgt_entry(NORMAL_PMD, ...) is called,
> which first takes rmap locks, then does move_normal_pmd().
> move_normal_pmd() takes the necessary page table locks at source and
> destination, then moves an entire page table from the source to the
> destination.
>
> The problem is: The rmap locks, which protect against concurrent page table
> removal by retract_page_tables() in the THP code, are only taken after the
> PMD entry has been read and it has been decided how to move it.
> So we can race as follows (with two processes that have mappings of the
> same tmpfs file that is stored on a tmpfs mount with huge=advise); note
> that process A accesses page tables through the MM while process B does it
> through the file rmap:
>
>
> process A                      process B
> =========                      =========
> mremap
>   mremap_to
>     move_vma
>       move_page_tables
>         get_old_pmd
>         alloc_new_pmd
>                       *** PREEMPT ***
>                                madvise(MADV_COLLAPSE)
>                                  do_madvise
>                                    madvise_walk_vmas
>                                      madvise_vma_behavior
>                                        madvise_collapse
>                                          hpage_collapse_scan_file
>                                            collapse_file
>                                              retract_page_tables
>                                                i_mmap_lock_read(mapping)
>                                                pmdp_collapse_flush
>                                                i_mmap_unlock_read(mapping)
>         move_pgt_entry(NORMAL_PMD, ...)
>           take_rmap_locks
>           move_normal_pmd
>           drop_rmap_locks
>
> When this happens, move_normal_pmd() can end up creating bogus PMD entries
> in the line `pmd_populate(mm, new_pmd, pmd_pgtable(pmd))`.
> The effect depends on arch-specific and machine-specific details; on x86,
> you can end up with physical page 0 mapped as a page table, which is likely
> exploitable for user->kernel privilege escalation.
>
>
> Fix the race by letting process A recheck that the PMD still points to a
> page table after the rmap locks have been taken. Otherwise, we bail and let
> the caller fall back to the PTE-level copying path, which will then bail
> immediately at the pmd_none() check.
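
To make the shape of the fix concrete, here is a minimal userspace model. It
is not kernel code, and everything in it is an assumption made for
illustration: struct model_mm, model_retract() and model_move() are
hypothetical names, one pthread mutex stands in for the rmap and page table
locks, and a NULL pointer stands in for pmd_none(). The real check in the
patch is !pmd_present() || pmd_leaf(); the model only shows why the entry has
to be re-read once the lock is held.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: two "PMD slots" guarded by a single lock. */
struct model_mm {
	pthread_mutex_t lock;	/* stands in for rmap + page table locks */
	void *old_pmd;		/* NULL models pmd_none() */
	void *new_pmd;
};

/* Models the collapse side: detach the page table under the lock. */
static void *model_retract(struct model_mm *mm)
{
	void *table;

	pthread_mutex_lock(&mm->lock);
	table = mm->old_pmd;
	mm->old_pmd = NULL;
	pthread_mutex_unlock(&mm->lock);
	return table;
}

/*
 * Models the mremap side with the recheck: the entry is read only after
 * the lock is taken, and the move is refused if the table is gone.
 */
static bool model_move(struct model_mm *mm)
{
	bool res = false;
	void *entry;

	pthread_mutex_lock(&mm->lock);
	entry = mm->old_pmd;
	if (entry == NULL)	/* raced with model_retract() */
		goto out_unlock;
	mm->old_pmd = NULL;
	mm->new_pmd = entry;
	res = true;
out_unlock:
	pthread_mutex_unlock(&mm->lock);
	return res;
}
```

If a caller tests old_pmd without the lock, model_retract() then runs, and
only afterwards model_move() is called, the move returns false and new_pmd
stays untouched. That corresponds to the caller falling back to the PTE-level
path instead of installing a bogus entry.
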
>
> Bug reachability: Reaching this bug requires that you can create shmem/file
> THP mappings - anonymous THP uses different code that doesn't zap stuff
> under rmap locks. File THP is gated on an experimental config flag
> (CONFIG_READ_ONLY_THP_FOR_FS), so on normal distro kernels you need shmem
> THP to hit this bug. As far as I know, getting shmem THP normally requires
> that you can mount your own tmpfs with the right mount flags, which would
> require creating your own user+mount namespace; though I don't know if some
> distros maybe enable shmem THP by default or something like that.

Any repro?
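
Not a repro, but on the distro-default question: the global shmem THP policy
can be read from /sys/kernel/mm/transparent_hugepage/shmem_enabled, where the
active choice is shown in square brackets. Below is a minimal sketch for
checking it. The helper names active_choice() and shmem_thp_policy() are
hypothetical, and the sketch assumes the usual "a b [c] d" sysfs line format.
As far as I understand it, this knob governs the kernel-internal shmem mount,
while a tmpfs mount follows its own huge= option unless the global setting is
deny or force, so treat the result as a hint rather than a verdict.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Copy the bracketed (active) token of a sysfs choice line into out. */
static bool active_choice(const char *line, char *out, size_t outsz)
{
	const char *open = strchr(line, '[');
	const char *close = open ? strchr(open, ']') : NULL;
	size_t len;

	if (!close)
		return false;
	len = (size_t)(close - open - 1);
	if (len + 1 > outsz)	/* need room for the terminating NUL */
		return false;
	memcpy(out, open + 1, len);
	out[len] = '\0';
	return true;
}

/* Read the global shmem THP policy; returns false if it is unavailable. */
static bool shmem_thp_policy(char *out, size_t outsz)
{
	char line[256];
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/shmem_enabled", "r");
	bool ok;

	if (!f)
		return false;
	ok = fgets(line, sizeof(line), f) != NULL &&
	     active_choice(line, out, outsz);
	fclose(f);
	return ok;
}
```

Calling shmem_thp_policy(buf, sizeof(buf)) on a running system fills buf with
the active policy name, or returns false when the file is missing, for
example on a kernel built without THP support.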

>
> Bug impact: This issue can likely be used for user->kernel privilege
> escalation when it is reachable.
>
> Cc: stable@xxxxxxxxxxxxxxx
> Fixes: 1d65b771bc08 ("mm/khugepaged: retract_page_tables() without mmap or vma lock")
> Closes: https://project-zero.issues.chromium.org/371047675
> Co-developed-by: David Hildenbrand <david@xxxxxxxxxx>
> Signed-off-by: Jann Horn <jannh@xxxxxxxxxx>

Ugh man this PMD locking thing is horrid. This is subtle and deeply painful and
I feel like we need some better way of expressing this locking.

Documenting this stuff, or at least the VMA side, remains on my todo list...

Anyway this patch looks sane:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>

> ---
> @David: please confirm we can add your Signed-off-by to this patch after
> the Co-developed-by.
> (Context: David basically wrote the entire patch except for the commit
> message.)

The fact David did that automatically gives me confidence in this change
from the mm side. :)

>
> @akpm: This replaces the previous "[PATCH] mm/mremap: Prevent racing
> change of old pmd type".
> ---
>  mm/mremap.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 24712f8dbb6b..dda09e957a5d 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -238,6 +238,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  {
>  	spinlock_t *old_ptl, *new_ptl;
>  	struct mm_struct *mm = vma->vm_mm;
> +	bool res = false;
>  	pmd_t pmd;
>
>  	if (!arch_supports_page_table_move())
> @@ -277,19 +278,25 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  	if (new_ptl != old_ptl)
>  		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
>
> -	/* Clear the pmd */
>  	pmd = *old_pmd;
> +
> +	/* Racing with collapse? */
> +	if (unlikely(!pmd_present(pmd) || pmd_leaf(pmd)))
> +		goto out_unlock;
> +	/* Clear the pmd */
>  	pmd_clear(old_pmd);
> +	res = true;
>
>  	VM_BUG_ON(!pmd_none(*new_pmd));
>
>  	pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
>  	flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
> +out_unlock:
>  	if (new_ptl != old_ptl)
>  		spin_unlock(new_ptl);
>  	spin_unlock(old_ptl);
>
> -	return true;
> +	return res;
>  }
>  #else
>  static inline bool move_normal_pmd(struct vm_area_struct *vma,
>
> ---
> base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
> change-id: 20241007-move_normal_pmd-vs-collapse-fix-2-387e9a68c7d6
> --
> Jann Horn <jannh@xxxxxxxxxx>
>