On Mon, Oct 07, 2024 at 11:42:04PM +0200, Jann Horn wrote: > In mremap(), move_page_tables() looks at the type of the PMD entry and the > specified address range to figure out by which method the next chunk of > page table entries should be moved. > At that point, the mmap_lock is held in write mode, but no rmap locks are > held yet. For PMD entries that point to page tables and are fully covered > by the source address range, move_pgt_entry(NORMAL_PMD, ...) is called, > which first takes rmap locks, then does move_normal_pmd(). > move_normal_pmd() takes the necessary page table locks at source and > destination, then moves an entire page table from the source to the > destination. > > The problem is: The rmap locks, which protect against concurrent page table > removal by retract_page_tables() in the THP code, are only taken after the > PMD entry has been read and it has been decided how to move it. > So we can race as follows (with two processes that have mappings of the > same tmpfs file that is stored on a tmpfs mount with huge=advise); note > that process A accesses page tables through the MM while process B does it > through the file rmap: > > > process A process B > ========= ========= > mremap > mremap_to > move_vma > move_page_tables > get_old_pmd > alloc_new_pmd > *** PREEMPT *** > madvise(MADV_COLLAPSE) > do_madvise > madvise_walk_vmas > madvise_vma_behavior > madvise_collapse > hpage_collapse_scan_file > collapse_file > retract_page_tables > i_mmap_lock_read(mapping) > pmdp_collapse_flush > i_mmap_unlock_read(mapping) > move_pgt_entry(NORMAL_PMD, ...) > take_rmap_locks > move_normal_pmd > drop_rmap_locks > > When this happens, move_normal_pmd() can end up creating bogus PMD entries > in the line `pmd_populate(mm, new_pmd, pmd_pgtable(pmd))`. > The effect depends on arch-specific and machine-specific details; on x86, > you can end up with physical page 0 mapped as a page table, which is likely > exploitable for user->kernel privilege escalation. > > > Fix the race by letting process B recheck that the PMD still points to a > page table after the rmap locks have been taken. Otherwise, we bail and let > the caller fall back to the PTE-level copying path, which will then bail > immediately at the pmd_none() check. > > Bug reachability: Reaching this bug requires that you can create shmem/file > THP mappings - anonymous THP uses different code that doesn't zap stuff > under rmap locks. File THP is gated on an experimental config flag > (CONFIG_READ_ONLY_THP_FOR_FS), so on normal distro kernels you need shmem > THP to hit this bug. As far as I know, getting shmem THP normally requires > that you can mount your own tmpfs with the right mount flags, which would > require creating your own user+mount namespace; though I don't know if some > distros maybe enable shmem THP by default or something like that. Any repro? > > Bug impact: This issue can likely be used for user->kernel privilege > escalation when it is reachable. > > Cc: stable@xxxxxxxxxxxxxxx > Fixes: 1d65b771bc08 ("mm/khugepaged: retract_page_tables() without mmap or vma lock") > Closes: https://project-zero.issues.chromium.org/371047675 > Co-developed-by: David Hildenbrand <david@xxxxxxxxxx> > Signed-off-by: Jann Horn <jannh@xxxxxxxxxx> Ugh man this PMD locking thing is horrid. This is subtle and deeply painful and I feel like we need some better way of expressing this locking. Documenting this stuff, or at least VMA side remains on my todo list... Anyway this patch looks sane: Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > --- > @David: please confirm we can add your Signed-off-by to this patch after > the Co-developed-by. > (Context: David basically wrote the entire patch except for the commit > message.) The fact David did that automatically gives me confidence in this change from mm side. :) > > @akpm: This replaces the previous "[PATCH] mm/mremap: Prevent racing > change of old pmd type". > --- > mm/mremap.c | 11 +++++++++-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/mm/mremap.c b/mm/mremap.c > index 24712f8dbb6b..dda09e957a5d 100644 > --- a/mm/mremap.c > +++ b/mm/mremap.c > @@ -238,6 +238,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr, > { > spinlock_t *old_ptl, *new_ptl; > struct mm_struct *mm = vma->vm_mm; > + bool res = false; > pmd_t pmd; > > if (!arch_supports_page_table_move()) > @@ -277,19 +278,25 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr, > if (new_ptl != old_ptl) > spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); > > - /* Clear the pmd */ > pmd = *old_pmd; > + > + /* Racing with collapse? */ > + if (unlikely(!pmd_present(pmd) || pmd_leaf(pmd))) > + goto out_unlock; > + /* Clear the pmd */ > pmd_clear(old_pmd); > + res = true; > > VM_BUG_ON(!pmd_none(*new_pmd)); > > pmd_populate(mm, new_pmd, pmd_pgtable(pmd)); > flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE); > +out_unlock: > if (new_ptl != old_ptl) > spin_unlock(new_ptl); > spin_unlock(old_ptl); > > - return true; > + return res; > } > #else > static inline bool move_normal_pmd(struct vm_area_struct *vma, > > --- > base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b > change-id: 20241007-move_normal_pmd-vs-collapse-fix-2-387e9a68c7d6 > -- > Jann Horn <jannh@xxxxxxxxxx> >