This is a note to let you know that I've just added the patch titled

    userfaultfd: fix checks for huge PMDs

to the 6.1-stable tree which can be found at:

    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     userfaultfd-fix-checks-for-huge-pmds.patch
and it can be found in the queue-6.1 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.


commit 170b9e3f04840e9c1a1e14c687b689e84d8f0d9d
Author: Jann Horn <jannh@xxxxxxxxxx>
Date:   Tue Aug 13 22:25:21 2024 +0200

    userfaultfd: fix checks for huge PMDs

    [ Upstream commit 71c186efc1b2cf1aeabfeff3b9bd5ac4c5ac14d8 ]

    Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2.

    The pmd_trans_huge() code in mfill_atomic() is wrong in three different
    ways depending on kernel version:

    1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you
       hit the right two race windows) - I've tested this in a kernel build
       with some extra mdelay() calls. See the commit message for a
       description of the race scenario. On older kernels (before 6.5), I
       think the same bug can even theoretically lead to accessing transhuge
       page contents as a page table if you hit the right 5 narrow race
       windows (I haven't tested this case).

    2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for
       detecting PMDs that don't point to page tables.

       On older kernels (before 6.5), you'd just have to win a single fairly
       wide race to hit this.

       I've tested this on 6.1 stable by racing migration (with a mdelay()
       patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86 VM,
       that causes a kernel oops in ptlock_ptr().

    3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed
       to yank page tables out from under us (though I haven't tested that),
       so I think the BUG_ON() checks in mfill_atomic() are just wrong.

    I decided to write two separate fixes for these (one fix for bugs 1+2,
    one fix for bug 3), so that the first fix can be backported to kernels
    affected by bugs 1+2.

    This patch (of 2):

    This fixes two issues.

    I discovered that the following race can occur:

      mfill_atomic                other thread
      ============                ============
                                  <zap PMD>
      pmdp_get_lockless() [reads none pmd]
      <bail if trans_huge>
      <if none:>
                                  <pagefault creates transhuge zeropage>
      __pte_alloc  [no-op]
                                  <zap PMD>
      <bail if pmd_trans_huge(*dst_pmd)>
      BUG_ON(pmd_none(*dst_pmd))

    I have experimentally verified this in a kernel with extra mdelay()
    calls; the BUG_ON(pmd_none(*dst_pmd)) triggers.

    On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow
    pte_offset_map[_lock]() to fail"), this can't lead to anything worse than
    a BUG_ON(), since the page table access helpers are actually designed to
    deal with page tables concurrently disappearing; but on older kernels
    (<=6.4), I think we could probably theoretically race past the two
    BUG_ON() checks and end up treating a hugepage as a page table.
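    To make the fix concrete, the flow after this patch looks roughly as
    follows. This is a simplified sketch of the hunk in the diff at the end
    of this mail, not the literal code: the surrounding copy loop is elided
    and the err/break error handling is collapsed into direct returns. The
    combined check also rejects the non-page-table PMD types discussed next.

	pmd_t dst_pmdval = pmdp_get_lockless(dst_pmd);

	/* Allocate a page table if there is none yet. */
	if (unlikely(pmd_none(dst_pmdval)) &&
	    unlikely(__pte_alloc(dst_mm, dst_pmd)))
		return -ENOMEM;

	/*
	 * Re-read the PMD *after* __pte_alloc(): it may have changed
	 * concurrently, and every later decision must be based on this
	 * one snapshot instead of fresh dereferences of *dst_pmd.
	 */
	dst_pmdval = pmdp_get_lockless(dst_pmd);

	/*
	 * Bail on anything that is not a pointer to a page table:
	 * none, THP, devmap, or a swap/migration entry.
	 */
	if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
		     pmd_devmap(dst_pmdval)))
		return -EEXIST;
	if (unlikely(pmd_bad(dst_pmdval)))
		return -EFAULT;

    Deciding everything from a single pmdp_get_lockless() snapshot, rather
    than re-dereferencing *dst_pmd, is what closes the race windows described
    above.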
    The second issue is that, as Qi Zheng pointed out, there are other types
    of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs
    (in particular, migration PMDs).

    On <=6.4, this is worse than the first issue: If mfill_atomic() runs on
    a PMD that contains a migration entry (which just requires winning a
    single, fairly wide race), it will pass the PMD to pte_offset_map_lock(),
    which assumes that the PMD points to a page table.

    Breakage follows: First, the kernel tries to take the PTE lock (which
    will crash or maybe worse if there is no "struct page" for the address
    bits in the migration entry PMD - I think at least on X86 there usually
    is no corresponding "struct page" thanks to the PTE inversion mitigation,
    amd64 looks different).

    If that didn't crash, the kernel would next try to write a PTE into what
    it wrongly thinks is a page table.

    As part of fixing these issues, get rid of the check for pmd_trans_huge()
    before __pte_alloc() - that's redundant, we're going to have to check for
    that after the __pte_alloc() anyway.

    Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels.

    Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-0-5efa61078a41@xxxxxxxxxx
    Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@xxxxxxxxxx
    Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
    Signed-off-by: Jann Horn <jannh@xxxxxxxxxx>
    Acked-by: David Hildenbrand <david@xxxxxxxxxx>
    Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
    Cc: Hugh Dickins <hughd@xxxxxxxxxx>
    Cc: Jann Horn <jannh@xxxxxxxxxx>
    Cc: Pavel Emelyanov <xemul@xxxxxxxxxxxxx>
    Cc: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
    Cc: <stable@xxxxxxxxxxxxxxx>
    Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 5d873aadec76..50c01a7eb705 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -642,21 +642,23 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		}
 
 		dst_pmdval = pmdp_get_lockless(dst_pmd);
-		/*
-		 * If the dst_pmd is mapped as THP don't
-		 * override it and just be strict.
-		 */
-		if (unlikely(pmd_trans_huge(dst_pmdval))) {
-			err = -EEXIST;
-			break;
-		}
 		if (unlikely(pmd_none(dst_pmdval)) &&
 		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
 			err = -ENOMEM;
 			break;
 		}
-		/* If an huge pmd materialized from under us fail */
-		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+		dst_pmdval = pmdp_get_lockless(dst_pmd);
+		/*
+		 * If the dst_pmd is THP don't override it and just be strict.
+		 * (This includes the case where the PMD used to be THP and
+		 * changed back to none after __pte_alloc().)
+		 */
+		if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
+			     pmd_devmap(dst_pmdval))) {
+			err = -EEXIST;
+			break;
+		}
+		if (unlikely(pmd_bad(dst_pmdval))) {
 			err = -EFAULT;
 			break;
 		}
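For anyone reviewing the backport, here is a rough map of the PMD states
that the new checks distinguish. This is an illustrative kernel-style
sketch, not code from the patch, and the exact predicate semantics vary
slightly by architecture:

	pmd_t pmdval = pmdp_get_lockless(pmdp);

	if (pmd_none(pmdval)) {
		/* Empty: a page table may be allocated here. */
	} else if (!pmd_present(pmdval)) {
		/*
		 * Swap/migration entry: no page table behind it. The old
		 * code let this case reach pte_offset_map_lock().
		 */
	} else if (pmd_trans_huge(pmdval)) {
		/* Transhuge mapping: the only case the old check caught. */
	} else if (pmd_devmap(pmdval)) {
		/* Huge devmap mapping: present, but not a page table. */
	} else if (pmd_bad(pmdval)) {
		/* Corrupt/unexpected: bail rather than guess. */
	} else {
		/* Points to a page table: pte_offset_map_lock() is safe. */
	}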