Patch "mm: fix race between __split_huge_pmd_locked() and GUP-fast" has been added to the 5.15-stable tree

This is a note to let you know that I've just added the patch titled

    mm: fix race between __split_huge_pmd_locked() and GUP-fast

to the 5.15-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     mm-fix-race-between-__split_huge_pmd_locked-and-gup-.patch
and it can be found in the queue-5.15 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit dca09ad288fc1dd6652c82f0aa90f993a357f4f8
Author: Ryan Roberts <ryan.roberts@xxxxxxx>
Date:   Wed May 1 15:33:10 2024 +0100

    mm: fix race between __split_huge_pmd_locked() and GUP-fast
    
    [ Upstream commit 3a5a8d343e1cf96eb9971b17cbd4b832ab19b8e7 ]
    
    __split_huge_pmd_locked() can be called for a present THP, a devmap
    entry, or a (non-present) migration entry.  It calls pmdp_invalidate()
    unconditionally on the pmdp and determines whether the entry was present
    only from the returned old pmd.  This is a problem for the migration
    entry case because pmd_mkinvalid(), called by pmdp_invalidate(), must
    only be called for a present pmd.
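    
    To make the problematic ordering concrete, here is a condensed sketch of
    the old flow (simplified from the mm/huge_memory.c hunk below; the
    surrounding split logic is elided):
    
        /* Old flow: invalidate first, classify afterwards. */
        old_pmd = pmdp_invalidate(vma, haddr, pmd); /* calls pmd_mkinvalid()
                                                     * even on a swap pmd */
        pmd_migration = is_pmd_migration_entry(old_pmd);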
    
    On arm64 at least, pmd_mkinvalid() will mark the pmd such that any future
    call to pmd_present() will return true.  And therefore any lockless
    pgtable walker could see the migration entry pmd in this state and start
    interpreting the fields as if it were present, leading to BadThings (TM).
    GUP-fast appears to be one such lockless pgtable walker.
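    
    A sketch of that hazard follows (illustrative only, assuming a
    simplified lockless walker; this is not the actual mm/gup.c code):
    
        pmd_t pmd = READ_ONCE(*pmdp);   /* racing with the split */
    
        if (!pmd_present(pmd))          /* on arm64, pmd_mkinvalid() on a
                                         * migration entry can make this
                                         * test pass ... */
                return 0;
        page = pmd_page(pmd);           /* ... so swap-entry bits are then
                                         * read as if they were a present
                                         * pfn and flags */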
    
    x86 does not suffer the above problem, but instead pmd_mkinvalid() will
    corrupt the offset field of the swap entry within the swap pte.  See link
    below for discussion of that problem.
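    
    For reference, x86's pmd_mkinvalid() at the time of this commit is
    approximately the following (a paraphrase, not a verbatim quote of
    arch/x86/include/asm/pgtable.h):
    
        static inline pmd_t pmd_mkinvalid(pmd_t pmd)
        {
                /* pmd_pfn()/pfn_pmd() assume a present layout; applied to
                 * a swap pmd they shift and mask swap-offset bits as if
                 * they were a pfn, corrupting the swap entry. */
                return pfn_pmd(pmd_pfn(pmd),
                               __pgprot(pmd_flags(pmd) &
                                        ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
        }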
    
    Fix all of this by only calling pmdp_invalidate() for a present pmd.  And
    for good measure, add a warning to all implementations of
    pmdp_invalidate[_ad]().  I've manually reviewed all other
    pmdp_invalidate[_ad]() call sites and believe them to be conformant.
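    
    The fixed ordering, condensed from the same mm/huge_memory.c hunk:
    
        /* New flow: classify first; only a present pmd gets invalidated. */
        pmd_migration = is_pmd_migration_entry(*pmd);
        if (unlikely(pmd_migration))
                old_pmd = *pmd;         /* read the migration entry as-is */
        else
                old_pmd = pmdp_invalidate(vma, haddr, pmd);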
    
    This is a theoretical bug found during code review.  I don't have any test
    case to trigger it in practice.
    
    Link: https://lkml.kernel.org/r/20240501143310.1381675-1-ryan.roberts@xxxxxxx
    Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@xxxxxxx/
    Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
    Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
    Reviewed-by: Zi Yan <ziy@xxxxxxxxxx>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@xxxxxxx>
    Acked-by: David Hildenbrand <david@xxxxxxxxxx>
    Cc: Andreas Larsson <andreas@xxxxxxxxxxx>
    Cc: Andy Lutomirski <luto@xxxxxxxxxx>
    Cc: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxx>
    Cc: Borislav Petkov (AMD) <bp@xxxxxxxxx>
    Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
    Cc: Christian Borntraeger <borntraeger@xxxxxxxxxxxxx>
    Cc: Christophe Leroy <christophe.leroy@xxxxxxxxxx>
    Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
    Cc: "David S. Miller" <davem@xxxxxxxxxxxxx>
    Cc: Ingo Molnar <mingo@xxxxxxxxxx>
    Cc: Jonathan Corbet <corbet@xxxxxxx>
    Cc: Mark Rutland <mark.rutland@xxxxxxx>
    Cc: Naveen N. Rao <naveen.n.rao@xxxxxxxxxxxxx>
    Cc: Nicholas Piggin <npiggin@xxxxxxxxx>
    Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
    Cc: Sven Schnelle <svens@xxxxxxxxxxxxx>
    Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
    Cc: Will Deacon <will@xxxxxxxxxx>
    Cc: <stable@xxxxxxxxxxxxxxx>
    Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/Documentation/vm/arch_pgtable_helpers.rst b/Documentation/vm/arch_pgtable_helpers.rst
index 552567d863b86..b8ae5d040b998 100644
--- a/Documentation/vm/arch_pgtable_helpers.rst
+++ b/Documentation/vm/arch_pgtable_helpers.rst
@@ -134,7 +134,8 @@ PMD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pmd_swp_clear_soft_dirty  | Clears a soft dirty swapped PMD                  |
 +---------------------------+--------------------------------------------------+
-| pmd_mkinvalid             | Invalidates a mapped PMD [1]                     |
+| pmd_mkinvalid             | Invalidates a present PMD; do not call for       |
+|                           | non-present PMD [1]                              |
 +---------------------------+--------------------------------------------------+
 | pmd_set_huge              | Creates a PMD huge mapping                       |
 +---------------------------+--------------------------------------------------+
@@ -190,7 +191,8 @@ PUD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pud_mkdevmap              | Creates a ZONE_DEVICE mapped PUD                 |
 +---------------------------+--------------------------------------------------+
-| pud_mkinvalid             | Invalidates a mapped PUD [1]                     |
+| pud_mkinvalid             | Invalidates a present PUD; do not call for       |
+|                           | non-present PUD [1]                              |
 +---------------------------+--------------------------------------------------+
 | pud_set_huge              | Creates a PUD huge mapping                       |
 +---------------------------+--------------------------------------------------+
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index da15f28c7b13a..3a22e7d970f33 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -115,6 +115,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 {
 	unsigned long old_pmd;
 
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
 	old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID);
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return __pmd(old_pmd);
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index b61426c9ef178..b65ce0c90dd0e 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1625,8 +1625,10 @@ static inline pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
 static inline pmd_t pmdp_invalidate(struct vm_area_struct *vma,
 				   unsigned long addr, pmd_t *pmdp)
 {
-	pmd_t pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
+	pmd_t pmd;
 
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
+	pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
 	return pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);
 }
 
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 9a725547578e8..946f33c1b032f 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -245,6 +245,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 {
 	pmd_t old, entry;
 
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
 	entry = __pmd(pmd_val(*pmdp) & ~_PAGE_VALID);
 	old = pmdp_establish(vma, address, pmdp, entry);
 	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f16059e9a85e7..5c2be867a2ed9 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -612,6 +612,8 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
 			 pmd_t *pmdp)
 {
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
+
 	/*
 	 * No flush is necessary. Once an invalid PTE is established, the PTE's
 	 * access and dirty bits cannot be updated.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 265ef8d1393c5..99d38f712863b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2024,32 +2024,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		return __split_huge_zero_page_pmd(vma, haddr, pmd);
 	}
 
-	/*
-	 * Up to this point the pmd is present and huge and userland has the
-	 * whole access to the hugepage during the split (which happens in
-	 * place). If we overwrite the pmd with the not-huge version pointing
-	 * to the pte here (which of course we could if all CPUs were bug
-	 * free), userland could trigger a small page size TLB miss on the
-	 * small sized TLB while the hugepage TLB entry is still established in
-	 * the huge TLB. Some CPU doesn't like that.
-	 * See http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
-	 * 383 on page 105. Intel should be safe but is also warns that it's
-	 * only safe if the permission and cache attributes of the two entries
-	 * loaded in the two TLB is identical (which should be the case here).
-	 * But it is generally safer to never allow small and huge TLB entries
-	 * for the same virtual address to be loaded simultaneously. So instead
-	 * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
-	 * current pmd notpresent (atomically because here the pmd_trans_huge
-	 * must remain set at all times on the pmd until the split is complete
-	 * for this pmd), then we flush the SMP TLB and finally we write the
-	 * non-huge version of the pmd entry with pmd_populate.
-	 */
-	old_pmd = pmdp_invalidate(vma, haddr, pmd);
-
-	pmd_migration = is_pmd_migration_entry(old_pmd);
+	pmd_migration = is_pmd_migration_entry(*pmd);
 	if (unlikely(pmd_migration)) {
 		swp_entry_t entry;
 
+		old_pmd = *pmd;
 		entry = pmd_to_swp_entry(old_pmd);
 		page = pfn_swap_entry_to_page(entry);
 		write = is_writable_migration_entry(entry);
@@ -2057,6 +2036,30 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
+		/*
+		 * Up to this point the pmd is present and huge and userland has
+		 * the whole access to the hugepage during the split (which
+		 * happens in place). If we overwrite the pmd with the not-huge
+		 * version pointing to the pte here (which of course we could if
+		 * all CPUs were bug free), userland could trigger a small page
+		 * size TLB miss on the small sized TLB while the hugepage TLB
+		 * entry is still established in the huge TLB. Some CPU doesn't
+		 * like that. See
+		 * http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
+		 * 383 on page 105. Intel should be safe but is also warns that
+		 * it's only safe if the permission and cache attributes of the
+		 * two entries loaded in the two TLB is identical (which should
+		 * be the case here). But it is generally safer to never allow
+		 * small and huge TLB entries for the same virtual address to be
+		 * loaded simultaneously. So instead of doing "pmd_populate();
+		 * flush_pmd_tlb_range();" we first mark the current pmd
+		 * notpresent (atomically because here the pmd_trans_huge must
+		 * remain set at all times on the pmd until the split is
+		 * complete for this pmd), then we flush the SMP TLB and finally
+		 * we write the non-huge version of the pmd entry with
+		 * pmd_populate.
+		 */
+		old_pmd = pmdp_invalidate(vma, haddr, pmd);
 		page = pmd_page(old_pmd);
 		if (pmd_dirty(old_pmd))
 			SetPageDirty(page);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index b0ce6c7391bf4..cc8b11724cf5a 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -194,6 +194,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
 	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return old;
@@ -204,6 +205,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
 			 pmd_t *pmdp)
 {
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
 	return pmdp_invalidate(vma, address, pmdp);
 }
 #endif



