Re: [PATCH 3.12 78/78] mm: let mm_find_pmd fix buggy race with THP fault

Hugh Dickins <hughd@xxxxxxxxxx> · Fri, 9 Jan 2015 21:01:16 -0800 (PST)

On Fri, 9 Jan 2015, Jiri Slaby wrote:

> From: Hugh Dickins <hughd@xxxxxxxxxx>
> 
> 3.12-stable review patch.  If anyone has any objections, please let me know.
> 
> ===============
> 
> commit f72e7dcdd25229446b102e587ef2f826f76bff28 upstream.
> 
> Trinity has reported:
> 
>     BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>     IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
>     CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G        W
>                             3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
>     lock_acquire (arch/x86/include/asm/current.h:14
>                   kernel/locking/lockdep.c:3602)
>     _raw_spin_lock (include/linux/spinlock_api_smp.h:143
>                     kernel/locking/spinlock.c:151)
>     remove_migration_pte (mm/migrate.c:137)
>     rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
>     remove_migration_ptes (mm/migrate.c:224)
>     migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
>     migrate_misplaced_page (mm/migrate.c:1733)
>     __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
>     handle_mm_fault (mm/memory.c:3948)
>     __get_user_pages (mm/memory.c:1851)
>     __mlock_vma_pages_range (mm/mlock.c:255)
>     __mm_populate (mm/mlock.c:711)
>     SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
> 
> I believe this comes about because, whereas collapsing and splitting THP
> functions take anon_vma lock in write mode (which excludes concurrent
> rmap walks), faulting THP functions (write protection and misplaced
> NUMA) do not - and mostly they do not need to.
> 
> But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
> an instant (indeed, for a long instant, given the inter-CPU TLB flush in
> there), leaves *pmd neither present not trans_huge.
> 
> Which can confuse a concurrent rmap walk, as when removing migration
> ptes, seen in the dumped trace.  Although that rmap walk has a 4k page
> to insert, anon_vmas containing THPs are in no way segregated from
> 4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
> that instant when a trans_huge pmd is temporarily absent.
> 
> I don't think we need strengthen the locking at the THP end: it's easily
> handled with an ACCESS_ONCE() before testing both conditions.
> 
> And since mm_find_pmd() had only one caller who wanted a THP rather than
> a pmd, let's slightly repurpose it to fail when it hits a THP or
> non-present pmd, and open code split_huge_page_address() again.
> 
> Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> Reported-by: Sasha Levin <sasha.levin@xxxxxxxxxx>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
> Cc: Konstantin Khlebnikov <koct9i@xxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Bob Liu <bob.liu@xxxxxxxxxx>
> Cc: Christoph Lameter <cl@xxxxxxxxxx>
> Cc: Dave Jones <davej@xxxxxxxxxx>
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Jiri Slaby <jslaby@xxxxxxx>
> ---
>  mm/huge_memory.c | 18 ++++++++++++------
>  mm/ksm.c         |  1 -
>  mm/migrate.c     |  2 --
>  mm/rmap.c        | 12 ++++++++----
>  4 files changed, 20 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e497843f5f65..04d17ba00893 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2408,8 +2408,6 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	pmd = mm_find_pmd(mm, address);
>  	if (!pmd)
>  		goto out;
> -	if (pmd_trans_huge(*pmd))
> -		goto out;
>  
>  	anon_vma_lock_write(vma->anon_vma);
>  
> @@ -2508,8 +2506,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  	pmd = mm_find_pmd(mm, address);
>  	if (!pmd)
>  		goto out;
> -	if (pmd_trans_huge(*pmd))
> -		goto out;
>  
>  	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
>  	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> @@ -2863,12 +2859,22 @@ void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
>  static void split_huge_page_address(struct mm_struct *mm,
>  				    unsigned long address)
>  {
> +	pgd_t *pgd;
> +	pud_t *pud;
>  	pmd_t *pmd;
>  
>  	VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
>  
> -	pmd = mm_find_pmd(mm, address);
> -	if (!pmd)
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		return;
> +
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		return;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (!pmd_present(*pmd))
>  		return;
>  	/*
>  	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
> diff --git a/mm/ksm.c b/mm/ksm.c
> index c78fff1e9eae..29cbd06c4884 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -945,7 +945,6 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  	pmd = mm_find_pmd(mm, addr);
>  	if (!pmd)
>  		goto out;
> -	BUG_ON(pmd_trans_huge(*pmd));
>  
>  	mmun_start = addr;
>  	mmun_end   = addr + PAGE_SIZE;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d5c84b0a5243..fac5fa0813c4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -136,8 +136,6 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  		pmd = mm_find_pmd(mm, addr);
>  		if (!pmd)
>  			goto out;
> -		if (pmd_trans_huge(*pmd))
> -			goto out;
>  
>  		ptep = pte_offset_map(pmd, addr);
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5b8675ccc1ef..440c71c43b8d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -571,6 +571,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>  	pgd_t *pgd;
>  	pud_t *pud;
>  	pmd_t *pmd = NULL;
> +	pmd_t pmde;
>  
>  	pgd = pgd_offset(mm, address);
>  	if (!pgd_present(*pgd))
> @@ -581,7 +582,13 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>  		goto out;
>  
>  	pmd = pmd_offset(pud, address);
> -	if (!pmd_present(*pmd))
> +	/*
> +	 * Some THP functions use the sequence pmdp_clear_flush(), set_pmd_at()
> +	 * without holding anon_vma lock for write.  So when looking for a
> +	 * genuine pmde (in which to find pte), test present and !THP together.
> +	 */
> +	pmde = ACCESS_ONCE(*pmd);
> +	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
>  		pmd = NULL;
>  out:
>  	return pmd;
> @@ -617,9 +624,6 @@ pte_t *__page_check_address(struct page *page, struct mm_struct *mm,
>  	if (!pmd)
>  		return NULL;
>  
> -	if (pmd_trans_huge(*pmd))
> -		return NULL;
> -
>  	pte = pte_offset_map(pmd, address);
>  	/* Make a quick check before getting the lock */
>  	if (!sync && !pte_present(*pte)) {
> -- 
> 2.2.1

Fine for this to go in, but there is one catch, which I discovered when
backporting to v3.11: it needed one more hunk.  I haven't checked your
base tree, but if this applies then I believe you need it - most of the
time no problem, but it can case page migration to fail to find a
migration entry it inserted earlier, then BUG_ON(!PageLocked(p)) in
migration_entry_to_page() soon after.  Here's what I wrote back then:

Note on rebase to v3.11: added a hunk to replace the use of mm_find_pmd()
in page_check_address_pmd().  This call had been similarly replaced by
the time of my v3.16 commit, in Kirill Shutemov's v3.15 b5a8cad376ee
("thp: close race between split and zap huge pages"): which we do not
need as such, since it's fixing v3.13 117b0791ac42 ("mm, thp: move ptl
taking inside page_check_address_pmd()"), from a split page-table-lock
series we are not backporting.  But without this additional hunk, rmap
sometimes broke when the new semantic for mm_find_pmd() was used here.

(Adding Kirill to Cc: shouldn't he have been Cc'ed already?)

Hugh

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1584,12 +1584,20 @@ pmd_t *page_check_address_pmd(struct page *page,
 			      unsigned long address,
 			      enum page_check_address_pmd_flag flag)
 {
+	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd, *ret = NULL;
 
 	if (address & ~HPAGE_PMD_MASK)
 		goto out;
 
-	pmd = mm_find_pmd(mm, address);
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+	pmd = pmd_offset(pud, address);
 	if (!pmd)
 		goto out;
 	if (pmd_none(*pmd))
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html