Re: BUG at mm/memory.c:1489!

Hugh Dickins <hughd@xxxxxxxxxx> · Wed, 28 May 2014 17:33:11 -0700 (PDT)

On Wed, 28 May 2014, Michael Ellerman wrote:
> Hey folks,
> 
> Anyone seen this before? Trinity hit it just now:
> 
> Linux Blade312-5 3.15.0-rc7 #306 SMP Wed May 28 17:51:18 EST 2014 ppc64
> 
> [watchdog] 27853 iterations. [F:22642 S:5174 HI:1276]
> ------------[ cut here ]------------
> kernel BUG at /home/michael/mmk-build/flow/mm/memory.c:1489!
> cpu 0xc: Vector: 700 (Program Check) at [c000000384eaf960]
>     pc: c0000000001ad6f0: .follow_page_mask+0x90/0x650
>     lr: c0000000001ad6d8: .follow_page_mask+0x78/0x650
>     sp: c000000384eafbe0
>    msr: 8000000000029032
>   current = 0xc0000003c27e1bc0
>   paca    = 0xc000000001dc3000   softe: 0        irq_happened: 0x01
>     pid   = 20800, comm = trinity-c12
> kernel BUG at /home/michael/mmk-build/flow/mm/memory.c:1489!
> enter ? for help
> [c000000384eafcc0] c0000000001e5514 .SyS_move_pages+0x524/0x7d0
> [c000000384eafe30] c00000000000a1d8 syscall_exit+0x0/0x98
> --- Exception: c01 (System Call) at 00003fff795f30a8
> SP (3ffff958f290) is in userspace
> 
> I've left it in the debugger, can dig into it a bit more tomorrow
> if anyone has any clues.

Thanks for leaving it overnight, but this one is quite obvious,
so go ahead and reboot whenever suits you.

Trinity didn't even need to do anything bizarre to get this: that
ordinary path simply didn't get tried on powerpc or ia64 before.

Here's a patch which should fix it for you, but I believe leaves
a race in common with other architectures.  I must turn away to
other things, and hope Naoya-san can fix up the locking separately
(or point out why it's already safe).

[PATCH] mm: fix move_pages follow_page huge_addr BUG

v3.12's e632a938d914 ("mm: migrate: add hugepage migration code to
move_pages()") is okay on most arches, but on follow_huge_addr-style
arches ia64 and powerpc, it hits my old BUG_ON(flags & FOLL_GET)
from v2.6.15 deceb6cd17e6 ("mm: follow_page with inner ptlock").

The point of the BUG_ON was that nothing needed FOLL_GET there at
the time, and it was not clear that we have sufficient locking to
use get_page() safely here on the outside - maybe the page found has
already been freed and even reused when follow_huge_addr() returns.

I suspect that e632a938d914's use of get_page() after return from
follow_huge_pmd() has the same problem: what prevents a racing
instance of move_pages() from already migrating away and freeing
that page by then?  A reference to the page should be taken while
holding suitable lock (huge_pte_lockptr?), to serialize against
concurrent migration.

But I'm not prepared to rework the hugetlb locking here myself;
so for now just supply a patch to copy e632a938d914's get_page()
after follow_huge_pmd() to after follow_huge_addr(): removing
the BUG_ON(flags & FOLL_GET), but probably leaving a race.

Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
Reported-by: Michael Ellerman <mpe@xxxxxxxxxxxxxx>
Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx # 3.12+
---
Whether this is a patch that should go in without fixing the locking,
I don't know.  An unlikely race is better than a triggerable BUG?
Or perhaps I'm just wrong about there being any such race.

 mm/memory.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

--- 3.15-rc7/mm/memory.c	2014-04-27 23:55:53.608801152 -0700
+++ linux/mm/memory.c	2014-05-28 13:05:48.340124615 -0700
@@ -1486,7 +1486,17 @@ struct page *follow_page_mask(struct vm_
 
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
-		BUG_ON(flags & FOLL_GET);
+		if (page && (flags & FOLL_GET)) {
+			/*
+			 * Refcount on tail pages are not well-defined and
+			 * shouldn't be taken. The caller should handle a NULL
+			 * return when trying to follow tail pages.
+			 */
+			if (PageHead(page))
+				get_page(page);
+			else
+				page = NULL;
+		}
 		goto out;
 	}
 
--
To unsubscribe from this list: send the line "unsubscribe trinity" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html