Re: [PATCH] mm, thp: close race between mremap() and split_huge_page()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, May 06, 2014 at 01:13:31AM +0300, Kirill A. Shutemov wrote:
> It's critical for split_huge_page() (and migration) to catch and freeze
> all PMDs on rmap walk. It gets tricky if there's concurrent fork() or
> mremap() since usually we copy/move page table entries on dup_mm() or
> move_page_tables() without rmap lock taken. To get it work we rely on
> rmap walk order to not miss any entry. We expect to see destination VMA
> after source one to work correctly.
> 
> But after switching rmap implementation to interval tree it's not always
> possible to preserve expected walk order.
> 
> It works fine for dup_mm() since new VMA has the same vma_start_pgoff()
> / vma_last_pgoff() and explicitly insert dst VMA after src one with
> vma_interval_tree_insert_after().
> 
> But on move_vma() destination VMA can be merged into adjacent one and as
> result shifted left in interval tree. Fortunately, we can detect the
> situation and prevent race with rmap walk by moving page table entries
> under rmap lock. See commit 38a76013ad80.
> 
> Problem is that we miss the lock when we move transhuge PMD. Most likely
> this bug caused the crash[1].
> 
> [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473

It took a night but I was able to trigger crash which this patch fixes.

Test case:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define MB (1024UL*1024)
#define SIZE (4*MB)
#define BASE ((void *)0x400000000000)

int main()
{
	char *x1, *x2;

	for (;;) {
		x1 = mmap(BASE, 2 * SIZE, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_FIXED,
			-1, 0);
                if (x1 == MAP_FAILED)
                        perror("x1"), exit(1);
		x2 = mremap(x1 + SIZE, SIZE, SIZE,
				MREMAP_FIXED | MREMAP_MAYMOVE,
				x1 + 2 * SIZE);
                if (x2 == MAP_FAILED)
                        perror("x2"), exit(1);

		if (!fork())
			return 0;

		if (!fork()) {
			if (!fork())
				return 0;

			mprotect(x2, 4096, PROT_NONE);
			return 0;
		}

		x2 = mremap(x2, SIZE, SIZE,
				MREMAP_FIXED | MREMAP_MAYMOVE,
				x1 + SIZE);
		if (x2 == MAP_FAILED)
			perror("x2"), exit(1);
		munmap(x1, SIZE);
		munmap(x2, SIZE);
		while (waitpid(-1, NULL, WNOHANG) > 0);
	}
	return 0;
}

Crash:

[54438.764230] mapcount 2 page_mapcount 3
[54438.764985] ------------[ cut here ]------------
[54438.765735] kernel BUG at /home/space/kas/git/public/linux/mm/huge_memory.c:1836!
[54438.766926] invalid opcode: 0000 [#1] SMP 
[54438.767637] Modules linked in:
[54438.768078] CPU: 0 PID: 12638 Comm: test_split Not tainted 3.15.0-rc4-00001-gdb77ce6c9fe5-dirty #1282
[54438.768078] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
[54438.768078] task: ffff8804633c8410 ti: ffff88046376c000 task.ti: ffff88046376c000
[54438.768078] RIP: 0010:[<ffffffff81140594>]  [<ffffffff81140594>] split_huge_page_to_list+0x434/0x6c0
[54438.768078] RSP: 0018:ffff88046376dcc8  EFLAGS: 00010297
[54438.768078] RAX: 0000000000000003 RBX: ffff88046881c520 RCX: 0000000000000006
[54438.768078] RDX: 0000000000000006 RSI: ffff8804633c8b18 RDI: ffff8804633c8410
[54438.768078] RBP: ffff88046376dd30 R08: 0000000000000001 R09: 0000000000000000
[54438.768078] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[54438.768078] R13: 0000400000800000 R14: ffffea000ede4000 R15: 0000000400000400
[54438.768078] FS:  00007fea6a7be700(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[54438.768078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[54438.768078] CR2: 00007fea6a2db7d0 CR3: 0000000469bdf000 CR4: 00000000001407f0
[54438.768078] Stack:
[54438.768078]  ffff8804698f4020 0000400000a00000 0000000000000000 ffff880467a04900
[54438.768078]  0000000000000000 ffff880467a04880 ffff880400000002 ffff880462b5ccf8
[54438.768078]  0000400000800000 ffff88046370ac50 ffff8804698f4020 0000400000a00000
[54438.768078] Call Trace:
[54438.768078]  [<ffffffff81141050>] __split_huge_page_pmd+0xc0/0x1f0
[54438.768078]  [<ffffffff8114196e>] split_huge_page_pmd_mm+0x3e/0x40
[54438.768078]  [<ffffffff81141995>] split_huge_page_address+0x25/0x30
[54438.768078]  [<ffffffff81141a3c>] __vma_adjust_trans_huge+0x9c/0xf0
[54438.768078]  [<ffffffff8132268d>] ? __rb_insert_augmented+0xcd/0x1f0
[54438.768078]  [<ffffffff81116f06>] vma_adjust+0x626/0x6a0
[54438.768078]  [<ffffffff811170ad>] __split_vma.isra.35+0x12d/0x200
[54438.768078]  [<ffffffff81117e94>] split_vma+0x24/0x30
[54438.768078]  [<ffffffff8111a3ca>] mprotect_fixup+0x22a/0x260
[54438.768078]  [<ffffffff8111a542>] SyS_mprotect+0x142/0x230
[54438.768078]  [<ffffffff8173cb62>] system_call_fastpath+0x16/0x1b
[54438.768078] Code: 0f 1f 80 00 00 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 49 8b 16 4c 89 f0 80 
[54438.768078] RIP  [<ffffffff81140594>] split_huge_page_to_list+0x434/0x6c0
[54438.768078]  RSP <ffff88046376dcc8>
[54438.805154] ---[ end trace 12d4dde45cf392c6 ]---


-- 
 Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]