On Fri 20-07-12 15:11:08, Mel Gorman wrote: > Sorry for the resend, I did not properly refresh Cong Wang's suggested > fix. This V2 is still the mmap_sem approach that fixes a potential deadlock > problem pointed out by Michal. > > Changelog since V1 > o Correct cut&paste error in race description (hugh) > o Handle potential deadlock during fork (mhocko) > o Reorder unlocking (wangcong) > > If a process creates a large hugetlbfs mapping that is eligible for page > table sharing and forks heavily with children some of whom fault and > others which destroy the mapping then it is possible for page tables to > get corrupted. Some teardowns of the mapping encounter a "bad pmd" and > output a message to the kernel log. The final teardown will trigger a > BUG_ON in mm/filemap.c. > > This was reproduced in 3.4 but is known to have existed for a long time > and goes back at least as far as 2.6.37. It was probably was introduced in > 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages > look like this; > > [ ..........] Lots of bad pmd messages followed by this > [ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7). > [ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7). > [ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7). > [ 127.186778] ------------[ cut here ]------------ > [ 127.186781] kernel BUG at mm/filemap.c:134! > [ 127.186782] invalid opcode: 0000 [#1] SMP > [ 127.186783] CPU 7 > [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod > [ 127.186801] > [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR > [ 127.186804] RIP: 0010:[<ffffffff810ed6ce>] [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160 > [ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002 > [ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0 > [ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00 > [ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003 > [ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8 > [ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8 > [ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000 > [ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0 > [ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0) > [ 127.186821] Stack: > [ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b > [ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98 > [ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000 > [ 127.186827] Call Trace: > [ 127.186829] [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80 > [ 127.186832] [<ffffffff811bc925>] truncate_hugepages+0x115/0x220 > [ 127.186834] [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30 > [ 127.186837] [<ffffffff811655c7>] evict+0xa7/0x1b0 > [ 127.186839] [<ffffffff811657a3>] iput_final+0xd3/0x1f0 > [ 127.186840] [<ffffffff811658f9>] iput+0x39/0x50 > [ 127.186842] [<ffffffff81162708>] d_kill+0xf8/0x130 > [ 127.186843] [<ffffffff81162812>] dput+0xd2/0x1a0 > [ 127.186845] [<ffffffff8114e2d0>] __fput+0x170/0x230 > [ 127.186848] [<ffffffff81236e0e>] ? rb_erase+0xce/0x150 > [ 127.186849] [<ffffffff8114e3ad>] fput+0x1d/0x30 > [ 127.186851] [<ffffffff81117db7>] remove_vma+0x37/0x80 > [ 127.186853] [<ffffffff81119182>] do_munmap+0x2d2/0x360 > [ 127.186855] [<ffffffff811cc639>] sys_shmdt+0xc9/0x170 > [ 127.186857] [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b > [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0 > [ 127.186868] RIP [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160 > [ 127.186870] RSP <ffff8804144b5c08> > [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]--- > > The bug is a race and not always easy to reproduce. To reproduce it I was > doing the following on a single socket I7-based machine with 16G of RAM. > > $ hugeadm --pool-pages-max DEFAULT:13G > $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax > $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall > $ for i in `seq 1 9000`; do ./hugetlbfs-test; done > > On my particular machine, it usually triggers within 10 minutes but enabling > debug options can change the timing such that it never hits. Once the bug is > triggered, the machine is in trouble and needs to be rebooted. The machine > will respond but processes accessing proc like "ps aux" will hang due to > the BUG_ON. shutdown will also hang and needs a hard reset or a sysrq-b. > > The test case was mostly written by Michal Hocko with a few minor changes > by me to reproduce this bug. Michal did a lot of heavy lifting eliminating > possible sources of the race and saved me the embarrassment of posting a > completely broken patch yesterday. He did not see this patch before > going to the lists so any critical flaws are all mine! > > The basic problem is a race between page table sharing and teardown. For > the most part page table sharing depends on i_mmap_mutex. In some cases, > it is also taking the mm->page_table_lock for the PTE updates but with > shared page tables, it is the i_mmap_mutex that is more important. > > Unfortunately it appears to be also insufficient. Consider the following > situation > > Process A Process B > --------- --------- > hugetlb_fault shmdt > LockWrite(mmap_sem) > do_munmap > unmap_region > unmap_vmas > unmap_single_vma > unmap_hugepage_range > Lock(i_mmap_mutex) > Lock(mm->page_table_lock) > huge_pmd_unshare/unmap tables <--- (1) > Unlock(mm->page_table_lock) > Unlock(i_mmap_mutex) > huge_pte_alloc ... > Lock(i_mmap_mutex) ... > vma_prio_walk, find svma, spte ... > Lock(mm->page_table_lock) ... > share spte ... > Unlock(mm->page_table_lock) ... > Unlock(i_mmap_mutex) ... > hugetlb_no_page <--- (2) > free_pgtables > unlink_file_vma > hugetlb_free_pgd_range > remove_vma_list > > In this scenario, it is possible for Process A to share page tables with > Process B that is trying to tear them down. The i_mmap_mutex on its own > does not prevent Process A walking Process B's page tables. At (1) above, > the page tables are not shared yet so it unmaps the PMDs. Process A sets > up page table sharing and at (2) faults a new entry. Process B then trips > up on it in free_pgtables. > > This patch takes the mmap_sem for read and then the page_table_lock of > address spaces being considered for page table sharing. I verified that > page table sharing still occurs using the awesome technology of printk > to spit out a message when huge_pmd_share is successful. libhugetlbfs > regression test suite passed. > > I strongly suggest this be treated as a -stable candidate if it is merged. > > Test program is as follows. > > ==== CUT HERE ==== > > static size_t huge_page_size = (2UL << 20); > static size_t nr_huge_page_A = 512; > static size_t nr_huge_page_B = 5632; > > unsigned int get_random(unsigned int max) > { > struct timeval tv; > > gettimeofday(&tv, NULL); > srandom(tv.tv_usec); > return random() % max; > } > > static void play(void *addr, size_t size) > { > unsigned char *start = addr, > *end = start + size, > *a; > start += get_random(size/2); > > /* we could itterate on huge pages but let's give it more time. */ > for (a = start; a < end; a += 4096) > *a = 0; > } > > int main(int argc, char **argv) > { > key_t key = IPC_PRIVATE; > size_t sizeA = nr_huge_page_A * huge_page_size; > size_t sizeB = nr_huge_page_B * huge_page_size; > int shmidA, shmidB; > void *addrA = NULL, *addrB = NULL; > int nr_children = 300, n = 0; > > if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) { > perror("shmget:"); > return 1; > } > > if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) { > perror("shmat"); > return 1; > } > if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) { > perror("shmget:"); > return 1; > } > > if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) { > perror("shmat"); > return 1; > } > > fork_child: > switch(fork()) { > case 0: > switch (n%3) { > case 0: > play(addrA, sizeA); > break; > case 1: > play(addrB, sizeB); > break; > case 2: > break; > } > break; > case -1: > perror("fork:"); > break; > default: > if (++n < nr_children) > goto fork_child; > play(addrA, sizeA); > break; > } > shmdt(addrA); > shmdt(addrB); > do { > wait(NULL); > } while (--n > 0); > shmctl(shmidA, IPC_RMID, NULL); > shmctl(shmidB, IPC_RMID, NULL); > return 0; > } > > Signed-off-by: Mel Gorman <mgorman@xxxxxxx> Yes this looks correct. mmap_sem will make sure that unmap_vmas and free_pgtables are executed atomicaly wrt. huge_pmd_share so it doesn't see non-NULL spte on the way out. I am just wondering whether we need the page_table_lock as well. It is not harmful but I guess we can drop it because both exit_mmap and shmdt are not taking it and mmap_sem is sufficient for them. One more nit bellow. I will send my version of the fix which took another path as a reply to this email to have something to compare with. [...] > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c > index f6679a7..944b2df 100644 > --- a/arch/x86/mm/hugetlbpage.c > +++ b/arch/x86/mm/hugetlbpage.c > @@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr) > /* > * search for a shareable pmd page for hugetlb. > */ > -static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > +static void huge_pmd_share(struct mm_struct *mm, struct mm_struct *locked_mm, > + unsigned long addr, pud_t *pud) > { > struct vm_area_struct *vma = find_vma(mm, addr); > struct address_space *mapping = vma->vm_file->f_mapping; > @@ -68,14 +69,40 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) > struct vm_area_struct *svma; > unsigned long saddr; > pte_t *spte = NULL; > + spinlock_t *spage_table_lock = NULL; > + struct rw_semaphore *smmap_sem = NULL; > > if (!vma_shareable(vma, addr)) > return; > > +retry: > mutex_lock(&mapping->i_mmap_mutex); > vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) { > if (svma == vma) > continue; > + if (svma->vm_mm == vma->vm_mm) > + continue; > + > + /* > + * The target mm could be in the process of tearing down > + * its page tables and the i_mmap_mutex on its own is > + * not sufficient. To prevent races against teardown and > + * pagetable updates, we acquire the mmap_sem and pagetable > + * lock of the remote address space. down_read_trylock() > + * is necessary as the other process could also be trying > + * to share pagetables with the current mm. In the fork > + * case, we are already both mm's so check for that > + */ > + if (locked_mm != svma->vm_mm) { > + if (!down_read_trylock(&svma->vm_mm->mmap_sem)) { > + mutex_unlock(&mapping->i_mmap_mutex); > + goto retry; > + } > + smmap_sem = &svma->vm_mm->mmap_sem; > + } > + > + spage_table_lock = &svma->vm_mm->page_table_lock; > + spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING); > > saddr = page_table_shareable(svma, vma, addr, idx); > if (saddr) { [...] > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index ae8f708..4832277 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -2244,7 +2244,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > src_pte = huge_pte_offset(src, addr); > if (!src_pte) > continue; > - dst_pte = huge_pte_alloc(dst, addr, sz); > + dst_pte = huge_pte_alloc(dst, src, addr, sz); > if (!dst_pte) > goto nomem; > > @@ -2745,7 +2745,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, > VM_FAULT_SET_HINDEX(h - hstates); > } > > - ptep = huge_pte_alloc(mm, address, huge_page_size(h)); > + ptep = huge_pte_alloc(mm, NULL, address, huge_page_size(h)); strictly speaking we should provide current->mm here because we are in the page fault path and mmap_sem is held for reading. This doesn't matter here though because huge_pmd_share will take it for reading so nesting wouldn't hurt. Maybe a small comment that this is intentional and correct would be nice. > if (!ptep) > return VM_FAULT_OOM; > > > -- > Mel Gorman > SUSE Labs -- Michal Hocko SUSE Labs SUSE LINUX s.r.o. Lihovarska 1060/12 190 00 Praha 9 Czech Republic -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>