Michal Hocko wrote:
> [CCing Kirill]
> 
> On Mon 10-02-14 09:25:01, Mizuma, Masayoshi wrote:
> > Hi,
> 
> Hi,
> 
> > This is a bug report for a memory cgroup hang-up.
> > I reproduced this using 3.14-rc1, but I couldn't in 3.7.
> > 
> > When I ran a program (see below) under a memcg limit, the process hung up.
> > Using kprobe trace, I detected the hang-up in __handle_mm_fault().
> > do_huge_pmd_wp_page(), which is called by __handle_mm_fault(), always returns
> > VM_FAULT_OOM, so it keeps hitting "goto retry" and the task can't be killed.
> 
> Thanks a lot for this very good report. I would bet the issue is related
> to the THP zero page.
> 
> The __handle_mm_fault retry loop for VM_FAULT_OOM from do_huge_pmd_wp_page
> expects that the pmd is marked for splitting so that it can break out
> and retry the fault. This is not the case for the THP zero page, though.
> do_huge_pmd_wp_page checks is_huge_zero_pmd and goes on to allocate a new
> huge page, which will succeed in your case because you are hitting the
> memcg limit, not global memory pressure. But then the new page is charged
> by mem_cgroup_newpage_charge, which fails. An existing page is then split
> and we return VM_FAULT_OOM. But we do not have page initialized in that
> path, because page = pmd_page(orig_pmd) is called after the
> is_huge_zero_pmd check.
> 
> I am not very familiar with the THP zero page code, but I guess splitting
> such a zero page is not the way to go. Instead we should simply drop the
> zero page and retry the fault. I would assume that one of
> do_huge_pmd_wp_zero_page_fallback or do_huge_pmd_wp_page_fallback should
> do the trick, but both of them try to charge new page(s) before the
> current zero page is uncharged. That makes them prone to the same issue
> AFAICS.
> 
> But maybe Kirill has a better idea.

Your analysis looks accurate, although I was not able to reproduce the
hang-up. The problem with do_huge_pmd_wp_zero_page_fallback() is that it
can return VM_FAULT_OOM if it fails to allocate a new *small* page, so
that is a real OOM.

The untested patch below tries to fix this. Masayoshi, could you test it?
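For reference, something along these lines should exercise the path in
question when run inside a memcg with a small limit (say, 10M) and THP
enabled. This is an untested sketch based on the description above, not
necessarily what the original test program does:

/* reproducer sketch: read-fault a THP region so the huge zero page gets
 * mapped, then write to it so do_huge_pmd_wp_page() has to COW and charge
 * a new huge page against the memcg limit */
#include <string.h>
#include <sys/mman.h>

#define LEN	(16UL << 20)		/* well above a 10M memcg limit */

int main(void)
{
	unsigned long i;
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/* read first: read faults map the huge zero page */
	for (i = 0; i < LEN; i += 4096)
		(void)*(volatile char *)(p + i);

	/* write: the COW charge fails against the memcg limit and the
	 * fault keeps returning VM_FAULT_OOM */
	memset(p, 1, LEN);
	return 0;
}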
BTW, Michal, I've triggered a sleep-in-atomic bug in mem_cgroup_print_oom_info():

[    2.386563] Task in /test killed as a result of limit of /test
[    2.387326] memory: usage 10240kB, limit 10240kB, failcnt 51
[    2.388098] memory+swap: usage 10240kB, limit 10240kB, failcnt 0
[    2.388861] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
[    2.389640] Memory cgroup stats for /test:
[    2.390178] BUG: sleeping function called from invalid context at /home/space/kas/git/public/linux/kernel/cpu.c:68
[    2.391516] in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
[    2.392416] 2 locks held by memcg_test/66:
[    2.392945]  #0:  (memcg_oom_lock#2){+.+...}, at: [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
[    2.394233]  #1:  (oom_info_lock){+.+...}, at: [<ffffffff81197b2a>] mem_cgroup_print_oom_info+0x2a/0x390
[    2.395496] CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
[    2.396536] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
[    2.397540]  ffffffff81a3cc90 ffff88081d26dba0 ffffffff81776ea3 0000000000000000
[    2.398541]  ffff88081d26dbc8 ffffffff8108418a 0000000000000000 ffff88081d15c000
[    2.399533]  0000000000000000 ffff88081d26dbd8 ffffffff8104f6bc ffff88081d26dc10
[    2.400588] Call Trace:
[    2.400908]  [<ffffffff81776ea3>] dump_stack+0x4d/0x66
[    2.401578]  [<ffffffff8108418a>] __might_sleep+0x16a/0x210
[    2.402295]  [<ffffffff8104f6bc>] get_online_cpus+0x1c/0x60
[    2.403005]  [<ffffffff8118fb67>] mem_cgroup_read_stat+0x27/0xb0
[    2.403769]  [<ffffffff81197d60>] mem_cgroup_print_oom_info+0x260/0x390
[    2.404653]  [<ffffffff8177314e>] dump_header+0x88/0x251
[    2.405342]  [<ffffffff810a3bfd>] ? trace_hardirqs_on+0xd/0x10
[    2.406098]  [<ffffffff81130618>] oom_kill_process+0x258/0x3d0
[    2.406833]  [<ffffffff81198746>] mem_cgroup_oom_synchronize+0x656/0x6c0
[    2.407674]  [<ffffffff811973a0>] ? mem_cgroup_charge_common+0xd0/0xd0
[    2.408553]  [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
[    2.409354]  [<ffffffff817712f7>] mm_fault_error+0x91/0x189
[    2.410069]  [<ffffffff81783eae>] __do_page_fault+0x48e/0x580
[    2.410791]  [<ffffffff8108f656>] ? local_clock+0x16/0x30
[    2.411467]  [<ffffffff810a3bfd>] ? trace_hardirqs_on+0xd/0x10
[    2.412248]  [<ffffffff8177f6fc>] ? _raw_spin_unlock_irq+0x2c/0x40
[    2.413039]  [<ffffffff8108312b>] ? finish_task_switch+0x7b/0x100
[    2.413821]  [<ffffffff813b954a>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[    2.414652]  [<ffffffff81783fae>] do_page_fault+0xe/0x10
[    2.415330]  [<ffffffff81780552>] page_fault+0x22/0x30
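If I read the trace right, mem_cgroup_print_oom_info() takes oom_info_lock
(a spinlock) and then mem_cgroup_read_stat() ends up in get_online_cpus(),
which may sleep. Schematically, the pattern is (a simplified illustration
only, not the actual mm/memcontrol.c code):

#include <linux/cpu.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(oom_info_lock);

static void oom_info_sketch(void)
{
	spin_lock(&oom_info_lock);	/* atomic context from here on */
	get_online_cpus();		/* may sleep -> the BUG splat above */
	/* ... sum up per-cpu counters ... */
	put_online_cpus();
	spin_unlock(&oom_info_lock);
}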
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf974e1..974eb9eea2c0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1166,8 +1166,10 @@ alloc:
 	} else {
 		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 				pmd, orig_pmd, page, haddr);
-		if (ret & VM_FAULT_OOM)
+		if (ret & VM_FAULT_OOM) {
 			split_huge_page(page);
+			ret |= VM_FAULT_FALLBACK;
+		}
 		put_page(page);
 	}
 	count_vm_event(THP_FAULT_FALLBACK);
@@ -1179,9 +1181,12 @@ alloc:
 		if (page) {
 			split_huge_page(page);
 			put_page(page);
+			ret |= VM_FAULT_FALLBACK;
+		} else {
+			ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
+					address, pmd, orig_pmd, haddr);
 		}
 		count_vm_event(THP_FAULT_FALLBACK);
-		ret |= VM_FAULT_OOM;
 		goto out;
 	}
 
diff --git a/mm/memory.c b/mm/memory.c
index be6a0c0d4ae0..3b57b7864667 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3703,7 +3703,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
-retry:
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
 	if (!pud)
@@ -3741,20 +3740,13 @@ retry:
 			if (dirty && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
-				/*
-				 * If COW results in an oom, the huge pmd will
-				 * have been split, so retry the fault on the
-				 * pte for a smaller charge.
-				 */
-				if (unlikely(ret & VM_FAULT_OOM))
-					goto retry;
-				return ret;
+				if (!(ret & VM_FAULT_FALLBACK))
+					return ret;
 			} else {
 				huge_pmd_set_accessed(mm, vma, address, pmd,
 						      orig_pmd, dirty);
+				return 0;
 			}
-
-			return 0;
 		}
 	}
 
-- 
 Kirill A. Shutemov