On Tue 30-12-14 15:42:56, Tetsuo Handa wrote:
[...]
> We might want to discuss below case as a separate topic, but is a TIF_MEMDIE
> stall anyway. I retested using 3.19-rc2 with diff shown below. If I start
> a.out and b.out (where b.out is a copy of a.out) with slight delay (a few
> deciseconds), I can observe that the a.out is unable to die due to b.out
> asking for memory or holding lock.
> http://I-love.SAKURA.ne.jp/tmp/serial-20141230-ab-1.txt.xz is a case
> where I think a.out keeps the OOM killer disabled and

[ 53.748454] b.out invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[ 53.807397] active_anon:448903 inactive_anon:2082 isolated_anon:0
[ 53.807397] active_file:0 inactive_file:9 isolated_file:0
[ 53.807397] unevictable:0 dirty:3 writeback:0 unstable:0
[ 53.807397] free:13079 slab_reclaimable:1227 slab_unreclaimable:4520
[ 53.807397] mapped:380 shmem:2151 pagetables:2059 bounce:0
[ 53.807397] free_cma:0
[...]
[ 53.856598] Free swap  = 0kB
[ 53.857908] Total swap = 0kB
[ 53.859218] 524157 pages RAM

This situation looks quite hopeless. We cannot swap, yet over 80% of memory is
occupied by anon memory (448903 + 2082 anon pages out of 524157 pages of RAM is
roughly 86%). On the other hand, there is still around ~50M free (13079 pages)
and a few pages of reclaimable slab, which should be sufficient to help the
TIF_MEMDIE task make some progress.

[ 54.380517] Out of memory: Kill process 3596 (a.out) score 719 or sacrifice child
[ 54.382091] Killed process 3596 (a.out) total-vm:2166864kB, anon-rss:1383880kB, file-rss:4kB
[...]
[ 348.134718] a.out D ffff880036fefcb8 0 3596 1 0x00100084
[ 348.136616] ffff880036fefcb8 ffff880036fefc88 ffff88007c204550 00000000000130c0
[ 348.138645] ffff880036feffd8 00000000000130c0 ffff88007c204550 ffff880036fefcb8
[ 348.140657] ffff88007ca45248 ffff88007ca4524c ffff88007c204550 00000000ffffffff
[ 348.142672] Call Trace:
[ 348.143662] [<ffffffff815bddb4>] schedule_preempt_disabled+0x24/0x70
[ 348.145379] [<ffffffff815bfb65>] __mutex_lock_slowpath+0xb5/0x120
[ 348.147153] [<ffffffff815bfbee>] mutex_lock+0x1e/0x32
[ 348.148644] [<ffffffffa02463ca>] xfs_file_buffered_aio_write.isra.15+0x6a/0x200 [xfs]
[ 348.150637] [<ffffffff8100d62f>] ? __switch_to+0x15f/0x580
[ 348.152209] [<ffffffffa02465dd>] xfs_file_write_iter+0x7d/0x120 [xfs]
[ 348.153961] [<ffffffff81178009>] new_sync_write+0x89/0xd0
[ 348.155506] [<ffffffff811787f2>] vfs_write+0xb2/0x1f0
[ 348.157004] [<ffffffff8101b994>] ? do_audit_syscall_entry+0x64/0x70
[ 348.158715] [<ffffffff81179440>] SyS_write+0x50/0xc0
[ 348.160188] [<ffffffff810f9ffe>] ? __audit_syscall_exit+0x22e/0x2d0

and this is the case for most a.out and b.out threads, basically because all of
them contend on a single file.
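For reference, below is a minimal sketch of the kind of workload these traces
(and the ones further down) suggest: a bunch of threads appending to one shared
file, so every write serializes on that file's i_mutex, while the anonymous
memory is eaten by read()ing /dev/zero into a large private mapping, which
faults in zeroed anon pages one chunk at a time. This is only my reconstruction
from the backtraces, not Tetsuo's actual a.out/b.out; the file name, the sizes
and the thread count are made up.

/*
 * Sketch only -- reconstructed from the backtraces, not Tetsuo's a.out.
 * Writer threads serialize on the shared file's i_mutex (see the traces
 * above) while the main thread faults in anonymous memory by reading
 * /dev/zero into a big private mapping.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define WRITERS	100			/* arbitrary */
#define MEM	(2UL << 30)		/* 2GB of anon memory to fault in */
#define CHUNK	(1UL << 20)

static void *writer(void *arg)
{
	int fd = *(int *)arg;
	char buf[4096];

	memset(buf, 'x', sizeof(buf));
	for (;;)			/* every write queues on the i_mutex */
		write(fd, buf, sizeof(buf));
	return NULL;
}

int main(void)
{
	int fd = open("testfile", O_WRONLY | O_CREAT | O_APPEND, 0600);
	int zero = open("/dev/zero", O_RDONLY);
	char *p = mmap(NULL, MEM, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	unsigned long off;
	pthread_t tid;
	int i;

	if (fd < 0 || zero < 0 || p == MAP_FAILED) {
		perror("setup");
		return 1;
	}
	for (i = 0; i < WRITERS; i++)
		pthread_create(&tid, NULL, writer, &fd);

	/* each read() zeroes user memory and page-faults anon pages in */
	for (off = 0; off < MEM; off += CHUNK)
		read(zero, p + off, CHUNK);
	pause();
	return 0;
}

Running two copies a few deciseconds apart, as Tetsuo describes, should give a
similar pile-up of both instances' writers on the same i_mutex.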
The holder of the lock right now seems to be:

[ 355.559722] b.out R running task 0 3843 3724 0x00000080
[ 355.561700] MemAlloc: 21916 jiffies on 0x10
[ 355.563056] ffff88007c3f3808 ffff88007c3f37d8 ffff88007c3e4d60 00000000000130c0
[ 355.565346] ffff88007c3f3fd8 00000000000130c0 ffff88007c3e4d60 ffff880036f02b48
[ 355.567440] ffffffff81848588 0000000000000400 0000000000000000 ffff88007c3f39c8
[ 355.569517] Call Trace:
[ 355.570557] [<ffffffff815bdc72>] _cond_resched+0x22/0x40
[ 355.572167] [<ffffffff811249f2>] shrink_node_slabs+0x242/0x310
[ 355.573846] [<ffffffff81127155>] shrink_zone+0x175/0x1c0
[ 355.575410] [<ffffffff81127590>] do_try_to_free_pages+0x1d0/0x3e0
[ 355.577339] [<ffffffff81127834>] try_to_free_pages+0x94/0xc0
[ 355.579015] [<ffffffff8111d4c5>] __alloc_pages_nodemask+0x535/0xaa0
[ 355.580759] [<ffffffff8115cf9c>] alloc_pages_current+0x8c/0x100
[ 355.582446] [<ffffffff811148f7>] __page_cache_alloc+0xa7/0xc0
[ 355.584092] [<ffffffff81115364>] pagecache_get_page+0x54/0x1b0
[ 355.585773] [<ffffffffa025d11e>] ? xfs_trans_commit+0x13e/0x230 [xfs]
[ 355.587553] [<ffffffff811154e8>] grab_cache_page_write_begin+0x28/0x50
[ 355.589349] [<ffffffffa023b04f>] xfs_vm_write_begin+0x2f/0xe0 [xfs]
[ 355.591096] [<ffffffff8111465c>] generic_perform_write+0xbc/0x1c0
[ 355.592816] [<ffffffffa024634f>] ? xfs_file_aio_write_checks+0xdf/0xf0 [xfs]
[ 355.594718] [<ffffffffa024642f>] xfs_file_buffered_aio_write.isra.15+0xcf/0x200 [xfs]

So it is trying to reclaim at least something, but it will take some time for it
to realize this will not fly. The allocation will fail eventually, though,
because this is a !__GFP_FS allocation, and the same will apply to a.out waiting
for the lock as well.

$ grep "waited for.*select_bad_process" serial-20141230-ab-1.txt | sed 's@.*\((pid=.*waited for.*\) for.*@\1@' | sort | uniq -c
      1 (pid=2,flags=0x2000d0) waited for a.out(pid=3596,flags=0x0)
    809 (pid=3724,flags=0x280da) waited for a.out(pid=3596,flags=0x0)

[ 351.915586] b.out R running task 0 3724 3572 0x00000080
[ 351.917619] MemAlloc: 29906 jiffies on 0x10
[ 351.919012] ffff88007b8d7948 ffff88007fffc6c0 ffff88007c5751b0 00000000000130c0
[ 351.921096] ffff88007b8d7fd8 00000000000130c0 ffff88007c5751b0 0000000000000000
[ 351.923228] 0000000000000000 00000000000280da 0000000000000002 0000000000000000
[ 351.925374] Call Trace:
[ 351.926466] [<ffffffff815bdc72>] _cond_resched+0x22/0x40
[ 351.928073] [<ffffffff8111d477>] __alloc_pages_nodemask+0x4e7/0xaa0
[ 351.929828] [<ffffffff8115f302>] alloc_pages_vma+0x92/0x160
[ 351.931502] [<ffffffff8113fa11>] handle_mm_fault+0xbe1/0xed0
[ 351.933171] [<ffffffff815c2847>] ? native_iret+0x7/0x7
[ 351.934719] [<ffffffff8105502c>] __do_page_fault+0x1dc/0x5b0
[ 351.936412] [<ffffffff8111d125>] ? __alloc_pages_nodemask+0x195/0xaa0
[ 351.938191] [<ffffffff81055431>] do_page_fault+0x31/0x70
[ 351.939769] [<ffffffff815c3638>] page_fault+0x28/0x30
[ 351.941322] [<ffffffff812b1940>] ? __clear_user+0x20/0x50
[ 351.942921] [<ffffffff81139538>] iov_iter_zero+0x68/0x2f0
[ 351.944503] [<ffffffff8138a4e7>] read_iter_zero+0x47/0xb0
[ 351.946135] [<ffffffff81177f46>] new_sync_read+0x86/0xc0
[ 351.947703] [<ffffffff811791b3>] __vfs_read+0x13/0x50
[ 351.949216] [<ffffffff81179271>] vfs_read+0x81/0x140
[ 351.950757] [<ffffffff81179380>] SyS_read+0x50/0xc0
[ 351.952277] [<ffffffff810f9ffe>] ? __audit_syscall_exit+0x22e/0x2d0
[ 351.953995] [<ffffffff815c1c29>] system_call_fastpath+0x12/0x17

So the OOM blocked task is sitting in the page fault caused by clearing the user
buffer.
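As a side note, the flags=0x280da reported for pid 3724 in the grep output above
(and in the oom invocation earlier) can be decoded against the ___GFP_* bit
values from the 3.19-era include/linux/gfp.h. A quick sketch; the values are
written from memory of that tree, so treat it as an illustration rather than a
reference:

/* Decode the gfp_mask 0x280da seen in the oom invocation and MemAlloc output. */
#include <stdio.h>

static const struct { unsigned long bit; const char *name; } gfp_bits[] = {
	{ 0x02,    "__GFP_HIGHMEM"  },
	{ 0x08,    "__GFP_MOVABLE"  },
	{ 0x10,    "__GFP_WAIT"     },
	{ 0x40,    "__GFP_IO"       },
	{ 0x80,    "__GFP_FS"       },
	{ 0x8000,  "__GFP_ZERO"     },
	{ 0x20000, "__GFP_HARDWALL" },
};

int main(void)
{
	unsigned long mask = 0x280da;
	unsigned int i;

	for (i = 0; i < sizeof(gfp_bits) / sizeof(gfp_bits[0]); i++)
		if (mask & gfp_bits[i].bit)
			printf("%s\n", gfp_bits[i].name);
	/*
	 * The seven flags above account for every bit in 0x280da, i.e.
	 * GFP_HIGHUSER_MOVABLE (GFP_USER | __GFP_HIGHMEM | __GFP_MOVABLE)
	 * plus __GFP_ZERO.
	 */
	return 0;
}

Note that __GFP_FS is part of this mask, unlike in the lock holder's page cache
allocation discussed above.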
According to your debugging patch this should be a GFP_HIGHUSER_MOVABLE |
__GFP_ZERO allocation, which is the case where we retry without failing most of
the time. I am not familiar with the VFS code much, but it seems we are not
sitting on any locks that would block the OOM victim later on (I am not entirely
sure about FDPUT_POS_UNLOCK from fdget_pos, but all tasks are already past that
call without blocking, so it shouldn't matter). So even if the page fault failed
with ENOMEM it wouldn't help us much here.

That being said, this doesn't look like a livelock or a lockup. The system
should recover from this state, but it might take a lot of time (there are
hundreds of tasks waiting on the i_mutex lock, each will try to allocate and
fail, and the OOM victims will have to get out of the kernel and die). I am not
sure we can do much about that from the allocator POV. One possibility would be
to refrain from the reclaim effort when it is clear that nothing is really
reclaimable, but I suspect this would be tricky to get right.

> http://I-love.SAKURA.ne.jp/tmp/serial-20141230-ab-2.txt.xz is a case

[ 44.588785] Out of memory: Kill process 3599 (a.out) score 773 or sacrifice child
[ 44.590418] Killed process 3599 (a.out) total-vm:2166864kB, anon-rss:1488688kB, file-rss:4kB
[...]
[ 44.640689] a.out: page allocation failure: order:0, mode:0x280da
[ 44.640690] CPU: 2 PID: 3599 Comm: a.out Not tainted 3.19.0-rc2+ #20
[...]
[ 44.641125] a.out: page allocation failure: order:0, mode:0x2015a
[ 44.641126] CPU: 2 PID: 3599 Comm: a.out Not tainted 3.19.0-rc2+ #20

So the OOM victim is failing the allocation because we prevent endless loops in
the allocator for TIF_MEMDIE tasks, and then it dies (it is not among the
Sysrq+t output AFAICS). We still have to wait for all the tasks sharing the mm
with it. Many of them are in:

[ 402.300859] a.out x ffff88007be53ce8 0 3601 1 0x00000086
[ 402.303407] ffff88007be53ce8 ffff88007c962450 ffff880078d10e60 00000000000130c0
[ 402.305478] ffff88007be53fd8 00000000000130c0 ffff880078d10e60 ffff880078d114a8
[ 402.307519] ffff880078d114a8 ffff880078d11170 ffff88007c0a9220 ffff880078d10e60
[ 402.309547] Call Trace:
[ 402.310551] [<ffffffff815bd8c4>] schedule+0x24/0x70
[ 402.312040] [<ffffffff8106a4ea>] do_exit+0x6ba/0xb10
[ 402.313531] [<ffffffff8106b7da>] do_group_exit+0x3a/0xa0
[ 402.315082] [<ffffffff81075de8>] get_signal+0x188/0x690
[ 402.316629] [<ffffffff815bd43a>] ? __schedule+0x27a/0x6e0
[ 402.318196] [<ffffffff8100e4f2>] do_signal+0x32/0x750
[ 402.319744] [<ffffffffa02611c4>] ? _xfs_log_force_lsn+0xc4/0x2f0 [xfs]
[ 402.321729] [<ffffffffa0245489>] ? xfs_file_fsync+0x159/0x1b0 [xfs]
[ 402.323461] [<ffffffff8100ec5c>] do_notify_resume+0x4c/0x90
[ 402.325135] [<ffffffff815c1ec7>] int_signal+0x12/0x17

so they have already dropped their reference to the mm_struct, but some of them
are still waiting in the write path to fail and exit:

[ 402.271983] a.out D ffff88007c047cb8 0 3600 1 0x00000084
[ 402.273866] ffff88007c047cb8 ffff88007c047c88 ffff8800793d8ba0 00000000000130c0
[ 402.275872] ffff88007c047fd8 00000000000130c0 ffff8800793d8ba0 ffff88007c047cb8
[ 402.277878] ffff88007ae56a48 ffff88007ae56a4c ffff8800793d8ba0 00000000ffffffff
[ 402.279888] Call Trace:
[ 402.280874] [<ffffffff815bddb4>] schedule_preempt_disabled+0x24/0x70
[ 402.282597] [<ffffffff815bfb65>] __mutex_lock_slowpath+0xb5/0x120
[ 402.284266] [<ffffffff815bfbee>] mutex_lock+0x1e/0x32
[ 402.285756] [<ffffffffa02463ca>] xfs_file_buffered_aio_write.isra.15+0x6a/0x200 [xfs]
[ 402.287741] [<ffffffff8100d62f>] ? __switch_to+0x15f/0x580
[ 402.289311] [<ffffffffa02465dd>] xfs_file_write_iter+0x7d/0x120 [xfs]
[ 402.291050] [<ffffffff81178009>] new_sync_write+0x89/0xd0
[ 402.292596] [<ffffffff811787f2>] vfs_write+0xb2/0x1f0
[ 402.294075] [<ffffffff8101b994>] ? do_audit_syscall_entry+0x64/0x70
[ 402.295774] [<ffffffff81179440>] SyS_write+0x50/0xc0
[ 402.297239] [<ffffffff810f9ffe>] ? __audit_syscall_exit+0x22e/0x2d0
[ 402.298947] [<ffffffff815c1c29>] system_call_fastpath+0x12/0x17

while one of them is holding the lock:

[ 402.736525] a.out R running task 0 3617 1 0x00000084
[ 402.738452] MemAlloc: 358299 jiffies on 0x10
[ 402.739812] ffff88007ba63808 ffff88007ba637d8 ffff8800792f2510 00000000000130c0
[ 402.741972] ffff88007ba63fd8 00000000000130c0 ffff8800792f2510 ffff880078d1bb48
[ 402.744029] ffffffff81848588 0000000000000400 0000000000000000 ffff88007ba639c8
[ 402.746135] Call Trace:
[ 402.747153] [<ffffffff815bdc72>] _cond_resched+0x22/0x40
[ 402.748718] [<ffffffff811249f2>] shrink_node_slabs+0x242/0x310
[ 402.750432] [<ffffffff81127155>] shrink_zone+0x175/0x1c0
[ 402.751996] [<ffffffff81127590>] do_try_to_free_pages+0x1d0/0x3e0
[ 402.753686] [<ffffffff81127834>] try_to_free_pages+0x94/0xc0
[ 402.755325] [<ffffffff8111d4c5>] __alloc_pages_nodemask+0x535/0xaa0
[ 402.757057] [<ffffffff8115cf9c>] alloc_pages_current+0x8c/0x100
[ 402.758725] [<ffffffff811148f7>] __page_cache_alloc+0xa7/0xc0
[ 402.760362] [<ffffffff81115364>] pagecache_get_page+0x54/0x1b0
[ 402.762004] [<ffffffff811154e8>] grab_cache_page_write_begin+0x28/0x50
[ 402.763787] [<ffffffffa023b04f>] xfs_vm_write_begin+0x2f/0xe0 [xfs]
[ 402.765516] [<ffffffff8111465c>] generic_perform_write+0xbc/0x1c0
[ 402.767203] [<ffffffffa024634f>] ? xfs_file_aio_write_checks+0xdf/0xf0 [xfs]
[ 402.769078] [<ffffffffa024642f>] xfs_file_buffered_aio_write.isra.15+0xcf/0x200 [xfs]

So this is basically the same as the previous case; we just see it in slightly
better shape because many threads have managed to exit already.

> where I think a.out cannot die within reasonable duration due to b.out .

I am not sure you can have any reasonable time expectation with such huge
contention on a single file. Even killing the task manually would take quite
some time, I suspect. Sure, memory pressure makes it all much worse.

> I don't know whether cgroups can help or not,

Memory cgroups would help you limit the amount of anon memory, but you would
have to be really careful about the potential overcommit due to other
allocations from outside of the restricted group. Not having any swap doesn't
help here either; it just moves all the reclaim pressure to the file pages and
slab, which are struggling already.

> but I think we need to be prepared for cases where sending SIGKILL to
> all threads sharing the same memory does not help.

Sure, unkillable tasks are a problem which we have to handle. Having GFP_KERNEL
allocations looping without a way out contributes to this, which is sad, but
your current data just show that sometimes it might take ages to finish even
without that going on.

-- 
Michal Hocko
SUSE Labs