This ext4 livelock case shows a race window which commit 36324a990cf5 ("oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space") did not care about. ---------- [ 186.620979] Out of memory: Kill process 4458 (file_io.24) score 997 or sacrifice child [ 186.627897] Killed process 4458 (file_io.24) total-vm:4336kB, anon-rss:116kB, file-rss:1024kB, shmem-rss:0kB [ 186.688345] oom_reaper: reaped process 4458 (file_io.24), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 245.572082] MemAlloc: file_io.24(4715) flags=0x400040 switches=8650 uninterruptible dying victim [ 245.578876] file_io.24 D 0000000000000000 0 4715 1 0x00100084 [ 245.584122] ffff88002fd9c000 ffff88002fda4000 ffff880036221870 00000000000035a2 [ 245.589618] 0000000000000000 ffff880036221870 0000000000000000 ffffffff81587dec [ 245.595428] ffff880036221800 ffffffff8123b821 0000000000000000 ffff88002fd9c000 [ 245.601370] Call Trace: [ 245.603428] [<ffffffff81587dec>] ? schedule+0x2c/0x80 [ 245.607680] [<ffffffff8123b821>] ? wait_transaction_locked+0x81/0xc0 /* linux-4.6-rc2/fs/jbd2/transaction.c:163 */ [ 245.613586] [<ffffffff810a1ee0>] ? wait_woken+0x80/0x80 /* linux-4.6-rc2/kernel/sched/wait.c:292 */ [ 245.618074] [<ffffffff8123ba9a>] ? add_transaction_credits+0x21a/0x2a0 /* linux-4.6-rc2/fs/jbd2/transaction.c:191 */ [ 245.623497] [<ffffffff81178abc>] ? mem_cgroup_commit_charge+0x7c/0xf0 [ 245.628352] [<ffffffff8123bceb>] ? start_this_handle+0x18b/0x400 /* linux-4.6-rc2/fs/jbd2/transaction.c:357 */ [ 245.632755] [<ffffffff8110fb6e>] ? add_to_page_cache_lru+0x6e/0xd0 [ 245.637274] [<ffffffff8123c294>] ? jbd2__journal_start+0xf4/0x190 /* linux-4.6-rc2/fs/jbd2/transaction.c:459 */ [ 245.642298] [<ffffffff81205ca4>] ? ext4_da_write_begin+0x114/0x360 /* linux-4.6-rc2/fs/ext4/inode.c:2883 */ [ 245.647035] [<ffffffff8111116e>] ? generic_perform_write+0xce/0x1d0 /* linux-4.6-rc2/mm/filemap.c:2639 */ [ 245.651651] [<ffffffff8119c440>] ? file_update_time+0xc0/0x110 [ 245.656166] [<ffffffff81111f2d>] ? __generic_file_write_iter+0x16d/0x1c0 /* linux-4.6-rc2/mm/filemap.c:2765 */ [ 245.660835] [<ffffffff811fbafa>] ? ext4_file_write_iter+0x12a/0x340 /* linux-4.6-rc2/fs/ext4/file.c:170 */ [ 245.665292] [<ffffffff810226ad>] ? __switch_to+0x20d/0x3f0 [ 245.669604] [<ffffffff81182ddb>] ? __vfs_write+0xcb/0x100 [ 245.673904] [<ffffffff81183968>] ? vfs_write+0x98/0x190 [ 245.678174] [<ffffffff81184d2d>] ? SyS_write+0x4d/0xc0 [ 245.682376] [<ffffffff810034a7>] ? do_syscall_64+0x57/0xf0 [ 245.686845] [<ffffffff8158b1e1>] ? entry_SYSCALL64_slow_path+0x25/0x25 ---------- ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) { ret = __generic_file_write_iter(iocb, from) { written = generic_perform_write(file, from, iocb->ki_pos) { if (fatal_signal_pending(current)) { status = -EINTR; break; } status = a_ops->write_begin(file, mapping, pos, bytes, flags, &page, &fsdata) /* ext4_da_write_begin */ { /***** Event1 *****/ handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, ext4_da_write_credits(inode, pos, len)) /* __ext4_journal_start */ { __ext4_journal_start_sb(inode->i_sb, line, type, blocks, rsv_blocks) { jbd2__journal_start(journal, blocks, rsv_blocks, GFP_NOFS, type, line) { err = start_this_handle(journal, handle, gfp_mask) { if (!journal->j_running_transaction) { /* * If __GFP_FS is not present, then we may be being called from * inside the fs writeback layer, so we MUST NOT fail. */ if ((gfp_mask & __GFP_FS) == 0) gfp_mask |= __GFP_NOFAIL; new_transaction = kmem_cache_zalloc(transaction_cache, gfp_mask); /***** Event2 *****/ if (!new_transaction) return -ENOMEM; } /* We may have dropped j_state_lock - restart in that case */ add_transaction_credits(journal, blocks, rsv_blocks) { /* * If the current transaction is locked down for commit, wait * for the lock to be released. */ if (t->t_state == T_LOCKED) { /***** Event3 *****/ wait_transaction_locked(journal); /***** Event4 *****/ return 1; } } } } } } } } } } Event1 ... The OOM killer sent SIGKILL to file_io.24(4715) because file_io.24(4715) was sharing memory with file_io.24(4458). Event2 ... file_io.24(4715) silently got TIF_MEMDIE using a shortcut fatal_signal_pending(current) in out_of_memory() because kmem_cache_zalloc() is allowed to call out_of_memory() due to __GFP_NOFAIL. Event3 ... The OOM reaper completed reaping memory used by file_io.24(4458) and marked file_io.24(4458) as no longer OOM-killable by now. But since the OOM reaper cleared TIF_MEMDIE from only file_io.24(4458), TIF_MEMDIE in file_io.24(4715) still remains. Event4 ... file_io.24(4715) (which used GFP_NOFS | __GFP_NOFAIL) is waiting for kworker/u128:1(51) (which used GFP_NOFS) to complete wb_workfn. But both kworker/u128:1(51) (which used GFP_NOFS) and kworker/0:2(285) (which used GFP_NOIO) cannot make forward progress because the OOM reaper does not clear TIF_MEMDIE from file_io.24(4715), and the OOM killer does not select next OOM victim due to TIF_MEMDIE in file_io.24(4715). If we remove these shortcuts and set TIF_MEMDIE to all OOM-killed threads sharing the victim's memory at oom_kill_process() and clear TIF_MEMDIE from all threads sharing the victim's memory at __oom_reap_task() (or do equivalent thing using per a signal_struct flag or per a mm_struct flag or a timer), we wouldn't have hit this race window. Thus, I say again, "I think that removing these shortcuts is better." unless we add a guaranteed unlocking mechanism like a timer. Also, I again want to say that, making current thread's current allocation request completed by giving TIF_MEMDIE does not guarantee that the current thread will be able to arrive at do_exit() shortly. It is possible that the current thread is blocked at unkillable wait if current allocation succeeded. Also, is it acceptable to make allocation requests by kworker/u128:1(51) and kworker/0:2(285) fail because they are !__GFP_FS && !__GFP_NOFAIL when file_io.24(4715) has managed to allocate memory for journal's transaction using GFP_NOFS | __GFP_NOFAIL? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>