Michal Hocko wrote:
> On Fri 08-01-16 00:38:43, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > @@ -333,6 +333,14 @@ static struct task_struct *select_bad_process(struct oom_control *oc,
> > >  		if (points == chosen_points && thread_group_leader(chosen))
> > >  			continue;
> > >  
> > > +		/*
> > > +		 * If the current major task is already ooom killed and this
> > > +		 * is sysrq+f request then we rather choose somebody else
> > > +		 * because the current oom victim might be stuck.
> > > +		 */
> > > +		if (is_sysrq_oom(sc) && test_tsk_thread_flag(p, TIF_MEMDIE))
> > > +			continue;
> > > +
> > >  		chosen = p;
> > >  		chosen_points = points;
> > >  	}
> > 
> > Do we want to require SysRq-f for each thread in a process?
> > If g has 1024 p, dump_tasks() will do
> > 
> >   pr_info("[%5d] %5d %5d %8lu %8lu %7ld %7ld %8lu %5hd %s\n",
> > 
> > for 1024 times? I think one SysRq-f per one process is sufficient.
> 
> I am not following you here. If we kill the process the whole process
> group (aka all threads) will get killed which ever thread we happen to
> send the sigkill to.

Please distinguish between "sending SIGKILL to a process" and "all threads
in that process terminating".

do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true) sends SIGKILL to a
victim process, but it does not guarantee that all threads in that process
terminate, even if the OOM reaper has reclaimed memory. That is when
SysRq-f (and timeout-based next victim selection) is needed, but currently
SysRq-f keeps selecting the wrong process forever.

I can observe that SysRq-f is disabled
(Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160112.txt.xz .)
----------
[ 86.767482] a.out invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
[ 86.769905] a.out cpuset=/ mems_allowed=0
[ 86.771393] CPU: 2 PID: 9573 Comm: a.out Not tainted 4.4.0-next-20160112+ #279
(...snipped...)
[ 86.874710] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[ 86.945286] [ 9573] 1000 9573 541717 402522 796 6 0 0 a.out
[ 86.947457] [ 9574] 1000 9574 1078 21 7 3 0 0 a.out
[ 86.949568] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[ 86.951538] Killed process 9574 (a.out) total-vm:4312kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 86.955296] systemd-journal invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|GFP_COLD)
[ 86.958035] systemd-journal cpuset=/ mems_allowed=0
(...snipped...)
[ 87.128808] [ 9573] 1000 9573 541717 402522 796 6 0 0 a.out
[ 87.130926] [ 9575] 1000 9574 1078 0 7 3 0 0 a.out
[ 87.133055] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[ 87.134989] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 116.979564] sysrq: SysRq : Manual OOM execution
[ 116.984119] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[ 116.986367] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[ 117.157045] [ 9573] 1000 9573 541717 402522 797 6 0 0 a.out
[ 117.159191] [ 9575] 1000 9574 1078 0 7 3 0 0 a.out
[ 117.161302] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[ 117.163250] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 119.043685] sysrq: SysRq : Manual OOM execution
[ 119.046239] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[ 119.048453] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[ 119.215982] [ 9573] 1000 9573 541717 402522 797 6 0 0 a.out
[ 119.218122] [ 9575] 1000 9574 1078 0 7 3 0 0 a.out
[ 119.220237] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[ 119.222129] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 120.179644] sysrq: SysRq : Manual OOM execution
[ 120.206938] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[ 120.209152] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[ 120.376821] [ 9573] 1000 9573 541717 402522 797 6 0 0 a.out
[ 120.378924] [ 9575] 1000 9574 1078 0 7 3 0 0 a.out
[ 120.381065] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[ 120.382929] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 121.235296] sysrq: SysRq : Manual OOM execution
[ 121.252742] kworker/0:8 invoked oom-killer: order=-1, oom_score_adj=0, gfp_mask=0x24000c0(GFP_KERNEL)
[ 121.254955] kworker/0:8 cpuset=/ mems_allowed=0
(...snipped...)
[ 141.024984] a.out D ffff88007c417948 0 9573 8117 0x00000080
[ 141.026830] ffff88007c417948 ffff880076cac2c0 ffff880076c442c0 ffff88007c418000
[ 141.028789] ffff88007c417980 ffff88007fc90240 00000000fffd7aa1 00000000000006bc
[ 141.030746] ffff88007c417960 ffffffff816fc1a7 ffff88007fc90240 ffff88007c417a08
[ 141.032703] Call Trace:
[ 141.033653] [<ffffffff816fc1a7>] schedule+0x37/0x90
[ 141.035056] [<ffffffff81700567>] schedule_timeout+0x117/0x1c0
[ 141.036629] [<ffffffff810e1310>] ? init_timer_key+0x40/0x40
[ 141.038182] [<ffffffff81700694>] schedule_timeout_uninterruptible+0x24/0x30
[ 141.039963] [<ffffffff8114944b>] __alloc_pages_nodemask+0x91b/0xd90
[ 141.041631] [<ffffffff811925e6>] alloc_pages_vma+0xb6/0x290
[ 141.043173] [<ffffffff811711d0>] handle_mm_fault+0x1180/0x1630
[ 141.044770] [<ffffffff811700a4>] ? handle_mm_fault+0x54/0x1630
[ 141.046355] [<ffffffff8105a651>] __do_page_fault+0x1a1/0x440
[ 141.047915] [<ffffffff8105a920>] do_page_fault+0x30/0x80
[ 141.049408] [<ffffffff81702307>] ? native_iret+0x7/0x7
[ 141.050876] [<ffffffff817033e8>] page_fault+0x28/0x30
[ 141.052327] [<ffffffff813a6f3d>] ? __clear_user+0x3d/0x70
[ 141.053831] [<ffffffff813ab9e8>] iov_iter_zero+0x68/0x250
[ 141.055346] [<ffffffff814866a8>] read_iter_zero+0x38/0xb0
[ 141.056854] [<ffffffff811c0994>] __vfs_read+0xc4/0xf0
[ 141.058295] [<ffffffff811c154a>] vfs_read+0x7a/0x120
[ 141.059711] [<ffffffff811c1df3>] SyS_read+0x53/0xd0
[ 141.061104] [<ffffffff81701772>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 141.062768] a.out x ffff88007b92fca0 0 9574 9573 0x00000084
[ 141.064604] ffff88007b92fca0 ffff880076cac2c0 ffff88007a862c80 ffff88007b930000
[ 141.066555] ffff88007a863040 ffff88007a863308 ffff88007a862c80 ffff88007cc10000
[ 141.068492] ffff88007b92fcb8 ffffffff816fc1a7 ffff88007a863308 ffff88007b92fd28
[ 141.070437] Call Trace:
[ 141.071389] [<ffffffff816fc1a7>] schedule+0x37/0x90
[ 141.072788] [<ffffffff810733fe>] do_exit+0x6be/0xb50
[ 141.074198] [<ffffffff81073917>] do_group_exit+0x47/0xc0
[ 141.075676] [<ffffffff8107f122>] get_signal+0x222/0x7e0
[ 141.077135] [<ffffffff8100f232>] do_signal+0x32/0x6d0
[ 141.078570] [<ffffffff81095cc8>] ? finish_task_switch+0xa8/0x2b0
[ 141.080176] [<ffffffff8106b967>] ? syscall_slow_exit_work+0x4b/0x10d
[ 141.081837] [<ffffffff81095cc8>] ? finish_task_switch+0xa8/0x2b0
[ 141.083441] [<ffffffff8106b8ba>] ? exit_to_usermode_loop+0x2e/0x90
[ 141.085063] [<ffffffff8106b8d8>] exit_to_usermode_loop+0x4c/0x90
[ 141.086667] [<ffffffff8100355b>] syscall_return_slowpath+0xbb/0x130
[ 141.088305] [<ffffffff817018da>] int_ret_from_sys_call+0x25/0x9f
[ 141.089896] a.out D ffff88007be2fab8 0 9575 9573 0x00100084
[ 141.091734] ffff88007be2fab8 ffff880036509640 ffff8800366742c0 ffff88007be30000
[ 141.093688] 0000000000000000 7fffffffffffffff ffff88007ff72cb8 ffffffff816fca00
[ 141.095743] ffff88007be2fad0 ffffffff816fc1a7 ffff88007fc17280 ffff88007be2fb70
[ 141.097699] Call Trace:
[ 141.098649] [<ffffffff816fca00>] ? bit_wait+0x60/0x60
[ 141.100071] [<ffffffff816fc1a7>] schedule+0x37/0x90
[ 141.101453] [<ffffffff817005c8>] schedule_timeout+0x178/0x1c0
[ 141.103001] [<ffffffff810e81e2>] ? ktime_get+0x102/0x130
[ 141.104468] [<ffffffff810bdfd9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 141.106158] [<ffffffff810be0ad>] ? trace_hardirqs_on+0xd/0x10
[ 141.107698] [<ffffffff810e8187>] ? ktime_get+0xa7/0x130
[ 141.109138] [<ffffffff811276ea>] ? __delayacct_blkio_start+0x1a/0x30
[ 141.110782] [<ffffffff816fb641>] io_schedule_timeout+0xa1/0x110
[ 141.112350] [<ffffffff816fca16>] bit_wait_io+0x16/0x70
[ 141.113774] [<ffffffff816fc62b>] __wait_on_bit+0x5b/0x90
[ 141.115234] [<ffffffff8113f83a>] ? find_get_pages_tag+0x19a/0x2c0
[ 141.116824] [<ffffffff8113e5c6>] wait_on_page_bit+0xc6/0xf0
[ 141.118319] [<ffffffff810b5830>] ? autoremove_wake_function+0x30/0x30
[ 141.119983] [<ffffffff8113e797>] __filemap_fdatawait_range+0x107/0x190
[ 141.121643] [<ffffffff81140a8c>] ? __filemap_fdatawrite_range+0xcc/0x100
[ 141.123352] [<ffffffff8113e82f>] filemap_fdatawait_range+0xf/0x30
[ 141.124955] [<ffffffff81140bad>] filemap_write_and_wait_range+0x3d/0x60
[ 141.126655] [<ffffffff812b2614>] xfs_file_fsync+0x44/0x180
[ 141.128149] [<ffffffff811f482b>] vfs_fsync_range+0x3b/0xb0
[ 141.129646] [<ffffffff812b4242>] xfs_file_write_iter+0x102/0x140
[ 141.131260] [<ffffffff811c0a87>] __vfs_write+0xc7/0x100
[ 141.132702] [<ffffffff811c168d>] vfs_write+0x9d/0x190
[ 141.134108] [<ffffffff811e104a>] ? __fget_light+0x6a/0x90
[ 141.135593] [<ffffffff811c1ec3>] SyS_write+0x53/0xd0
[ 141.136998] [<ffffffff81701772>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 141.138646] a.out D ffff88007af4fce8 0 9576 9573 0x00000084
[ 141.140490] ffff88007af4fce8 ffff8800366742c0 ffff880036672c80 ffff88007af50000
[ 141.142415] ffff88007d14a5b0 ffff880036672c80 0000000000000246 00000000ffffffff
[ 141.144331] ffff88007af4fd00 ffffffff816fc1a7 ffff88007d14a5a8 ffff88007af4fd10
[ 141.146308] Call Trace:
[ 141.147261] [<ffffffff816fc1a7>] schedule+0x37/0x90
[ 141.148651] [<ffffffff816fc4d0>] schedule_preempt_disabled+0x10/0x20
[ 141.150326] [<ffffffff816fd31b>] mutex_lock_nested+0x17b/0x3e0
[ 141.151902] [<ffffffff812b3faf>] ? xfs_file_buffered_aio_write+0x5f/0x1f0
[ 141.153647] [<ffffffff812b3faf>] xfs_file_buffered_aio_write+0x5f/0x1f0
[ 141.155397] [<ffffffff812b41c4>] xfs_file_write_iter+0x84/0x140
[ 141.156989] [<ffffffff811c0a87>] __vfs_write+0xc7/0x100
[ 141.158460] [<ffffffff811c168d>] vfs_write+0x9d/0x190
[ 141.159933] [<ffffffff811e104a>] ? __fget_light+0x6a/0x90
[ 141.161417] [<ffffffff811c1ec3>] SyS_write+0x53/0xd0
[ 141.162853] [<ffffffff81701772>] entry_SYSCALL_64_fastpath+0x12/0x76
(...snipped...)
[ 181.154922] [ 9573] 1000 9573 541717 402522 797 6 0 0 a.out
[ 181.157145] [ 9575] 1000 9574 1078 0 7 3 0 0 a.out
[ 181.159265] Out of memory: Kill process 9573 (a.out) score 908 or sacrifice child
[ 181.161160] Killed process 9575 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 184.227075] sysrq: SysRq : Kill All Tasks
----------

using linux-next-20160112 without the "mm,oom: exclude TIF_MEMDIE processes
from candidates." patch, and the reproducer shown below.

----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

/* Append to /tmp/file with O_SYNC until a write fails or is short. */
static int file_writer(void *unused)
{
	static char buffer[4096] = { };
	const int fd = open("/tmp/file",
			    O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0600);
	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
	return 0;
}

/* Keep doubling the buffer until realloc() fails, then touch it all. */
static int memory_consumer(void *unused)
{
	const int fd = open("/dev/zero", O_RDONLY);
	unsigned long size;
	char *buf = NULL;
	sleep(1);
	unlink("/tmp/file");
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	read(fd, buf, size); /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	if (fork() == 0) {
		/* Child: start up to 10 file_writer threads. */
		int i;
		for (i = 0; i < 10; i++) {
			char *cp = malloc(4096);
			if (!cp || clone(file_writer, cp + 4096,
					 CLONE_THREAD | CLONE_SIGHAND | CLONE_VM,
					 NULL) == -1)
				break;
		}
	} else {
		/* Parent: consume memory until the OOM killer is invoked. */
		memory_consumer(NULL);
	}
	while (1)
		pause();
}
----------

> > 
> > How can we guarantee that find_lock_task_mm() from oom_kill_process()
> > chooses !TIF_MEMDIE thread when try_to_sacrifice_child() somehow chose
> > !TIF_MEMDIE thread? I think choosing !TIF_MEMDIE thread at
> > find_lock_task_mm() is the simplest way.
> 
> find_lock_task_mm chosing TIF_MEMDIE thread shouldn't change anything
> because the whole thread group will go down anyway. If you want to
> guarantee that the sysrq+f never choses a task which has a TIF_MEMDIE
> thread then we would have to check for fatal_signal_pending as well
> AFAIU. Fiddling with find find_lock_task_mm will not help you though
> unless I am missing something.

I do want to guarantee that SysRq-f (and timeout-based next victim
selection) never chooses a process which has a TIF_MEMDIE thread.

I don't like the current "oom: clear TIF_MEMDIE after oom_reaper managed to
unmap the address space" patch unless both the "mm,oom: exclude TIF_MEMDIE
processes from candidates." patch and the "mm,oom: Re-enable OOM killer
using timers." patch are used together. Since your patch covers only the
likely case, it cannot be an alternative to my patches, which cover the
unlikely cases.
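
To make the requested guarantee concrete, here is a minimal userspace sketch of "a manual request never picks a process that already has a marked thread". It is only a model and not kernel code: struct model_thread, struct model_process, model_select_victim() and demo_pick() are hypothetical names, the memdie field merely stands in for TIF_MEMDIE, and the tgids and scores are made-up sample values rather than numbers from the log above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not kernel types. */
struct model_thread {
	int pid;
	bool memdie;		/* stands in for TIF_MEMDIE on this thread */
};

struct model_process {
	int tgid;
	unsigned long points;	/* stands in for the badness score */
	const struct model_thread *threads;
	size_t nr_threads;
};

/* True if any thread of the process already carries the mark. */
static bool process_has_memdie_thread(const struct model_process *p)
{
	for (size_t i = 0; i < p->nr_threads; i++)
		if (p->threads[i].memdie)
			return true;
	return false;
}

/*
 * Pick the highest-scoring process. For a manual (SysRq-f style) request,
 * skip every process that already has a marked thread, so that a repeated
 * request moves on to a different victim.
 */
static const struct model_process *
model_select_victim(const struct model_process *procs, size_t n, bool manual)
{
	const struct model_process *chosen = NULL;

	assert(procs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		const struct model_process *p = &procs[i];

		if (manual && process_has_memdie_thread(p))
			continue;
		if (!chosen || p->points > chosen->points)
			chosen = p;
	}
	return chosen;
}

/* Made-up sample data: process 200 scores highest but has a marked thread. */
static const struct model_thread threads_a[] = { { 101, false }, { 102, false } };
static const struct model_thread threads_b[] = { { 200, false }, { 201, true } };
static const struct model_thread threads_c[] = { { 300, false } };

static const struct model_process sample[] = {
	{ 100, 300, threads_a, sizeof(threads_a) / sizeof(threads_a[0]) },
	{ 200, 900, threads_b, sizeof(threads_b) / sizeof(threads_b[0]) },
	{ 300,  50, threads_c, sizeof(threads_c) / sizeof(threads_c[0]) },
};

/* Returns the tgid chosen from the sample data, or -1 if none is eligible. */
int demo_pick(bool manual)
{
	const struct model_process *p =
		model_select_victim(sample, sizeof(sample) / sizeof(sample[0]),
				    manual);
	return p ? p->tgid : -1;
}
```

With this sample data, a non-manual request still picks process 200 by score, while a manual request skips it and falls through to process 100. The check is done per process rather than per thread, which is the difference from testing only the one task the selection loop happens to be looking at.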