Hi, I was thinking about this and I am more and more convinced that we shouldn't care about panic_on_oom=2 configuration for now and go with the simplest solution first. I have revisited my original patch and replaced delayed work by a timer based on the feedback from Tetsuo. I think we can rely on timers. A downside would be that we cannot dump the full OOM report from the IRQ context because we rely on task_lock which is not IRQ safe. But I do not think we really need it. An OOM report will be in the log already most of the time and show_mem will tell us the current memory situation. What do you think? --- >From cec53a806cc19b691d6a63544a1283aa90cf92f1 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@xxxxxxx> Date: Tue, 9 Jun 2015 16:15:42 +0200 Subject: [PATCH] oom: implement panic_on_oom_timeout for panic_on_oom=1 OOM killer is a desparate last resort reclaim attempt to free some memory. It is based on heuristics which will never be 100% and may result in an unusable or a locked up system. panic_on_oom sysctl knob allows to set the OOM policy to panic the system instead of trying to resolve the OOM condition. This might be useful for several reasons - e.g. reduce the downtime to a predictable amount of time, allow to get a crash dump of the system and debug the issue post-mortem. panic_on_oom is, however, a big hammer in many situations when the OOM condition could be resolved in a reasonable time. So it would be good to have some middle ground and allow the OOM killer to do its job but have a failover when things go wrong and it is not able to make any further progress for a considerable amount of time. This patch implements panic_on_oom_timeout sysctl which is active only when panic_on_oom=1 and it configures a maximum timeout for the OOM killer to resolve the OOM situation. If the system is still under OOM after the timeout expires it will panic the system. A reasonably chosen timeout can protect from both temporal OOM conditions and allows to have a predictable time frame for the OOM condition. The feature is implemented by a timer which is scheduled when the OOM condition is declared for the first time (oom_victims is still zero) in out_of_memory and it is canceled in exit_oom_victim after the oom_victims count drops down to zero. For this time period OOM killer cannot kill new tasks and it only allows exiting or killed tasks to access memory reserves (and increase oom_victims counter via mark_oom_victim) in order to make a progress so it is reasonable to consider the elevated oom_victims count as an ongoing OOM condition. Timeout for panic_on_oom=2 is not currently supported because it would make the code more complicated and it is even not clear whether this combination is useful. panic_on_oom=2 is disputable on its own because the system is still usable at the time so the administrator can intervene to resolve the OOM situation (e.g. relax NUMA restrictions, increase memory limit for the oom memcg, reboot/panic the system or simply selectively kill a task which might be blocking oom killer from making progress). The log will then contain something like: [ 32.071128] Out of memory: Kill process 2998 (mem_eater) score 42 or sacrifice child [ 32.074022] Killed process 2998 (mem_eater) total-vm:14172kB, anon-rss:628kB, file-rss:4kB [ 32.091849] mem_eater invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0 [ 32.103139] mem_eater cpuset=/ mems_allowed=0 [...] [ 32.354388] Killed process 3001 (mem_eater) total-vm:14172kB, anon-rss:580kB, file-rss:4kB [ 33.088078] panic_on_oom timeout 1s has expired [ 33.089062] Mem-Info: [ 33.089557] active_anon:7991 inactive_anon:8106 isolated_anon:665 [ 33.089557] active_file:183 inactive_file:125 isolated_file:0 [ 33.089557] unevictable:0 dirty:0 writeback:1037 unstable:0 [ 33.089557] slab_reclaimable:2492 slab_unreclaimable:3253 [ 33.089557] mapped:275 shmem:0 pagetables:631 bounce:0 [ 33.089557] free:400 free_pcp:20 free_cma:0 [...] [ 33.092023] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled [ 33.092023] [ 33.092023] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.0.0-oomtimeout5-00001-g667aff689092 #582 [ 33.092023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014 [ 33.092023] 0000000000000101 ffff880007803da8 ffffffff81538b24 ffffffff8108a645 [ 33.092023] ffffffff817ee27d ffff880007803e28 ffffffff81537ac6 ffff880007803e08 [ 33.092023] ffff880000000008 ffff880007803e38 ffff880007803dd8 ffffffff8108a645 [ 33.092023] Call Trace: [ 33.092023] <IRQ> [<ffffffff81538b24>] dump_stack+0x4f/0x7b [ 33.092023] [<ffffffff8108a645>] ? console_unlock+0x337/0x366 [ 33.092023] [<ffffffff81537ac6>] panic+0xbe/0x1eb [ 33.092023] [<ffffffff8108a645>] ? console_unlock+0x337/0x366 [ 33.092023] [<ffffffff8108a645>] ? console_unlock+0x337/0x366 [ 33.092023] [<ffffffff810fc561>] ? ftrace_raw_output_oom_score_adj_update+0x6d/0x6d [ 33.092023] [<ffffffff810fc596>] delayed_panic_on_oom+0x35/0x35 [ 33.092023] [<ffffffff8109866f>] call_timer_fn+0xa7/0x1d0 [ 33.092023] [<ffffffff810fc561>] ? ftrace_raw_output_oom_score_adj_update+0x6d/0x6d [ 33.092023] [<ffffffff81098eb6>] run_timer_softirq+0x22d/0x2a1 [ 33.092023] [<ffffffff810463f3>] __do_softirq+0x141/0x322 [ 33.092023] [<ffffffff810467af>] irq_exit+0x6f/0xb6 TODO: Documentation update Signed-off-by: Michal Hocko <mhocko@xxxxxxx> --- include/linux/oom.h | 1 + kernel/sysctl.c | 8 ++++++++ mm/oom_kill.c | 44 ++++++++++++++++++++++++++++++++++++++++++-- 3 files changed, 51 insertions(+), 2 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 061e0ffd3493..6884c8dc37a0 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -100,4 +100,5 @@ static inline bool task_will_free_mem(struct task_struct *task) extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; extern int sysctl_panic_on_oom; +extern int sysctl_panic_on_oom_timeout; #endif /* _INCLUDE_LINUX_OOM_H */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c index d6fff89b78db..3ac2e5d0b1e2 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1141,6 +1141,14 @@ static struct ctl_table vm_table[] = { .extra2 = &two, }, { + .procname = "panic_on_oom_timeout", + .data = &sysctl_panic_on_oom_timeout, + .maxlen = sizeof(sysctl_panic_on_oom_timeout), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + }, + { .procname = "oom_kill_allocating_task", .data = &sysctl_oom_kill_allocating_task, .maxlen = sizeof(sysctl_oom_kill_allocating_task), diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d7fb1275e200..d48e02627ad0 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -40,6 +40,7 @@ #include <trace/events/oom.h> int sysctl_panic_on_oom; +int sysctl_panic_on_oom_timeout; int sysctl_oom_kill_allocating_task; int sysctl_oom_dump_tasks = 1; @@ -430,6 +431,16 @@ void mark_oom_victim(struct task_struct *tsk) atomic_inc(&oom_victims); } +static void delayed_panic_on_oom(unsigned long unused) +{ + /* We are in the irq context so we cannot call dump_header */ + pr_info("panic_on_oom timeout %ds has expired\n", sysctl_panic_on_oom_timeout); + show_mem(SHOW_MEM_FILTER_NODES); + panic("Out of memory: system-wide panic_on_oom is enabled\n"); +} + +static DEFINE_TIMER(panic_on_oom_timer, delayed_panic_on_oom, 0, 0); + /** * exit_oom_victim - note the exit of an OOM victim */ @@ -437,8 +448,10 @@ void exit_oom_victim(void) { clear_thread_flag(TIF_MEMDIE); - if (!atomic_dec_return(&oom_victims)) + if (!atomic_dec_return(&oom_victims)) { + del_timer(&panic_on_oom_timer); wake_up_all(&oom_victims_wait); + } } /** @@ -539,7 +552,7 @@ static void __oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, p = find_lock_task_mm(victim); if (!p) { put_task_struct(victim); - return; + goto no_oom_victim; } else if (victim != p) { get_task_struct(p); put_task_struct(victim); @@ -581,6 +594,15 @@ static void __oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); put_task_struct(victim); + return; +no_oom_victim: + /* + * If there is no oom victim selected (e.g. we might have raced with + * an exiting task) then cancel delayed oom panic handler. + */ + if (!atomic_read(&oom_victims)) + del_timer(&panic_on_oom_timer); + } #undef K @@ -624,6 +646,24 @@ void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, if (constraint != CONSTRAINT_NONE) return; } + + if (sysctl_panic_on_oom_timeout) { + if (sysctl_panic_on_oom > 1) { + pr_warn("panic_on_oom_timeout is ignored for panic_on_oom=2\n"); + } else { + /* + * Only schedule the delayed panic_on_oom when this is + * the first OOM triggered. oom_lock will protect us + * from races + */ + if (atomic_read(&oom_victims)) + return; + + mod_timer(&panic_on_oom_timer, + jiffies + (sysctl_panic_on_oom_timeout * HZ)); + return; + } + } dump_header(NULL, gfp_mask, order, memcg, nodemask); panic("Out of memory: %s panic_on_oom is enabled\n", sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide"); -- 2.1.4 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>