On Sat 13-06-15 00:23:00, Tetsuo Handa wrote: [...] > >From e59b64683827151a35257384352c70bce61babdd Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> > Date: Fri, 12 Jun 2015 23:56:18 +0900 > Subject: [RFC] oom: implement memdie_task_panic_secs > > OOM killer is a desperate last resort reclaim attempt to free some > memory. It is based on heuristics which will never be 100% and may > result in an unusable or a locked up system. > > panic_on_oom sysctl knob allows to set the OOM policy to panic the > system instead of trying to resolve the OOM condition. This might be > useful for several reasons - e.g. reduce the downtime to a predictable > amount of time, allow to get a crash dump of the system and debug the > issue post-mortem. > > panic_on_oom is, however, a big hammer in many situations when the > OOM condition could be resolved in a reasonable time. So it would be > good to have some middle ground and allow the OOM killer to do its job > but have a failover when things go wrong and it is not able to make any > further progress for a considerable amount of time. > > This patch implements system_memdie_panic_secs sysctl which configures > a maximum timeout for the OOM killer to resolve the OOM situation. > If the system is still under OOM (i.e. the OOM victim cannot release > memory) after the timeout expires, it will panic the system. A > reasonably chosen timeout can protect from both temporal OOM conditions > and allows to have a predictable time frame for the OOM condition. > > Since there are memcg OOM, cpuset OOM, mempolicy OOM as with system OOM, > this patch also implements {memcg,cpuset,mempolicy}_memdie_panic_secs . I really hate having so many knobs. What would they be good for? Why cannot you simply use a single timeout and decide whether to panic or not based on panic_on_oom value? Or do you have any strong reason to put this aside from panic_on_oom? > These will allow administrator to use different timeout settings for > each type of OOM, for administrator still has chance to perform steps > to resolve the potential lockup or trashing from the global context > (e.g. by relaxing restrictions or even rebooting cleanly). > > Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> > --- > include/linux/oom.h | 8 +++++ > include/linux/sched.h | 1 + > kernel/sysctl.c | 39 ++++++++++++++++++++++++ > mm/oom_kill.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++ > 4 files changed, 131 insertions(+) > [...] > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index dff991e..40d7b6d0 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c [...] > +static void check_memdie_task(struct task_struct *p, struct mem_cgroup *memcg, > + const nodemask_t *nodemask) > +{ > + unsigned long start = p->memdie_start; > + unsigned long spent; > + unsigned long timeout = 0; > + > + /* If task does not have TIF_MEMDIE flag, there is nothing to do. */ > + if (!start) > + return; > + spent = jiffies - start; > +#ifdef CONFIG_MEMCG > + /* task_in_mem_cgroup(p, memcg) is true. */ > + if (memcg) > + timeout = sysctl_cgroup_memdie_panic_secs; > +#endif > +#ifdef CONFIG_NUMA > + /* has_intersects_mems_allowed(p, nodemask) is true. */ > + else if (nodemask) > + timeout = sysctl_mempolicy_memdie_panic_secs; > + else > + timeout = sysctl_cpuset_memdie_panic_secs; > +#endif > + if (!timeout) > + timeout = sysctl_system_memdie_panic_secs; > + /* If timeout is disabled, there is nothing to do. */ > + if (!timeout) > + return; > +#ifdef CONFIG_NUMA > + { > + struct task_struct *t; > + > + rcu_read_lock(); > + for_each_thread(p, t) { > + start = t->memdie_start; > + if (start && time_after(spent, timeout * HZ)) > + break; > + } > + rcu_read_unlock(); This doesn't make any sense to me. What are you trying to achieve here? Why would you want to check all threads and do that only for CONFIG_NUMA and even then do a noop if the timeout expired? > + } > +#endif > + if (time_before(spent, timeout * HZ)) > + return; I think the whole function is way too much complicated without a good reason. Why don't you simply store the expire time when marking the task oom victim and compare it with the current jiffies with time_after and be done with it. This is few lines of code... > + panic("Out of memory: %s (%u) did not die within %lu seconds.\n", > + p->comm, p->pid, timeout); It would be better to sync this message with what check_panic_on_oom does. > +} > + > /* return true if the task is not adequate as candidate victim task. */ > static bool oom_unkillable_task(struct task_struct *p, > struct mem_cgroup *memcg, const nodemask_t *nodemask) > @@ -135,6 +209,7 @@ static bool oom_unkillable_task(struct task_struct *p, > if (!has_intersects_mems_allowed(p, nodemask)) > return true; > > + check_memdie_task(p, memcg, nodemask); This is not sufficient. oom_scan_process_thread would break out from the loop when encountering the first TIF_MEMDIE task and could have missed an older one later in the task_list. Besides that oom_unkillable_task doesn't sound like a good match to evaluate this logic. I would expect it to be in oom_scan_process_thread. > return false; > } > > @@ -416,10 +491,17 @@ bool oom_killer_disabled __read_mostly; > */ > void mark_oom_victim(struct task_struct *tsk) > { > + unsigned long start; > + > WARN_ON(oom_killer_disabled); > /* OOM killer might race with memcg OOM */ > if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) > return; > + /* Set current time for is_killable_memdie_task() check. */ > + start = jiffies; > + if (!start) > + start = 1; > + tsk->memdie_start = start; I would rather go with tsk->oom_expire = jiffies + timeout and set the timeout depending on panic_on_oom value (which would require nodemask and memcg parameters here). > /* > * Make sure that the task is woken up from uninterruptible sleep > * if it is frozen because OOM killer wouldn't be able to free > @@ -435,6 +517,7 @@ void mark_oom_victim(struct task_struct *tsk) > */ > void exit_oom_victim(void) > { > + current->memdie_start = 0; Is this really needed? OOM killer shouldn't see the task because it has already released its mm. oom_scan_process_thread checks mm after it TIF_MEMDIE so we can race theoretically but this shouldn't matter much. If a task is still visible after the timeout then there obviously was a problem in making progress. > clear_thread_flag(TIF_MEMDIE); > > if (!atomic_dec_return(&oom_victims)) > -- > 1.8.3.1 > ------------------------------------------------------------ > > By the way, with introduction of per "struct task_struct" variable, I think > that we can replace TIF_MEMDIE checks with memdie_start checks via > > test_tsk_thread_flag(p, TIF_MEMDIE) => p->memdie_start > > test_and_clear_thread_flag(TIF_MEMDIE) => xchg(¤t->memdie_start, 0) > > test_and_set_tsk_thread_flag(p, TIF_MEMDIE) > => xchg(&p->memdie_start, jiffies (or 1 if jiffies == 0)) > > though above patch did not replace TIF_MEMDIE in order to focus on one thing. I fail to see a direct advantage other than to safe one bit in flags. Is something asking for it? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>