On Sat 20-12-14 20:42:08, Tetsuo Handa wrote: [...] > >From a2ebb5b873ec5af45e0bea9ea6da2a93c0f06c35 Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> > Date: Sat, 20 Dec 2014 20:05:14 +0900 > Subject: [PATCH] oom: Close race of setting TIF_MEMDIE to mm-less process. > > exit_mm() and oom_kill_process() could race with regard to handling of > TIF_MEMDIE flag if sequence described below occurred. > > P1 calls out_of_memory(). out_of_memory() calls select_bad_process(). > select_bad_process() calls oom_scan_process_thread(P2). If P2->mm != NULL > and task_will_free_mem(P2) == false, oom_scan_process_thread(P2) returns > OOM_SCAN_OK. And if P2 is chosen as a victim task, select_bad_process() > returns P2 after calling get_task_struct(P2). Then, P1 goes to sleep and > P2 is woken up. P2 enters into do_exit() and gets PF_EXITING at exit_signals() > and releases mm at exit_mm(). Then, P2 goes to sleep and P1 is woken up. > P1 calls oom_kill_process(P2). oom_kill_process() sets TIF_MEMDIE on P2 > because task_will_free_mem(P2) == true due to PF_EXITING already set. > Afterward, oom_scan_process_thread(P2) will return OOM_SCAN_ABORT because > test_tsk_thread_flag(P2, TIF_MEMDIE) is checked before P2->mm is checked. > > If TIF_MEMDIE was again set to P2, the OOM killer will be blocked by P2 > sitting in the final schedule() waiting for P2's parent to reap P2. > It will trigger an OOM livelock if P2's parent is unable to reap P2 due to > doing an allocation and waiting for the OOM killer to kill P2. > > To close this race window, clear TIF_MEMDIE if P2->mm == NULL after > set_tsk_thread_flag(P2, TIF_MEMDIE) is done. I do not think this patch is sufficient. P2 could pass exit_mm() right after task_unlock in oom_kill_process and we would set TIF_MEMDIE to this task as well. Something like the following should work and it doesn't add memory barriers trickery. --- >From f3f6d498259d08bf5c076f6321580cb9679b2d65 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@xxxxxxx> Date: Mon, 22 Dec 2014 20:55:29 +0100 Subject: [PATCH] oom: Make sure that TIF_MEMDIE is set under task_lock OOM killer tries to exlude tasks which do not have mm_struct associated because killing such a task wouldn't help much. The OOM victim gets TIF_MEMDIE set to disable OOM killer while the current victim releases the memory and then enables the OOM killer again by dropping the flag. oom_kill_process is currently prone to a race condition when the OOM victim is already exiting and TIF_MEMDIE is set after it the task releases its address space. This might theoretically lead to OOM livelock if the OOM victim blocks on an allocation later during exiting because it wouldn't kill any other process and the exiting one won't be able to exit. The situation is highly unlikely because the OOM victim is expected to release some memory which should help to sort out OOM situation. Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock which will serialize the OOM killer with exit_mm which sets task->mm to NULL. Setting the flag for current is not necessary because check and set is not racy. Reported-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> Signed-off-by: Michal Hocko <mhocko@xxxxxxx> --- mm/oom_kill.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d503e9ce1c7b..ef8a4f9c5671 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -438,11 +438,14 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (task_will_free_mem(p)) { + task_lock(p); + if (p->mm && task_will_free_mem(p)) { set_tsk_thread_flag(p, TIF_MEMDIE); put_task_struct(p); + task_unlock(p); return; } + task_unlock(p); if (__ratelimit(&oom_rs)) dump_header(p, gfp_mask, order, memcg, nodemask); @@ -492,6 +495,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, /* mm cannot safely be dereferenced after task_unlock(victim) */ mm = victim->mm; + set_tsk_thread_flag(victim, TIF_MEMDIE); pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n", task_pid_nr(victim), victim->comm, K(victim->mm->total_vm), K(get_mm_counter(victim->mm, MM_ANONPAGES)), @@ -522,7 +526,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } rcu_read_unlock(); - set_tsk_thread_flag(victim, TIF_MEMDIE); do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); put_task_struct(victim); } -- 2.1.4 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>