Currently, the OOM reaper calls exit_oom_victim() on remote TIF_MEMDIE thread after an OOM reap attempt was made. This behavior is intended for allowing oom_scan_process_thread() to select next OOM victim by making atomic_read(&task->signal->oom_victims) == 0. But since threads can be blocked for unbounded period at __mmput() from mmput() from exit_mm() from do_exit(), we can't risk the OOM reaper being blocked for unbounded period waiting for TIF_MEMDIE threads. Therefore, when we hit a situation that a TIF_MEMDIE thread which is the only thread of that thread group reached tsk->mm = NULL line in exit_mm() from do_exit() before __oom_reap_task() finds a mm via find_lock_task_mm(), oom_reap_task() does not wait for the TIF_MEMDIE thread to return from __mmput() and instead calls exit_oom_victim(). Patch "mm, oom: hide mm which is shared with kthread or global init" tried to avoid OOM livelock by setting MMF_OOM_REAPED, but it is racy because setting MMF_OOM_REAPED will not help when find_lock_task_mm() in oom_scan_process_thread() failed. It is possible (though unlikely) that a !can_oom_reap TIF_MEMDIE thread becomes the only user of that mm (i.e. mm->mm_users drops to 1) and is later blocked for unbounded period at __mmput() from mmput() from exit_mm() from do_exit() when we hit e.g. (1) First round of OOM killer invocation starts. (2) select_bad_process() chooses P1 as an OOM victim because oom_scan_process_thread() does not find existing victims. (3) oom_kill_process() sets TIF_MEMDIE on P1, but does not put P1 under the OOM reaper's supervision due to (p->flags & PF_KTHREAD) being true, and instead sets MMF_OOM_REAPED on the P1's mm. (4) First round of OOM killer invocation finishes. (5) P1 is unable to arrive at do_exit() due to being blocked at unkillable event waiting for somebody else's memory allocation. (6) Second round of OOM killer invocation starts. (7) select_bad_process() chooses P2 as an OOM victim because oom_scan_process_thread() finds P1's mm with MMF_OOM_REAPED set. (8) oom_kill_process() sets TIF_MEMDIE on P2 via mark_oom_victim(), and puts P2 under the OOM reaper's supervision due to (p->flags & PF_KTHREAD) being false. (9) Second round of OOM killer invocation finishes. (10) The OOM reaper reaps P2's mm, and sets MMF_OOM_REAPED to P2's mm, and clears TIF_MEMDIE from P2. (11) Regarding P1's mm, (p->flags & PF_KTHREAD) becomes false because somebody else's memory allocation succeeds and unuse_mm(P1->mm) is called. At this point P1 becomes the only user of P1->mm. (12) P1 arrives at do_exit() due to no longer being blocked at unkillable event waiting for somebody else's memory allocation. (13) P1 reaches P1->mm = NULL line in exit_mm() from do_exit(). (14) P1 is blocked at __mmput(). (15) Third round of OOM killer invocation starts. (16) select_bad_process() does not choose new OOM victim because oom_scan_process_thread() fails to find P1's mm while P1->signal->oom_victims > 0. (17) Third round of OOM killer invocation finishes. (18) OOM livelock happens because nobody will clear TIF_MEMDIE from P1 (and decrement P1->signal->oom_victims) while P1 is blocked at __mmput(). sequence. Thus, we must not depend on find_lock_task_mm() not returning NULL. We must use a raceless method which allows oom_scan_process_thread() to select next OOM victim no matter when TIF_MEMDIE thread passed tsk->mm = NULL line and is blocked for unbounded period at __mmput() from mmput() from exit_mm() from do_exit(). In order to make atomic_read(&task->signal->oom_victims) == 0, oom_kill_process() needs to call exit_oom_victim() when we don't put the TIF_MEMDIE thread under the OOM reaper's supervision, as with oom_reap_task() needs to call exit_oom_victim() when the OOM reaper failed to reap memory due to find_lock_task_mm() in __oom_reap_task() returning NULL. On the other hand, calling exit_oom_victim() on remote TIF_MEMDIE thread is racy with oom_killer_disable() synchronization. It would be possible to call try_to_freeze_tasks(true) after returning from oom_killer_disable(), but it is not optimal. Also, like patch "mm, oom: task_will_free_mem should skip oom_reaped tasks" showed, calling exit_oom_victim() on remote TIF_MEMDIE thread tends to increase possibility of asking for task_will_free_mem() shortcut in out_of_memory(). Also, calling exit_oom_victim(current) from oom_kill_process() prevents current from using ALLOC_NO_WATERMARKS via TIF_MEMDIE. Therefore, we should consider using a raceless method which does not depend on atomic_read(&task->signal->oom_victims) == 0 and atomic_read(&oom_victims) == 0 in order to respectively allow oom_scan_process_thread() to select next OOM victim and allow oom_killer_disable() synchronization to act as a full "barrier". This patch introduces "struct signal_struct"->oom_ignore_victims flag and sets that flag instead of calling exit_oom_victim(). oom_killer_disable() can use wait_event_timeout() in case something went terribly wrong. Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> Cc: Michal Hocko <mhocko@xxxxxxxx> Cc: Oleg Nesterov <oleg@xxxxxxxxxx> Cc: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx> Cc: David Rientjes <rientjes@xxxxxxxxxx> --- include/linux/sched.h | 1 + mm/oom_kill.c | 30 +++++++++++------------------- 2 files changed, 12 insertions(+), 19 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index a1b4475..c5281da 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -801,6 +801,7 @@ struct signal_struct { * oom */ bool oom_flag_origin; + bool oom_ignore_victims; /* Ignore oom_victims value */ short oom_score_adj; /* OOM kill score adjustment */ short oom_score_adj_min; /* OOM kill score adjustment min value. * Only settable by CAP_SYS_RESOURCE. */ diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 4c21f74..aa030c8 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -284,21 +284,12 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, /* * This task already has access to memory reserves and is being killed. * Don't allow any other task to have access to the reserves unless - * the task has MMF_OOM_REAPED because chances that it would release - * any memory is quite low. + * the task has oom_ignore_victims which is set when we can't wait for + * exit_oom_victim(). */ - if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) { - struct task_struct *p = find_lock_task_mm(task); - enum oom_scan_t ret = OOM_SCAN_ABORT; - - if (p) { - if (test_bit(MMF_OOM_REAPED, &p->mm->flags)) - ret = OOM_SCAN_CONTINUE; - task_unlock(p); - } - - return ret; - } + if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims)) + return task->signal->oom_ignore_victims ? + OOM_SCAN_CONTINUE : OOM_SCAN_ABORT; /* * If task is allocating a lot of memory and has been marked to be @@ -593,13 +584,13 @@ static void oom_reap_task(struct task_struct *tsk) } /* - * Clear TIF_MEMDIE because the task shouldn't be sitting on a - * reasonably reclaimable memory anymore or it is not a good candidate - * for the oom victim right now because it cannot release its memory - * itself nor by the oom reaper. + * Let oom_scan_process_thread() ignore this task because the task + * shouldn't be sitting on a reasonably reclaimable memory anymore or + * it is not a good candidate for the oom victim right now because it + * cannot release its memory itself nor by the oom reaper. */ tsk->oom_reaper_list = NULL; - exit_oom_victim(tsk); + tsk->signal->oom_ignore_victims = true; /* Drop a reference taken by wake_oom_reaper */ put_task_struct(tsk); @@ -930,6 +921,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, */ can_oom_reap = false; set_bit(MMF_OOM_REAPED, &mm->flags); + victim->signal->oom_ignore_victims = true; pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n", task_pid_nr(victim), victim->comm, task_pid_nr(p), p->comm); -- 1.8.3.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>