On Mon 13-06-16 13:19:43, Michal Hocko wrote: [...] > I am trying to remember why we are disabling oom killer before kernel > threads are frozen but not really sure about that right away. OK, I guess I remember now. Say that a task would depend on a freezable kernel thread to get to do_exit (stuck in wait_event etc...). We would simply get stuck in oom_killer_disable for ever. So we need to address it a different way. One way would be what you are proposing but I guess it would be more systematic to never call exit_oom_victim on a remote task. After [1] we have a solid foundation to rely only on MMF_REAPED even when TIF_MEMDIE is set. It is more code than your patch so I can see a reason to go with yours if the following one seems too large or ugly. [1] http://lkml.kernel.org/r/1466426628-15074-1-git-send-email-mhocko@xxxxxxxxxx What do you think about the following? --- >From f7ffecf944648f16d98fc70275041433ed68f7e4 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@xxxxxxxx> Date: Mon, 20 Jun 2016 17:28:18 +0200 Subject: [PATCH] oom, suspend: fix oom_reaper vs. oom_killer_disable race Tetsuo has reported the following potential oom_killer_disable vs. oom_reaper race: (1) freeze_processes() starts freezing user space threads. (2) Somebody (maybe a kenrel thread) calls out_of_memory(). (3) The OOM killer calls mark_oom_victim() on a user space thread P1 which is already in __refrigerator(). (4) oom_killer_disable() sets oom_killer_disabled = true. (5) P1 leaves __refrigerator() and enters do_exit(). (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call exit_oom_victim(P1). (7) oom_killer_disable() returns while P1 not yet finished (8) P1 perform IO/interfere with the freezer. This patch effectivelly reverts 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper"). We have a more efficient way to achieve the same. We can rely on MMF_REAPED because "mm, oom: hide mm which is shared with kthread or global init" made sure that we skip tasks with MMF_REAPED even when they have TIF_MEMDIE. The only missing part is that we have to handle __oom_reap_task failure gracefully. Currently we mark the mm by MMF_OOM_NOT_REAPABLE to later set MMF_REAPED should we fail __oom_reap_task again. This relies on clearing TIF_MEMDIE to select that task again. We can achieve a similar thing by requeuing the task to the oom_reaper_list. Introduce oom_reaper_add_tail to add the task to the tail of the oom_reaper_list. While we are at it reuse it for wake_oom_reaper as well because the queueing should be FIFO. In the result only the oom victim will be responsible to clear its TIF_MEMDIE flag so the above race is not possible anymore. oom_reaper will be allowed to play only with MMF_* flags so the responsibility model will be more clear. Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper") Reported-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> Signed-off-by: Michal Hocko <mhocko@xxxxxxxx> --- mm/oom_kill.c | 86 ++++++++++++++++++++++++++++++++++++++--------------------- 1 file changed, 55 insertions(+), 31 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 4c21f744daa6..dbc62a69fee8 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -558,6 +558,51 @@ static bool __oom_reap_task(struct task_struct *tsk) return ret; } +static void oom_reaper_add_tail(struct task_struct *tsk) +{ + struct task_struct *q; + + spin_lock(&oom_reaper_lock); + if (!oom_reaper_list) { + oom_reaper_list = tsk; + goto unlock; + } + + for (q = oom_reaper_list; q->oom_reaper_list; q = q->oom_reaper_list) + ; + q->oom_reaper_list = tsk; + +unlock: + spin_unlock(&oom_reaper_lock); +} + +static inline bool requeue_oom_victim(struct task_struct *tsk) +{ + struct task_struct *p; + bool ret; + + /* + * If we've already tried to reap this task in the past and + * failed it probably doesn't make much sense to try yet again + * so hide the mm from the oom killer so that it can move on + * to another task with a different mm struct. + */ + p = find_lock_task_mm(tsk); + if (!p) + return false; + + if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) + set_bit(MMF_OOM_REAPED, &p->mm->flags); + else + ret = true; + + task_unlock(p); + + if (ret) + oom_reaper_add_tail(tsk); + return ret; +} + #define MAX_OOM_REAP_RETRIES 10 static void oom_reap_task(struct task_struct *tsk) { @@ -567,40 +612,23 @@ static void oom_reap_task(struct task_struct *tsk) while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk)) schedule_timeout_idle(HZ/10); - if (attempts > MAX_OOM_REAP_RETRIES) { - struct task_struct *p; + tsk->oom_reaper_list = NULL; + if (attempts > MAX_OOM_REAP_RETRIES) { pr_info("oom_reaper: unable to reap pid:%d (%s)\n", task_pid_nr(tsk), tsk->comm); - /* - * If we've already tried to reap this task in the past and - * failed it probably doesn't make much sense to try yet again - * so hide the mm from the oom killer so that it can move on - * to another task with a different mm struct. - */ - p = find_lock_task_mm(tsk); - if (p) { - if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) { - pr_info("oom_reaper: giving up pid:%d (%s)\n", - task_pid_nr(tsk), tsk->comm); - set_bit(MMF_OOM_REAPED, &p->mm->flags); - } - task_unlock(p); + debug_show_all_locks(); + + if (requeue_oom_victim(tsk)) { + cond_resched(); + return; } - debug_show_all_locks(); + pr_info("oom_reaper: giving up pid:%d (%s)\n", + task_pid_nr(tsk), tsk->comm); } - /* - * Clear TIF_MEMDIE because the task shouldn't be sitting on a - * reasonably reclaimable memory anymore or it is not a good candidate - * for the oom victim right now because it cannot release its memory - * itself nor by the oom reaper. - */ - tsk->oom_reaper_list = NULL; - exit_oom_victim(tsk); - /* Drop a reference taken by wake_oom_reaper */ put_task_struct(tsk); } @@ -637,11 +665,7 @@ void wake_oom_reaper(struct task_struct *tsk) return; get_task_struct(tsk); - - spin_lock(&oom_reaper_lock); - tsk->oom_reaper_list = oom_reaper_list; - oom_reaper_list = tsk; - spin_unlock(&oom_reaper_lock); + oom_reaper_add_tail(tsk); wake_up(&oom_reaper_wait); } -- 2.8.1 -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>