Michal Hocko wrote:
> On Fri 16-06-17 21:22:20, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > OK, could you play with the patch/idea suggested in
> > > http://lkml.kernel.org/r/20170615122031.GL1486@xxxxxxxxxxxxxx?
> >
> > I think we don't need to worry about the mmap_sem dependency inside
> > __mmput(). Since the OOM killer checks for an !MMF_OOM_SKIP mm rather
> > than a TIF_MEMDIE thread, we can keep the OOM killer disabled until we
> > set MMF_OOM_SKIP on the victim's mm. That is, elevating mm_users
> > throughout the reaping procedure does not cause premature victim
> > selection, even after TIF_MEMDIE is cleared from the victim's thread.
> > Then, we don't need to use down_write()/up_write() for a non-OOM
> > victim's mm (nearly 100% of exit_mmap() calls), and can force partial
> > reaping of an OOM victim's mm (nearly 0% of exit_mmap() calls) before
> > __mmput() starts doing exit_aio() etc. Patch is shown below. Only
> > compile tested.
>
> Yes, that would be another approach.
>
> >  include/linux/sched/coredump.h |  1 +
> >  mm/oom_kill.c                  | 80 ++++++++++++++++++++----------------------
> >  2 files changed, 40 insertions(+), 41 deletions(-)
> >
> > diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> > index 98ae0d0..6b6237b 100644
> > --- a/include/linux/sched/coredump.h
> > +++ b/include/linux/sched/coredump.h
> > @@ -62,6 +62,7 @@ static inline int get_dumpable(struct mm_struct *mm)
> >   * on NFS restore
> >   */
> >  //#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
> > +#define MMF_OOM_REAPING		18	/* mm is supposed to be reaped */
>
> A new flag is not really needed. We can increase it for _each_ reapable
> oom victim.

Yes, if we can assume that the number of mark_oom_victim() calls and the
number of wake_oom_reaper() calls match...

> > @@ -658,6 +643,13 @@ static void mark_oom_victim(struct task_struct *tsk)
> >  	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
> >  		mmgrab(tsk->signal->oom_mm);
> >
> > +#ifdef CONFIG_MMU
> > +	if (!test_bit(MMF_OOM_REAPING, &mm->flags)) {
> > +		set_bit(MMF_OOM_REAPING, &mm->flags);
> > +		mmget(mm);
> > +	}
> > +#endif
>
> This would really need a big fat warning explaining why we do not need
> mmget_not_zero. We rely on exit_mm doing both mmput and tsk->mm = NULL
> under the task_lock, and mark_oom_victim is called under this lock as
> well, and task_will_free_mem resp. find_lock_task_mm makes sure we do
> not even consider tasks without mm.
>
> I agree that a solution which is fully contained inside the oom proper
> would be preferable to touching the __mmput path.

OK. Updated patch shown below.
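(Before the patch itself, a side note on why plain mmget() is safe here.
The mm_users rule can be modeled in ordinary userspace C; the mock_*
helpers below are stand-ins for the kernel's mmget()/mmget_not_zero()/
mmput() and are only an illustration of the refcounting pattern, not the
real implementations.)

#include <stdatomic.h>
#include <stdio.h>

/* Userspace model of the mm_users refcounting pattern discussed above. */
static atomic_int mm_users = 1;	/* the owner task's own reference */

/* mmget(): legal only while the caller knows mm_users cannot be zero. */
static void mock_mmget(void)
{
	atomic_fetch_add(&mm_users, 1);
}

/* mmget_not_zero(): refuses to resurrect a counter that already hit zero. */
static int mock_mmget_not_zero(void)
{
	int old = atomic_load(&mm_users);

	while (old != 0)
		if (atomic_compare_exchange_weak(&mm_users, &old, old + 1))
			return 1;
	return 0;
}

/* mmput(): whoever drops the last reference runs __mmput(). */
static void mock_mmput(void)
{
	if (atomic_fetch_sub(&mm_users, 1) == 1)
		printf("last reference dropped: __mmput() would run here\n");
}

int main(void)
{
	/*
	 * mark_oom_victim() path: tsk->mm was observed non-NULL under
	 * task_lock(), i.e. before exit_mm() could do its final mmput(),
	 * so a plain mmget() cannot race with the count reaching zero.
	 */
	mock_mmget();	/* mm_users 1 -> 2 */

	mock_mmput();	/* the victim's exit_mm(): 2 -> 1 */
	mock_mmput();	/* the reaper's mmput_async() equivalent: 1 -> 0 */

	/* Once the count may have hit zero, only mmget_not_zero() is safe. */
	printf("mmget_not_zero() after zero: %d\n", mock_mmget_not_zero());
	return 0;
}

The point is that mmget() is legal only while some other reference is
guaranteed to be held (here: tsk->mm observed non-NULL under task_lock()
before exit_mm() does its final mmput()), whereas mmget_not_zero() is
the only safe variant once the count may have reached zero.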
----------
From 5ed8922bd281456793408328c8b27899ebdd298b Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Date: Sat, 17 Jun 2017 14:04:09 +0900
Subject: [PATCH] mm,oom_kill: Close race window of needlessly selecting new victims.

David Rientjes has reported that the OOM killer can select the next OOM
victim when existing OOM victims have called __mmput() before the OOM
reaper starts trying to unmap pages. In his testing, 4.12-rc kernels are
killing 1-4 processes unnecessarily for each OOM condition.

----------
One oom kill shows the system to be oom:

[22999.488705] Node 0 Normal free:90484kB min:90500kB ...
[22999.488711] Node 1 Normal free:91536kB min:91948kB ...

followed up by one or more unnecessary oom kills showing the oom killer
racing with memory freeing of the victim:

[22999.510329] Node 0 Normal free:229588kB min:90500kB ...
[22999.510334] Node 1 Normal free:600036kB min:91948kB ...
----------

This is because commit e5e3f4c4f0e95ecb ("mm, oom_reaper: make sure that
mmput_async is called only when memory was reaped") refrained from
setting the MMF_OOM_REAPED flag when the OOM reaper found that
mm_users == 0, but commit 26db62f179d112d3 ("oom: keep mm of the killed
task available") then mistakenly changed this to always set the
MMF_OOM_REAPED flag. As a result, the MMF_OOM_SKIP flag is set
immediately, without waiting for __mmput() (because __mmput() might get
stuck before setting the MMF_OOM_SKIP flag), which led to the report
above.

A workaround is to let the OOM reaper wait for a while and give up via
timeout, as if the OOM reaper was unable to take mmap_sem for read. But
we want to avoid a timeout-based approach if possible. Therefore, this
patch takes a different approach.

This patch elevates mm_users of an OOM victim's mm and prevents the OOM
victim from calling __mmput() before the OOM reaper starts trying to
unmap pages. In this way, we can force the OOM reaper to try to reclaim
some memory before setting the MMF_OOM_SKIP flag.

Since commit 862e3073b3eed13f ("mm, oom: get rid of
signal_struct::oom_victims") changed to keep the OOM killer disabled
until MMF_OOM_SKIP is set on the victim's mm rather than until
TIF_MEMDIE is cleared from the victim's thread, we can keep the OOM
killer disabled until __oom_reap_task_mm() or __mmput() reclaims some
memory and sets the MMF_OOM_SKIP flag. That is, elevating mm_users
throughout the reaping procedure does not cause premature victim
selection.

This patch also reduces the range of oom_lock protection in
__oom_reap_task_mm() introduced by commit e2fe14564d3316d1 ("oom_reaper:
close race with exiting task"). Since the OOM killer is kept disabled
until the MMF_OOM_SKIP flag is set, we don't need to serialize the whole
reaping procedure; serializing only the setting of the MMF_OOM_SKIP flag
is enough. This allows the OOM reaper to start reclaiming memory as soon
as wake_oom_reaper() is called by the OOM killer, which compensates for
the delayed exit_mmap() call caused by elevating mm_users of the OOM
victim's mm when the victim could otherwise exit smoothly; currently the
OOM killer calls schedule_timeout_killable(1) at out_of_memory() with
oom_lock held.

This might also allow direct OOM reaping (i.e. letting the OOM killer
call __oom_reap_task_mm() itself, because we would have a 16KB kernel
stack with stack overflow detection) and replacing the OOM reaper kernel
thread with a workqueue (i.e. saving one kernel thread).

Reported-by: David Rientjes <rientjes@xxxxxxxxxx>
Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Fixes: 26db62f179d112d3 ("oom: keep mm of the killed task available")
Cc: Michal Hocko <mhocko@xxxxxxxx>
---
 mm/oom_kill.c | 94 ++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 51 insertions(+), 43 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..cf1d331 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -470,38 +470,9 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	bool ret = true;
-
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
-
-	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
-		goto unlock_oom;
-	}
-
-	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
-		up_read(&mm->mmap_sem);
-		goto unlock_oom;
-	}
+	if (!down_read_trylock(&mm->mmap_sem))
+		return false;
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -537,16 +508,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
+	return true;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
@@ -569,12 +531,31 @@ static void oom_reap_task(struct task_struct *tsk)
 
 done:
 	tsk->oom_reaper_list = NULL;
+	/*
+	 * Drop a mm_users reference taken by mark_oom_victim().
+	 * A mm_count reference taken by mark_oom_victim() remains.
+	 */
+	mmput_async(mm);
 
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
+	 *
+	 * Serialize setting of MMF_OOM_SKIP using oom_lock in order to
+	 * avoid race with select_bad_process() which causes premature
+	 * new oom victim selection.
+	 *
+	 * The OOM reaper:                    An allocating task:
+	 *                                      Failed get_page_from_freelist().
+	 *                                      Enters into out_of_memory().
+	 * Reaped memory enough to make get_page_from_freelist() succeed.
+	 * Sets MMF_OOM_SKIP to mm.
+	 *                                      Enters into select_bad_process().
+	 *                                      # MMF_OOM_SKIP mm selects new victim.
 	 */
+	mutex_lock(&oom_lock);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	mutex_unlock(&oom_lock);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
@@ -602,12 +583,16 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	if (!oom_reaper_th)
+	if (!oom_reaper_th) {
+		mmput_async(tsk->signal->oom_mm);
 		return;
+	}
 
 	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	if (tsk == oom_reaper_list || tsk->oom_reaper_list) {
+		mmput_async(tsk->signal->oom_mm);
 		return;
+	}
 
 	get_task_struct(tsk);
 
@@ -650,12 +635,32 @@ static void mark_oom_victim(struct task_struct *tsk)
 	struct mm_struct *mm = tsk->mm;
 
 	WARN_ON(oom_killer_disabled);
+#ifdef CONFIG_MMU
+	/*
+	 * Take a mm_users reference so that __oom_reap_task_mm() can unmap
+	 * pages without risking a race condition where final mmput() from
+	 * exit_mm() from do_exit() triggered __mmput() and gets stuck there
+	 * (but __oom_reap_task_mm() cannot unmap pages due to mm_users == 0).
+	 *
+	 * Since all callers guarantee that this mm is stable (hold task_lock
+	 * or task is current), we can safely use mmget() here.
+	 *
+	 * When dropping this reference, mmput_async() has to be used because
+	 * __mmput() can get stuck which in turn keeps the OOM killer/reaper
+	 * disabled forever.
+	 */
+	mmget(mm);
+#endif
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
 
 	/* oom_mm is bound to the signal struct life time. */
 	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
+		/*
+		 * Take a mm_count reference so that we can examine flags value
+		 * when tsk_is_oom_victim() is true.
+		 */
 		mmgrab(tsk->signal->oom_mm);
 
 	/*
@@ -908,6 +913,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	if (is_global_init(p)) {
 		can_oom_reap = false;
 		set_bit(MMF_OOM_SKIP, &mm->flags);
+#ifdef CONFIG_MMU
+		mmput_async(mm);
+#endif
 		pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 				task_pid_nr(victim), victim->comm,
 				task_pid_nr(p), p->comm);
-- 
1.8.3.1
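P.S. The ordering argument behind serializing only the setting of
MMF_OOM_SKIP (the oom_reap_task() hunk above) can also be modeled in
plain userspace C. In this sketch a pthread mutex stands in for oom_lock
and two flags stand in for MMF_OOM_SKIP and the pages freed by reaping;
all names are mock-ups for illustration, not kernel code:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Model: the reaper frees memory strictly before it takes the lock and
 * publishes the flag; the allocating task holds the same lock while it
 * checks the flag. So an allocator that observes the flag is guaranteed
 * to also observe the freed memory on its allocation retry.
 */
static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;
static bool mmf_oom_skip;		/* protected by oom_lock */
static atomic_bool memory_reaped;	/* models pages freed by reaping */

static void *oom_reaper_thread(void *unused)
{
	(void)unused;
	atomic_store(&memory_reaped, true);	/* unmap_page_range() model */
	pthread_mutex_lock(&oom_lock);
	mmf_oom_skip = true;			/* set_bit(MMF_OOM_SKIP, ...) */
	pthread_mutex_unlock(&oom_lock);
	return NULL;
}

static void *allocating_task(void *unused)
{
	(void)unused;
	pthread_mutex_lock(&oom_lock);		/* out_of_memory() runs here */
	if (!mmf_oom_skip)
		printf("victim still reapable: keep waiting, no new victim\n");
	else if (atomic_load(&memory_reaped))
		printf("flag set, but allocation retry succeeds: no new victim\n");
	else
		printf("unreachable while holding the lock: premature new victim\n");
	pthread_mutex_unlock(&oom_lock);
	return NULL;
}

int main(void)
{
	pthread_t reaper, alloc;

	pthread_create(&reaper, NULL, oom_reaper_thread, NULL);
	pthread_create(&alloc, NULL, allocating_task, NULL);
	pthread_join(reaper, NULL);
	pthread_join(alloc, NULL);
	return 0;
}

Since the reaper frees memory strictly before publishing the flag under
the lock, an allocating task that observes MMF_OOM_SKIP under the same
lock is guaranteed to see the freed memory on its allocation retry, so
it never selects a new victim prematurely.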