On Sat, Jul 11, 2020 at 11:18 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > > If the current's MMF_OOM_SKIP is set, it means that the current is exiting > or dying and likely to realease its address space. So we don't need to > invoke the oom killer again. Otherwise that may cause some unexpected > issues, for example, bellow is the issue found in our production > environment. > > There're many threads of a multi-threaded task parallel running in a > container on many cpus. Then many threads triggered OOM at the same time, > > CPU-1 CPU-2 ... CPU-n > thread-1 thread-2 ... thread-n > > wait oom_lock wait oom_lock ... hold oom_lock > > (sigkill received) > > select current as victim > and wakeup oom reaper > > release oom_lock > > (MMF_OOM_SKIP set by oom reaper) > > (lots of pages are freed) > hold oom_lock > > because MMF_OOM_SKIP > is set, kill others > > The thread running on CPU-n received sigkill and it will select current as > the victim and wakeup the oom reaper. Then oom reaper will reap its rss and > free lots of pages, as a result, there will be many free pages. > Although the multi-threaded task is exiting, the other threads will > continue to kill others because of the check of MMF_OOM_SKIP in > task_will_free_mem(). > > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx> > --- > mm/oom_kill.c | 14 ++++++-------- > 1 file changed, 6 insertions(+), 8 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 6e94962..a8a155a 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -825,13 +825,6 @@ static bool task_will_free_mem(struct task_struct *task) > if (!__task_will_free_mem(task)) > return false; > > - /* > - * This task has already been drained by the oom reaper so there are > - * only small chances it will free some more > - */ > - if (test_bit(MMF_OOM_SKIP, &mm->flags)) > - return false; > - > if (atomic_read(&mm->mm_users) <= 1) > return true; > > @@ -963,7 +956,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message) > * so it can die quickly > */ > task_lock(victim); > - if (task_will_free_mem(victim)) { > + if (!test_bit(MMF_OOM_SKIP, &victim->mm->flags) && > + task_will_free_mem(victim)) { > mark_oom_victim(victim); > wake_oom_reaper(victim); > task_unlock(victim); > @@ -1056,6 +1050,10 @@ bool out_of_memory(struct oom_control *oc) > return true; > } > > + /* current has been already reapered */ > + if (test_bit(MMF_OOM_SKIP, ¤t->mm->flags)) > + return true; > + Oops. Should check whether mm is NULL first: if (mm && test_bit(MMF_OOM_SKIP, mm->flags)) > /* > * If current has a pending SIGKILL or is exiting, then automatically > * select it. The goal is to allow it to allocate so that it may > -- > 1.8.3.1 > -- Thanks Yafang