Michal Hocko wrote:
> On Tue 11-09-18 00:40:23, Tetsuo Handa wrote:
> > >> Also, why MMF_OOM_SKIP will not be set if the OOM reaper handed over?
> > >
> > > The idea is that the mm is not visible to anybody (except for the oom
> > > reaper) anymore. So MMF_OOM_SKIP shouldn't matter.
> > >
> >
> > I think it absolutely matters. The OOM killer waits until MMF_OOM_SKIP is set
> > on a mm which is visible via task_struct->signal->oom_mm .
>
> Hmm, I have to re-read the exit path once again and see when we unhash
> the task and how many dangerous things we do in the mean time. I might
> have been overly optimistic and you might be right that we indeed have
> to set MMF_OOM_SKIP after all.

What a foolhardy attempt!

Commit d7a94e7e11badf84 ("oom: don't count on mm-less current process") says

    out_of_memory() doesn't trigger the OOM killer if the current task is
    already exiting or it has fatal signals pending, and gives the task
    access to memory reserves instead. However, doing so is wrong if
    out_of_memory() is called by an allocation (e.g. from exit_task_work())
    after the current task has already released its memory and cleared
    TIF_MEMDIE at exit_mm(). If we again set TIF_MEMDIE to post-exit_mm()
    current task, the OOM killer will be blocked by the task sitting in the
    final schedule() waiting for its parent to reap it. It will trigger an
    OOM livelock if its parent is unable to reap it due to doing an
    allocation and waiting for the OOM killer to kill it.

and your

+	/*
+	 * the exit path is guaranteed to finish without any unbound
+	 * blocking at this stage so make it clear to the caller.
+	 */

attempt is essentially same with "we keep TIF_MEMDIE of post-exit_mm() task".

That is, we can't expect that the OOM victim can finish without any unbound
blocking. We have no choice but timeout based heuristic if we don't want to
set MMF_OOM_SKIP even with your customized version of free_pgtables().
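
To illustrate the two options named above, here is a minimal userspace sketch. It is not kernel code and not a proposed patch. It assumes a toy model in which the OOM killer keeps waiting on a victim until either a skip flag is set on the victim's mm or a timeout expires. Every toy_* name, the TOY_MMF_OOM_SKIP bit value, and the tick-based timeout are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for the MMF_OOM_SKIP flag; the bit value is made up. */
#define TOY_MMF_OOM_SKIP (1UL << 0)

/* Toy model of an mm: only the flags word matters here. */
struct toy_mm {
	unsigned long flags;
};

/* Toy model of an OOM victim. */
struct toy_victim {
	struct toy_mm *oom_mm;    /* models task_struct->signal->oom_mm */
	unsigned long marked_at;  /* tick at which the victim was selected */
};

enum toy_oom_decision {
	TOY_OOM_WAIT,         /* keep waiting for this victim */
	TOY_OOM_SELECT_NEXT,  /* stop waiting, allow selecting another victim */
};

/*
 * Decide whether the (toy) OOM killer should keep waiting for a victim.
 * If nobody ever sets the skip flag, the timeout is the only thing that
 * lets the killer move on; with timeout-free waiting this would block
 * forever.
 */
static enum toy_oom_decision toy_oom_check(const struct toy_victim *v,
					   unsigned long now,
					   unsigned long timeout)
{
	assert(v && v->oom_mm);

	/* Option 1: the flag was set, so the victim's mm can be skipped. */
	if (v->oom_mm->flags & TOY_MMF_OOM_SKIP)
		return TOY_OOM_SELECT_NEXT;

	/* Option 2: timeout-based heuristic (unsigned subtraction is wrap-safe). */
	if (now - v->marked_at >= timeout)
		return TOY_OOM_SELECT_NEXT;

	return TOY_OOM_WAIT;
}
```

In this model, a victim whose flag is never set and for which no timeout is applied would return TOY_OOM_WAIT on every check, which is the livelock pattern described in the commit message quoted above.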