David Rientjes wrote:
> This doesn't prevent serial oom killing for either the system oom killer
> or for the memcg oom killer.
>
> The oom killer cannot detect tsk_is_oom_victim() if the task has either
> been removed from the tasklist or has already done cgroup_exit(). For
> memcg oom killings in particular, cgroup_exit() is usually called very
> shortly after the oom killer has sent the SIGKILL. If the oom reaper does
> not fail (for example by failing to grab mm->mmap_sem) before another
> memcg charge after cgroup_exit(victim), additional processes are killed
> because the iteration does not view the victim.
>
> This easily kills all processes attached to the memcg with no memory
> freeing from any victim.

Umm... So, you are pointing out that having select_bad_process() abort based
on TIF_MEMDIE or MMF_OOM_SKIP is broken, because victim threads can already
have been removed from the global task list or from the cgroup's task list.

Then the OOM killer will have to wait until every mm_struct in the OOM domain
of interest (system-wide or some cgroup) has been reaped by the OOM reaper.
The simplest way is to wait until all queued mm_structs have been reaped by
the OOM reaper, since we currently do not track which memory cgroup each
mm_struct belongs to, do we? But that can cause needless delay when multiple
OOM events occur in different OOM domains.

Do we want to (and can we) make it possible to tell whether each mm_struct
queued on the OOM reaper's list belongs to the OOM domain of the thread
calling out_of_memory()?
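
To make the last question concrete, here is a rough userspace sketch of what such a check might look like. It is not kernel code: struct oom_domain, struct queued_mm, domain_is_under() and oom_should_wait() are all hypothetical names, and the sketch assumes each queued mm records the domain it was charged to, which is exactly the tracking we do not have today.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct oom_domain {
	const struct oom_domain *parent;	/* NULL for the system-wide domain */
};

struct queued_mm {
	const struct oom_domain *domain;	/* assumed: domain the victim mm was charged to */
	bool reaped;				/* set once the reaper has finished with this mm */
	struct queued_mm *next;
};

/* True if @d is @ancestor itself or sits below it in the hierarchy. */
static bool domain_is_under(const struct oom_domain *d,
			    const struct oom_domain *ancestor)
{
	for (; d; d = d->parent)
		if (d == ancestor)
			return true;
	return false;
}

/*
 * Should an OOM event in @oom_domain wait instead of selecting a new victim?
 * Only not-yet-reaped mms inside the same domain count, so a pending victim
 * in an unrelated domain does not delay this one.
 */
static bool oom_should_wait(const struct queued_mm *head,
			    const struct oom_domain *oom_domain)
{
	const struct queued_mm *q;

	for (q = head; q; q = q->next)
		if (!q->reaped && domain_is_under(q->domain, oom_domain))
			return true;
	return false;
}
```

With this shape, a system-wide OOM (the root domain) waits for any pending victim, while a memcg OOM waits only for victims charged to that memcg or its descendants. Whether the domain can be recorded reliably at queueing time is the open part.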