Hi,
based on the feedback from Tetsuo and Vladimir (thanks to you both) I had to change some of my assumptions and rework some patches. I planned to resend later this week, but I guess it will be easier to argue about the code with those changes in place if I resubmit earlier.

The previous version was posted here:
http://lkml.kernel.org/r/1464266415-15558-1-git-send-email-mhocko@xxxxxxxxxx

The following 6 patches should put some order into the very rare cases of an mm shared between processes and finally make the paths which bypass the oom killer oom reapable, and so much more reliable. An mm shared outside of a thread group is rare (either vforked tasks for a short period, use_mm by kernel threads, or the exotic thread model of clone(CLONE_VM) without CLONE_THREAD or CLONE_SIGHAND). Even so, it not only makes the current oom killer logic quite hard to follow and evaluate, it can also lead to weird corner cases. E.g. it is possible to select an oom victim which shares the mm with an unkillable process, or to bypass the oom killer even when other processes sharing the mm are still alive.

Patch 1 drops a bogus task_lock and mm check from oom_adj_write. This can be considered a bug fix with a low impact as nobody has noticed for years.

Patch 2 is a cleanup of the oom_score_adj handling and preparatory work.

Patch 3 enforces oom_score_adj to be consistent between processes sharing the mm, so that they behave consistently with regular thread groups. This can be considered a user visible behavior change because an oom_score_adj update in one thread group will affect others which share the same mm via clone(CLONE_VM). I argue that this should be acceptable because we already have the same behavior for threads in the same thread group, and sharing the mm without the signal struct is just a different model of threading. This is probably the most controversial part of the series; I would like to find some consensus here though.
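
To make the intended semantics of Patch 3 easier to discuss, here is a minimal userspace model of the behavior. It is only a sketch and not the code from the series: struct model_task, mm_id and model_set_oom_score_adj() are hypothetical names invented for illustration, and the model ignores locking, vfork and kernel threads entirely. The only fact taken from the kernel is the valid oom_score_adj range of -1000..1000.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; these are NOT the kernel's structures. */
struct model_task {
	int tgid;		/* thread group id */
	int mm_id;		/* stands in for the task's mm_struct pointer */
	int oom_score_adj;	/* valid range is -1000..1000 */
};

#define MODEL_ADJ_MIN	(-1000)
#define MODEL_ADJ_MAX	1000

/*
 * Set oom_score_adj for tasks[target] and mirror the value to every
 * other modelled task which shares the same mm, regardless of its
 * thread group. Returns the number of tasks updated, or -1 if the
 * target index or the value is out of range.
 */
int model_set_oom_score_adj(struct model_task *tasks, size_t n,
			    size_t target, int adj)
{
	size_t i;
	int updated = 0;

	assert(tasks != NULL);

	if (target >= n || adj < MODEL_ADJ_MIN || adj > MODEL_ADJ_MAX)
		return -1;

	for (i = 0; i < n; i++) {
		/* Tasks with a different mm keep their own value. */
		if (tasks[i].mm_id != tasks[target].mm_id)
			continue;
		tasks[i].oom_score_adj = adj;
		updated++;
	}

	return updated;
}
```

With two modelled thread groups sharing one mm and a third using a different mm, updating the first one changes exactly the two sharers and leaves the third untouched. That is the user visible change described above.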

There were some suggestions to hook a counter or oom_score_adj into the mm_struct, but I feel that this is not necessary right now and we can rely on the proc handler + oom_kill_process to DTRT. I can be convinced otherwise, but I strongly think that whatever we do, userspace has to have a way to see the current oom priority as consistently as possible.

Patch 4 makes sure that no vforked task is selected if it is sharing the mm with an oom unkillable task.

Patch 5 ensures that all tasks sharing the mm are killed, which in turn makes sure that all oom victims are oom reapable.

Patch 6 guarantees that task_will_free_mem will always imply a reapable bypass of the oom killer.

The patchset is based on the current mmotm tree (mmotm-2016-05-27-15-19). I would really appreciate a deep review as this area is full of land mines, but I hope I've made the code much cleaner with fewer kludges.

I am CCing Oleg (sorry, I know you hate this code) but I would feel much better if you double checked my assumptions about locking and vfork behavior.

Michal Hocko (6):
      proc, oom: drop bogus task_lock and mm check
      proc, oom_adj: extract oom_score_adj setting into a helper
      mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
      mm, oom: skip vforked tasks from being selected
      mm, oom: kill all tasks sharing the mm
      mm, oom: fortify task_will_free_mem

 fs/proc/base.c      | 172 ++++++++++++++++++++++++++++++----------------------
 include/linux/mm.h  |   2 +
 include/linux/oom.h |  63 +++++++++++++++++--
 mm/memcontrol.c     |   4 +-
 mm/oom_kill.c       |  82 +++++--------------------
 5 files changed, 176 insertions(+), 147 deletions(-)