On 2019/07/01 20:17, Michal Hocko wrote:
> On Sat 29-06-19 20:24:34, Tetsuo Handa wrote:
>> Since mpol_put_task_policy() in do_exit() sets mempolicy = NULL,
>> mempolicy_nodemask_intersects() considers exited threads (e.g. a process
>> with dying leader and live threads) as eligible. But it is possible that
>> all of live threads are still ineligible.
>>
>> Since has_intersects_mems_allowed() returns true as soon as one of threads
>> is considered eligible, mempolicy_nodemask_intersects() needs to consider
>> exited threads as ineligible. Since exit_mm() in do_exit() sets mm = NULL
>> before mpol_put_task_policy() sets mempolicy = NULL, we can exclude exited
>> threads by checking whether mm is NULL.
>
> Ok, this makes sense. For this change
> Acked-by: Michal Hocko <mhocko@xxxxxxxx>
>

But I realized that this patch was too optimistic. We need to wait for
mm-less threads until MMF_OOM_SKIP is set if the process was already an
OOM victim. If we fail to let the process reach the MMF_OOM_SKIP test,
the process will be ignored by the OOM killer as soon as all threads pass
mm = NULL at exit_mm(), for has_intersects_mems_allowed() returns false
unless MPOL_{BIND,INTERLEAVE} is used.

Well, the problem is that exited threads set mempolicy = NULL prematurely.
Since the bitmap memory for the cpuset_mems_allowed_intersects() path is
freed when __put_task_struct() is called, shouldn't the mempolicy memory
for the mempolicy_nodemask_intersects() path be freed when
__put_task_struct() is called as well?

diff --git a/kernel/exit.c b/kernel/exit.c
index a75b6a7..02a60ea 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -897,7 +897,6 @@ void __noreturn do_exit(long code)
 	exit_tasks_rcu_start();
 	exit_notify(tsk, group_dead);
 	proc_exit_connector(tsk);
-	mpol_put_task_policy(tsk);
 #ifdef CONFIG_FUTEX
 	if (unlikely(current->pi_state_cache))
 		kfree(current->pi_state_cache);
diff --git a/kernel/fork.c b/kernel/fork.c
index 6166790..c17e436 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -726,6 +726,7 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+	mpol_put_task_policy(tsk);
 	cgroup_free(tsk);
 	task_numa_free(tsk);
 	security_task_free(tsk);
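
To illustrate why the lifetime of ->mempolicy matters for the eligibility
check, here is a rough userspace model of the per-thread and per-process
tests. It is only a sketch, not kernel code: struct thread, the unsigned
long node bitmask, thread_intersects(), process_intersects() and
demo_eligible() are all hypothetical stand-ins, and locking, the
mm == NULL check and MMF_OOM_SKIP handling are left out. It assumes a
process whose leader has exited and whose remaining thread is bound
(MPOL_BIND) to node 1, while the OOM-constrained mask covers only node 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for kernel types; not the real definitions. */
enum mpol_mode { MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE };

struct mempolicy {
	enum mpol_mode mode;
	unsigned long nodes;	/* bit n set => node n is in the policy */
};

struct thread {
	const struct mempolicy *mempolicy;	/* NULL once the policy is dropped */
};

/* Models the per-thread test: a thread without a policy counts as eligible. */
static bool thread_intersects(const struct thread *t, unsigned long mask)
{
	if (!t->mempolicy)
		return true;

	switch (t->mempolicy->mode) {
	case MPOL_BIND:
	case MPOL_INTERLEAVE:
		return (t->mempolicy->nodes & mask) != 0;
	default:
		/* MPOL_PREFERRED may fall back to any node. */
		return true;
	}
}

/* Models the per-process test: one eligible thread makes the process eligible. */
static bool process_intersects(const struct thread *threads, size_t n,
			       unsigned long mask)
{
	assert(threads != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (thread_intersects(&threads[i], mask))
			return true;
	return false;
}

/*
 * Exited leader plus one live thread, both originally bound to node 1.
 * drop_policy_at_exit == true models dropping the policy in do_exit();
 * false models keeping it until the task struct itself is freed.
 */
static bool demo_eligible(bool drop_policy_at_exit)
{
	static const struct mempolicy bind_node1 = { MPOL_BIND, 1UL << 1 };
	struct thread threads[2] = {
		{ drop_policy_at_exit ? NULL : &bind_node1 },	/* exited leader */
		{ &bind_node1 },				/* live thread */
	};

	return process_intersects(threads, 2, 1UL << 0);	/* OOM on node 0 */
}
```

In this model demo_eligible(true) reports the process as eligible only
because the exited leader lost its policy, even though the live thread
cannot use node 0; demo_eligible(false) keeps the leader's policy and
reports the process as ineligible.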