On Mon 16-03-20 15:35:10, Roman Gushchin wrote:
> If a task is getting moved out of the OOMing cgroup, it might
> result in unexpected OOM killings if memory.oom.group is used
> anywhere in the cgroup tree.
>
> Imagine the following example:
>
>        A (oom.group = 1)
>       / \
> (OOM) B   C
>
> Let's say B's memory.max is exceeded and it's OOMing. The OOM killer
> selects a task in B as a victim, but someone asynchronously moves
> the task into C.

I can see Reported-by here, does that mean that the race really happened
in real workloads? If yes, I would be really curious. Mostly because
moving tasks outside of the oom domain is quite questionable without
charge migration.

> mem_cgroup_get_oom_group() will iterate over all
> ancestors of C up to the root cgroup. In theory it had to stop
> at the oom_domain level - the memory cgroup which is OOMing.
> But because B is not an ancestor of C, it's not happening.
> Instead it chooses A (because its oom.group is set), and kills
> all tasks in A. This behavior is wrong because the OOM happened in B,
> so there is no reason to kill anything outside.
>
> Fix this by checking if the memory cgroup to which the task belongs
> is a descendant of the oom_domain. If not, memory.oom.group should
> be ignored, and the OOM killer should kill only the victim task.

I was about to suggest storing the memcg in oom_evaluate_task but then I
have realized that this would be more complex and I am not yet sure it
would be that much better after all.

The thing is that killing the selected task makes a lot of sense because
it was the largest consumer. It does not matter that it has run away. On
the other hand, if your B had oom.group = 1 then one could expect that
any OOM killer event in that group would result in the whole group being
torn down. This is however a gray zone because we do emit the MEMCG_OOM
event in B, but the MEMCG_OOM_KILL event will go to the victim's
at-the-time memcg. So the observer of B could think that the oom was
resolved without killing, while the observer of C would see a kill event
without an oom.

That being said, please try to think about the above. I will give it
some more time as well. Killing the selected victim is obviously the
correct thing and your patch does that, so it is correct in that regard,
but I believe that the group oom behavior in the original oom domain
remains an open question.

> Fixes: 3d8b38eb81ca ("mm, oom: introduce memory.oom.group")
> Signed-off-by: Roman Gushchin <guro@xxxxxx>
> Reported-by: Dan Schatzberg <dschatzberg@xxxxxx>
> ---
>  mm/memcontrol.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index daa399be4688..d8c4b7aa4e73 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1930,6 +1930,14 @@ struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
>  	if (memcg == root_mem_cgroup)
>  		goto out;
>
> +	/*
> +	 * If the victim task has been asynchronously moved to a different
> +	 * memory cgroup, we might end up killing tasks outside oom_domain.
> +	 * In this case it's better to ignore memory.group.oom.
> +	 */
> +	if (unlikely(!mem_cgroup_is_descendant(memcg, oom_domain)))
> +		goto out;
> +
>  	/*
>  	 * Traverse the memory cgroup hierarchy from the victim task's
>  	 * cgroup up to the OOMing cgroup (or root) to find the
> --
> 2.24.1

-- 
Michal Hocko
SUSE Labs
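
For readers who want to see the A/B/C example play out, here is a rough
userspace model of the ancestor walk and of the descendant check the
patch adds. This is only a sketch under stated assumptions: struct cg,
cg_is_descendant() and pick_oom_group() are hypothetical stand-ins, not
the kernel's struct mem_cgroup or mem_cgroup_get_oom_group(), and the
root cgroup, locking and reference counting are left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup (illustration only). */
struct cg {
	const char *name;
	struct cg *parent;	/* NULL for the top of this toy tree */
	bool oom_group;		/* models memory.oom.group */
};

/* True if cg is root itself or sits somewhere below it. */
static bool cg_is_descendant(const struct cg *cg, const struct cg *root)
{
	for (; cg; cg = cg->parent)
		if (cg == root)
			return true;
	return false;
}

/*
 * Return the group whose tasks should all be killed, or NULL if only
 * the victim should be killed. check_descendant toggles the behavior
 * the patch introduces.
 */
static struct cg *pick_oom_group(struct cg *victim_cg,
				 struct cg *oom_domain,
				 bool check_descendant)
{
	struct cg *oom_group = NULL;
	struct cg *cg;

	/* The victim has left the OOMing subtree: ignore oom_group. */
	if (check_descendant && !cg_is_descendant(victim_cg, oom_domain))
		return NULL;

	/* Walk up from the victim's cgroup, stopping at the oom domain. */
	for (cg = victim_cg; cg; cg = cg->parent) {
		if (cg->oom_group)
			oom_group = cg;
		if (cg == oom_domain)
			break;
	}
	return oom_group;
}

/* The tree from the changelog: A (oom.group = 1) with children B and C. */
static struct cg A = { "A", NULL, true };
static struct cg B = { "B", &A, false };
static struct cg C = { "C", &A, false };
```

With the victim already moved to C and B as the oom domain,
pick_oom_group(&C, &B, false) walks past B's level and returns &A, which
is the unexpected group kill from the changelog. With the check enabled,
pick_oom_group(&C, &B, true) returns NULL and only the victim is killed.
Note that this model says nothing about the open question above, namely
what B's own oom.group should mean once the victim has left.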