Re: cgroup-aware OOM killer, how to move forward

Michal Hocko <mhocko@xxxxxxxxxx> · Thu, 19 Jul 2018 09:38:43 +0200

On Wed 18-07-18 08:28:50, Roman Gushchin wrote:
> On Wed, Jul 18, 2018 at 10:12:30AM +0200, Michal Hocko wrote:
> > On Tue 17-07-18 13:06:42, Roman Gushchin wrote:
> > > On Tue, Jul 17, 2018 at 09:49:46PM +0200, Michal Hocko wrote:
> > > > On Tue 17-07-18 10:38:45, Roman Gushchin wrote:
> > > > [...]
> > > > > Let me show my proposal with examples. Let's say we have the following hierarchy,
> > > > > and the biggest process (or the process with the highest oom_score_adj) is in D.
> > > > > 
> > > > >   /
> > > > >   |
> > > > >   A
> > > > >   |
> > > > >   B
> > > > >  / \
> > > > > C   D
> > > > > 
> > > > > Let's look at different examples and intended behavior:
> > > > > 1) system-wide OOM
> > > > >   - default settings: the biggest process is killed
> > > > >   - D/memory.group_oom=1: all processes in D are killed
> > > > >   - A/memory.group_oom=1: all processes in A are killed
> > > > > 2) memcg oom in B
> > > > >   - default settings: the biggest process is killed
> > > > >   - A/memory.group_oom=1: the biggest process is killed
> > > > 
> > > > Huh? Why would you even consider A here when the oom is below it?
> > > > /me confused
> > > 
> > > I do not.
> > > This is exactly a counter-example: A's memory.group_oom
> > > is not considered at all in this case,
> > > because A is above ooming cgroup.
> > 
> > OK, it confused me.
> > 
> > > > 
> > > > >   - B/memory.group_oom=1: all processes in B are killed
> > > > 
> > > >     - B/memory.group_oom=0 &&
> > > > >   - D/memory.group_oom=1: all processes in D are killed
> > > > 
> > > > What about?
> > > >     - B/memory.group_oom=1 && D/memory.group_oom=0
> > > 
> > > All tasks in B are killed.
> > 
> > So essentially find a task, traverse the memcg hierarchy from the
> > victim's memcg up to the oom root as long as memcg.group_oom == 1?
> > If the resulting memcg.group_oom == 1 then kill the whole subtree.
> > Right?
> 
> Yes.
> 
> > 
> > > Group_oom set to 1 means that the workload can't tolerate
> > > killing of a random process, so in this case it's better
> > > to guarantee consistency for B.
> > 
> > OK, but then if D itself is OOM then we do not care about consistency
> > all of a sudden? I have a hard time thinking of a sensible use case.
> 
> I mean that if, while traversing the hierarchy up to the oom root, we
> meet a memcg with group_oom set to 0, we shouldn't stop traversing.

Well, I am still struggling with the semantics of a group, no-group,
group configuration. Why does it make any sense? In other words, when
can we consider a cgroup to be an indivisible workload in one oom context
while it is fine for it to lose a head or an arm in another?

Anyway, your previous implementation would allow the same configuration
as well, so this is nothing really new. The new selection policy you are
proposing just makes it more obvious. So this is not a roadblock, but I
think we should think really hard about it.
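
To make sure we are talking about the same rule, here is a minimal
Python model of the selection as I understand it from the examples
above. It is only a sketch, not kernel code: Memcg, kill_scope() and
make_hierarchy() are hypothetical names used for illustration, and I am
assuming that the root cgroup has no group_oom knob. It follows the
"do not stop at group_oom == 0" variant, i.e. the topmost memcg with
group_oom set between the victim and the oom root wins.

```python
class Memcg:
    """Toy stand-in for a memory cgroup (hypothetical, not a kernel API)."""

    def __init__(self, name, parent=None, group_oom=False):
        self.name = name
        self.parent = parent
        self.group_oom = group_oom


def kill_scope(victim_memcg, oom_root):
    """Return the memcg whose whole subtree gets killed.

    Walk from the victim's memcg up to and including the oom root and
    remember the topmost memcg that has group_oom set. A memcg with
    group_oom == 0 on the way does not stop the walk. Returns None when
    only the selected task itself should be killed.
    """
    scope = None
    memcg = victim_memcg
    while memcg is not None:
        if memcg.group_oom:
            scope = memcg
        if memcg is oom_root:
            break  # anything above the oom root is never considered
        memcg = memcg.parent
    return scope


def make_hierarchy(**group_oom):
    """Build the / -> A -> B -> {C, D} tree from the examples above."""
    root = Memcg("/")  # assumption: the root has no group_oom knob
    a = Memcg("A", root, group_oom.get("A", False))
    b = Memcg("B", a, group_oom.get("B", False))
    c = Memcg("C", b, group_oom.get("C", False))
    d = Memcg("D", b, group_oom.get("D", False))
    return {m.name: m for m in (root, a, b, c, d)}


if __name__ == "__main__":
    # memcg oom in B, victim in D, B/group_oom=1 and D/group_oom=0
    h = make_hierarchy(B=True)
    scope = kill_scope(h["D"], h["B"])
    print(scope.name if scope else "single task")
```

With D/group_oom=1 and B/group_oom=0 the same call returns D, so D is
treated as indivisible for an oom in B, while with B/group_oom=1 it is
only a part of the larger unit. That is exactly the group, no-group,
group configuration I am worried about.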
-- 
Michal Hocko
SUSE Labs