Re: [PATCH] mm, memcg: introduce per memcg oom_score_adj

Michal Hocko <mhocko@xxxxxxxx> · Thu, 22 Aug 2019 12:59:18 +0200

On Thu 22-08-19 17:34:54, Yafang Shao wrote:
> On Thu, Aug 22, 2019 at 5:19 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > On Thu 22-08-19 04:56:29, Yafang Shao wrote:
> > > - Why we need a per memcg oom_score_adj setting ?
> > > This is easy to deploy and very convenient for container.
> > > When we use container, we always treat memcg as a whole, if we have a per
> > > memcg oom_score_adj setting we don't need to set it process by process.
> >
> > Why cannot an initial process in the cgroup set the oom_score_adj and
> > other processes just inherit it from there? This sounds trivial to do
> > with a startup script.
> >
> 
> That is what we used to do before.
> But it can't apply to the running containers.
> 
> 
> > > It will make the user exhausted to set it to all processes in a memcg.
> >
> > Then let's have scripts to set it as they are less prone to exhaustion
> > ;)
> 
> That is not easy to deploy it to the production environment.

What is hard about a simple loop over the task list exported by the
cgroup that applies a value to each task's oom_score_adj?
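
For illustration, such a loop could look roughly like the sketch below.
It assumes the task list is read from the cgroup's cgroup.procs file and
that the value is written to /proc/<pid>/oom_score_adj for each listed
process. The function name set_memcg_oom_score_adj and the proc_root
parameter are hypothetical, introduced only for this example. Child
cgroups are not visited, and lowering the value generally requires
sufficient privileges.

```python
import os


def set_memcg_oom_score_adj(cgroup_dir, value, proc_root="/proc"):
    """Write `value` to oom_score_adj of every process listed in a cgroup.

    Hypothetical helper for illustration only. Returns a tuple
    (updated_pids, skipped_pids); a pid is skipped when the process
    has exited between reading the list and writing the value.
    """
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be within [-1000, 1000]")

    # cgroup.procs holds one PID per line for processes in this cgroup.
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        pids = [int(line) for line in f if line.strip()]

    updated, skipped = [], []
    for pid in pids:
        path = os.path.join(proc_root, str(pid), "oom_score_adj")
        try:
            with open(path, "w") as f:
                f.write(str(value))
        except (FileNotFoundError, ProcessLookupError):
            # The process went away in the meantime; nothing to adjust.
            skipped.append(pid)
        else:
            updated.append(pid)
    return updated, skipped
```

Processes forked after the update inherit the new value from their
parent, but a process that appears between reading the list and the
parent being updated can be missed, so a second pass may be needed on a
busy container.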

[...]

> > Besides that. What is the hierarchical semantic? Say you have hierarchy
> >         A (oom_score_adj = 1000)
> >          \
> >           B (oom_score_adj = 500)
> >            \
> >             C (oom_score_adj = -1000)
> >
> > put the above summing up aside for now and just focus on the memcg
> > adjusting?
> 
> I think that there's no conflict between children's oom_score_adj,
> that is different from memory.max.
> So it is not necessary to consider the parent's oom_score_adj.

Each exported cgroup tuning _has_ to be hierarchical so that an admin
can override children setting in order to safely delegate the
configuration.

Last but not least, oom_score_adj has proven to be a terrible interface
that is essentially close to unusable for anything outside of extreme
values (-1000 and very arguably 1000). Making it cgroup aware without
changing oom victim selection to consider cgroup as a whole will also be
a pain so I am afraid that this is a dead end path.

We can discuss cgroup aware oom victim selection for sure and there are
certainly reasonable usecases to back that functionality. Please refer
to discussion from 2017/2018 (dubbed as "cgroup-aware OOM killer"). But
be warned this is a tricky area and there was a fundamental disagreement
on how things should be classified without a clear way to reach
consensus. What we have right now is the only agreement we could reach.
It is quite possible that any more clever cgroup aware oom selection
has to be implemented in userspace with an understanding of the
specific workload.
-- 
Michal Hocko
SUSE Labs