Re: [PATCH] mm, memcg: introduce per memcg oom_score_adj

On Thu, Aug 22, 2019 at 12:59:18PM +0200, Michal Hocko wrote:
> On Thu 22-08-19 17:34:54, Yafang Shao wrote:
> > On Thu, Aug 22, 2019 at 5:19 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > >
> > > On Thu 22-08-19 04:56:29, Yafang Shao wrote:
> > > > - Why do we need a per memcg oom_score_adj setting?
> > > > It is easy to deploy and very convenient for containers.
> > > > When we use containers, we always treat the memcg as a whole; if we have a per
> > > > memcg oom_score_adj setting we don't need to set it process by process.
> > >
> > > Why cannot an initial process in the cgroup set the oom_score_adj and
> > > other processes just inherit it from there? This sounds trivial to do
> > > with a startup script.
> > >
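
(For illustration only, not from Michal's mail: a minimal sketch of such a
startup wrapper; the entrypoint path below is made up.)

/* Set oom_score_adj once, then exec the real workload; every process
 * forked afterwards inherits the value across fork()/execve(). */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        FILE *f = fopen("/proc/self/oom_score_adj", "w");

        if (!f) {
                perror("oom_score_adj");
                return 1;
        }
        fprintf(f, "%d\n", -500);       /* example value */
        fclose(f);

        execl("/usr/local/bin/my-entrypoint", "my-entrypoint", (char *)NULL);
        perror("execl");
        return 1;
}
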
> > 
> > That is what we used to do before.
> > But it can't be applied to already running containers.
> > 
> > 
> > > > It is exhausting for the user to set it for all processes in a memcg.
> > >
> > > Then let's have scripts to set it as they are less prone to exhaustion
> > > ;)
> > 
> > That is not easy to deploy in a production environment.
> 
> What is hard about a simple loop over the tasklist exported by the cgroup
> that applies a value to oom_score_adj?
> 
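
(Again just an illustration, not part of the original mail: roughly what
such a loop could look like, assuming the container's cgroup path is known;
the path in the comment is made up.)

/* Apply one oom_score_adj value to every task listed in a cgroup's
 * cgroup.procs file, e.g.
 * set_cgroup_oom_score_adj("/sys/fs/cgroup/mycontainer/cgroup.procs", -500);
 * This also works for containers that are already running. */
#include <stdio.h>

static int set_cgroup_oom_score_adj(const char *cgroup_procs, int adj)
{
        FILE *procs = fopen(cgroup_procs, "r");
        char path[64];
        int pid;

        if (!procs)
                return -1;

        while (fscanf(procs, "%d", &pid) == 1) {
                FILE *f;

                snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", pid);
                f = fopen(path, "w");
                if (!f)
                        continue;       /* task may have exited */
                fprintf(f, "%d\n", adj);
                fclose(f);
        }
        fclose(procs);
        return 0;
}
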
> [...]
> 
> > > Besides that, what is the hierarchical semantic? Say you have the hierarchy
> > >         A (oom_score_adj = 1000)
> > >          \
> > >           B (oom_score_adj = 500)
> > >            \
> > >             C (oom_score_adj = -1000)
> > >
> > > put the above summing up aside for now and just focus on the memcg
> > > adjusting?
> > 
> > I think that there's no conflict between the children's oom_score_adj;
> > that is different from memory.max.
> > So it is not necessary to consider the parent's oom_score_adj.
> 
> Each exported cgroup tuning _has_ to be hierarchical so that an admin
> can override the children's settings in order to safely delegate the
> configuration.

+1
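
To illustrate with an existing hierarchical tuning (values made up): if

  /sys/fs/cgroup/A/memory.max   = 1G
  /sys/fs/cgroup/A/B/memory.max = 2G

then B is still effectively capped at 1G; the ancestor's setting always
wins, which is what makes delegating the B subtree safe. A per memcg
oom_score_adj would need an equally well defined parent-overrides-child
rule.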

> 
> Last but not least, oom_score_adj has proven to be a terrible interface
> that is essentially unusable for anything outside of the extreme
> values (-1000 and, very arguably, 1000). Making it cgroup aware without
> changing oom victim selection to consider the cgroup as a whole will also be
> a pain, so I am afraid that this is a dead end path.
> 
> We can discuss cgroup aware oom victim selection for sure, and there are
> certainly reasonable usecases to back that functionality. Please refer
> to the discussion from 2017/2018 (dubbed "cgroup-aware OOM killer"). But
> be warned this is a tricky area and there was a fundamental disagreement
> on how things should be classified, without a clear way to reach
> consensus. What we have right now is the only agreement we could reach.
> It is quite possible that any more clever cgroup aware oom
> selection has to be implemented in userspace, with an understanding
> of the specific workload.

I think the agreement is that the main goal of the kernel OOM killer is to
prevent various memory dead- and live-lock scenarios, and that everything
involving policies that define which workloads are preferable over others
should be kept in userspace.

So the biggest issue with the kernel OOM killer right now is that it often
kicks in too late, if at all (which has been discussed recently). And it
looks like the best answer to that is PSI, so I'd really look into that
direction to enhance it.
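
To make the PSI direction a bit more concrete, a rough userspace sketch
(cgroup path and thresholds are made up, and it assumes a kernel with PSI
triggers; the policy of what to do on wakeup stays entirely in userspace):

/* Arm a PSI trigger on a cgroup's memory.pressure (or the system-wide
 * /proc/pressure/memory) and wake up whenever tasks were stalled on
 * memory for more than 150ms within a 1s window. */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char trig[] = "some 150000 1000000";
        struct pollfd pfd;

        pfd.fd = open("/sys/fs/cgroup/mycontainer/memory.pressure",
                      O_RDWR | O_NONBLOCK);
        if (pfd.fd < 0 || write(pfd.fd, trig, strlen(trig) + 1) < 0) {
                perror("psi trigger");
                return 1;
        }
        pfd.events = POLLPRI;

        while (poll(&pfd, 1, -1) > 0) {
                if (pfd.revents & POLLERR)
                        break;          /* cgroup went away */
                if (pfd.revents & POLLPRI)
                        printf("memory pressure threshold crossed\n");
        }
        close(pfd.fd);
        return 0;
}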

Thanks!




