Re: [RFC PATCH v2 0/7] cgroup-aware OOM killer

Michal Hocko <mhocko@xxxxxxxxxx> · Mon, 26 Jun 2017 13:55:31 +0200

On Fri 23-06-17 19:39:46, Roman Gushchin wrote:
> On Fri, Jun 23, 2017 at 03:43:24PM +0200, Michal Hocko wrote:
> > On Thu 22-06-17 18:10:03, Roman Gushchin wrote:
> > > Hi, Michal!
> > > 
> > > Thank you very much for the review. I've tried to address your
> > > comments in v3 (sent yesterday), so that is why it took some time to reply.
> > 
> > I will try to look at it sometime next week, hopefully.
> 
> Thanks!
> 
> > > > - You seem to completely ignore per task oom_score_adj and override it
> > > >   by the memcg value. This makes some sense but it can lead to an
> > > >   unexpected behavior when somebody relies on the original behavior.
> > > >   E.g. a workload that would corrupt data when killed unexpectedly and
> > > >   so it is protected by OOM_SCORE_ADJ_MIN. Now this assumption will
> > > >   break when running inside a container. I do not have a good answer
> > > >   what is the desirable behavior and maybe there is no universal answer.
> > > >   Maybe you just do not kill those tasks? But then you have to be
> > > >   careful when selecting a memcg victim. Hairy...
> > > 
> > > I do not ignore it completely, but it matters only for root cgroup tasks
> > > and inside a cgroup when oom_kill_all_tasks is off.
> > > 
> > > I believe that the cgroup v2 requirement is good enough. I mean you can't
> > > move from v1 to v2 without changing cgroup settings, and if we provide
> > > per-cgroup oom_score_adj, it will be enough to reproduce the old behavior.
> > > 
> > > Also, if you think it's necessary, I can add a sysctl to turn the cgroup-aware
> > > oom killer off completely and provide compatibility mode.
> > > We can't really save the old system-wide behavior of per-process oom_score_adj,
> > > it makes no sense in the containerized environment.
> > 
> > So what are you going to do with those applications that simply cannot
> > be killed and which set OOM_SCORE_ADJ_MIN explicitly? Are they
> > unsupported? How does a user find out? One way around this could be
> > simply not to kill tasks with OOM_SCORE_ADJ_MIN.
> 
> They won't be killed by cgroup OOM, but under some circumstances can be killed
> by the global OOM (e.g. there are no other tasks in the selected cgroup,
> cgroup v2 is used, and per-cgroup oom score adjustment is not set).

Hmm, mem_cgroup_select_oom_victim will happily select a memcg which
contains OOM_SCORE_ADJ_MIN tasks because it ignores per-task score adj.
So the memcg OOM killer can kill those tasks AFAICS. But that is not all
that important. Because...

> I believe that per-process oom_score_adj should not play any role outside
> of the containing cgroup; it's a violation of isolation.
> 
> Right now, if tasks with oom_score_adj=-1000 are eating all memory in a
> cgroup, they will loop forever; the OOM killer can't fix this.

... Yes, and that is the price we have to pay for the hard requirement
that the oom killer never kills an OOM_SCORE_ADJ_MIN task. It is hard to
change that without breaking existing userspace which relies on that
configuration for protection from an unexpected SIGKILL.
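
For reference, this is roughly what such userspace does today. A minimal
sketch only: the helper name write_oom_score_adj() is made up for
illustration, and the limits are redefined locally rather than pulled
from the uapi header. It writes the adjustment to an oom_score_adj file,
normally /proc/self/oom_score_adj.

```c
#include <errno.h>
#include <stdio.h>

/* Local copies of the limits (assumed to match the kernel's values). */
#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX 1000

/*
 * Hypothetical helper: write `adj` to the oom_score_adj file at `path`
 * (normally "/proc/self/oom_score_adj").
 * Returns 0 on success, -1 with errno set on failure.
 */
static int write_oom_score_adj(const char *path, int adj)
{
	FILE *f;
	int saved;

	/* Reject values outside the documented range up front. */
	if (adj < OOM_SCORE_ADJ_MIN || adj > OOM_SCORE_ADJ_MAX) {
		errno = EINVAL;
		return -1;
	}

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fprintf(f, "%d\n", adj) < 0) {
		saved = errno;
		fclose(f);
		errno = saved;
		return -1;
	}

	/* The data is only flushed to the file here, so check fclose() too. */
	if (fclose(f) != 0)
		return -1;

	return 0;
}
```

A data-sensitive daemon would call something like
write_oom_score_adj("/proc/self/oom_score_adj", OOM_SCORE_ADJ_MIN) early
at startup. Lowering the value generally needs CAP_SYS_RESOURCE, so the
call can fail for unprivileged tasks and the return value should be
checked.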

-- 
Michal Hocko
SUSE Labs
