On Thu 22-06-17 18:10:03, Roman Gushchin wrote:
> Hi, Michal!
>
> Thank you very much for the review. I've tried to address your
> comments in v3 (sent yesterday), which is why it took some time to
> reply.

I will try to look at it sometime next week hopefully.

[...]

> > - You seem to completely ignore the per-task oom_score_adj and
> >   override it by the memcg value. This makes some sense but it can
> >   lead to unexpected behavior when somebody relies on the original
> >   semantics. E.g. a workload that would corrupt data when killed
> >   unexpectedly and so is protected by OOM_SCORE_ADJ_MIN. Now this
> >   assumption will break when running inside a container. I do not
> >   have a good answer what the desirable behavior is and maybe there
> >   is no universal answer. Maybe you just do not kill those tasks?
> >   But then you have to be careful when selecting a memcg victim.
> >   Hairy...
>
> I do not ignore it completely, but it matters only for root cgroup
> tasks and inside a cgroup when oom_kill_all_tasks is off.
>
> I believe that the cgroup v2 requirement is good enough. I mean, you
> can't move from v1 to v2 without changing cgroup settings, and if we
> provide a per-cgroup oom_score_adj, that will be enough to reproduce
> the old behavior.
>
> Also, if you think it's necessary, I can add a sysctl to turn the
> cgroup-aware oom killer off completely and provide a compatibility
> mode. We can't really keep the old system-wide behavior of
> per-process oom_score_adj; it makes no sense in a containerized
> environment.

So what are you going to do with those applications that simply cannot
be killed and which set OOM_SCORE_ADJ_MIN explicitly? Are they
unsupported? How does a user find out? One way around this could be to
simply not kill tasks with OOM_SCORE_ADJ_MIN, even when
oom_kill_all_tasks is set; see the sketch below.
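Something along these lines, as a rough sketch only (completely
untested; oom_kill_memcg_member and oom_kill_task are illustrative
names for the per-task callback and the kill primitive your series
would hand to mem_cgroup_scan_tasks, not actual code from the series):

	static int oom_kill_memcg_member(struct task_struct *task, void *arg)
	{
		/*
		 * Honor OOM_SCORE_ADJ_MIN even in the kill-all mode so
		 * that explicitly protected tasks survive a memcg-wide
		 * kill.
		 */
		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
			return 0;	/* skip this task, keep iterating */

		get_task_struct(task);
		oom_kill_task(task, arg);	/* illustrative name */
		put_task_struct(task);
		return 0;	/* 0 means continue the scan */
	}

That would keep the OOM_SCORE_ADJ_MIN contract at the price of
oom_kill_all_tasks no longer being literally all tasks, which the
documentation would have to spell out.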
> > - While we are at it, oom_score_adj has turned out to be quite
> >   unusable for any sensible oom prioritization in my experience.
> >   Practically it reduced to disabling/enforcing a task for
> >   selection. The scale is quite small as well. There certainly is a
> >   need for prioritization and maybe a completely different API
> >   would be better. Maybe a simple priority in the (0, infinity)
> >   range would be better. Priority could be used either as the only
> >   criterion or as a tie breaker when the consumption of several
> >   memcgs is too close (this could be implemented for each strategy
> >   in a different way if we go the modules way).
>
> I had such a version, but then the question of how to compare root
> cgroup tasks and cgroups becomes even harder. But if you have any
> ideas here, that would be just great. I do not like this
> [-1000, 1000] range at all.

Dunno, would have to think about that some more.

> > - oom_kill_all_tasks should be hierarchical and consistent within a
> >   hierarchy. Or maybe it should be applicable only to memcgs with
> >   tasks (leaf cgroups). Although selecting a memcg higher in the
> >   hierarchy and killing all tasks in that hierarchy makes some
> >   sense as well IMHO. Say you delegate a hierarchy to an
> >   unprivileged user and still want to contain that user.
>
> It's already hierarchical, or did I miss something? Please explain
> what you mean. If turned on for any cgroup (except root), it forces
> the oom killer to treat the whole cgroup as a single entity and kill
> all belonging tasks if it is selected to be the oom victim.

My fault! I can see that mem_cgroup_scan_tasks will iterate over the
whole hierarchy (with oc->chosen_memcg as the root). Setting
oom_kill_all_tasks differently than up the hierarchy can be confusing,
but it makes sense and it is consistent with other knobs.
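For reference, this is the iteration I overlooked; mem_cgroup_scan_tasks
as I remember it from mm/memcontrol.c (quoting from memory, so modulo
details):

	int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
				  int (*fn)(struct task_struct *, void *),
				  void *arg)
	{
		struct mem_cgroup *iter;
		int ret = 0;

		/* walk the whole subtree rooted at the given memcg... */
		for_each_mem_cgroup_tree(iter, memcg) {
			struct css_task_iter it;
			struct task_struct *task;

			/* ...and visit every task attached to each of them */
			css_task_iter_start(&iter->css, &it);
			while (!ret && (task = css_task_iter_next(&it)))
				ret = fn(task, arg);
			css_task_iter_end(&it);
			if (ret) {
				/* a non-zero return from fn aborts the scan */
				mem_cgroup_iter_break(memcg, iter);
				break;
			}
		}
		return ret;
	}

--
Michal Hocko
SUSE Labs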