Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

David Rientjes <rientjes@xxxxxxxxxx> · Tue, 10 Oct 2017 14:13:00 -0700 (PDT)

On Tue, 10 Oct 2017, Roman Gushchin wrote:

> > This seems to unfairly bias the root mem cgroup depending on process size.  
> > It isn't treated fairly as a leaf mem cgroup if they are being compared 
> > based on different criteria: the root mem cgroup as (mostly) the largest 
> > rss of a single process vs leaf mem cgroups as all anon, unevictable, and 
> > unreclaimable slab pages charged to it by all processes.
> > 
> > I imagine a configuration where the root mem cgroup has 100 processes 
> > attached each with rss of 80MB, compared to a leaf cgroup with 100 
> > processes of 1MB rss each.  How does this logic prevent repeatedly oom 
> > killing the processes of 1MB rss?
> > 
> > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't 
> > quite fair, it can simply hide large processes from being selected.  Users 
> > who configure cgroups in a unified hierarchy for other resource 
> > constraints are penalized for this choice even though the mem cgroup with 
> > 100 processes of 1MB rss each may not be limited itself.
> > 
> > I think for this comparison to be fair, it requires accounting for the 
> > root mem cgroup itself or for a different accounting methodology for leaf 
> > memory cgroups.
> 
> This is basically a workaround, because we don't have necessary stats for root
> memory cgroup. If we'll start gathering them at some point, we can change this
> and treat root memcg exactly as other leaf cgroups.
> 

I understand why it currently cannot be an apples vs apples comparison 
without, as I suggest in the last paragraph, that the same accounting is 
done for the root mem cgroup, which is intuitive if it is to be considered 
on the same basis as leaf mem cgroups.

I understand for the design to work that leaf mem cgroups and the root mem 
cgroup must be compared if processes can be attached to the root mem 
cgroup.  My point is that it is currently completely unfair as I've 
stated: you can have 10000 processes attached to the root mem cgroup with 
rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each 
and the oom killer is going to target the leaf mem cgroup as a result of 
this apples vs oranges comparison.

In case it's not clear, the 10000 processes of 80MB rss each is the most 
likely contributor to a system-wide oom kill.  Unfortunately, the 
heuristic introduced by this patchset is broken wrt a fair comparison of 
the root mem cgroup usage.

> Or, if someone will come with an idea of a better approximation, it can be
> implemented as a separate enhancement on top of the initial implementation.
> This is more than welcome.
> 

We don't need a better approximation, we need a fair comparison.  The 
heuristic that this patchset is implementing is based on the usage of 
individual mem cgroups.  For the root mem cgroup to be considered 
eligible, we need to understand its usage.  That usage is _not_ what is 
implemented by this patchset, which is the largest rss of a single 
attached process.  This, in fact, is not an "approximation" at all.  In 
the example of 10000 processes attached with 80MB rss each, the usage of 
the root mem cgroup is _not_ 80MB.

I'll restate that oom killing a process is a last resort for the kernel, 
but it also must be able to make a smart decision.  Targeting dozens of 
1MB processes instead of 80MB processes because of a shortcoming in this 
implementation is not the appropriate selection, it's the opposite of the 
correct selection.

> > I'll reiterate what I did on the last version of the patchset: considering 
> > only leaf memory cgroups easily allows users to defeat this heuristic and 
> > bias against all of their memory usage up to the largest process size 
> > amongst the set of processes attached.  If the user creates N child mem 
> > cgroups for their N processes and attaches one process to each child, the 
> > _only_ thing this achieved is to defeat your heuristic and prefer other 
> > leaf cgroups simply because those other leaf cgroups did not do this.
> > 
> > Effectively:
> > 
> > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > 
> > will radically shift the heuristic from a score of all anonymous + 
> > unevictable memory for all processes to a score of the largest anonymous +
> > unevictable memory for a single process.  There is no downside or 
> > ramifaction for the end user in doing this.  When comparing cgroups based 
> > on usage, it only makes sense to compare the hierarchical usage of that 
> > cgroup so that attaching processes to descendants or splitting the 
> > implementation of a process into several smaller individual processes does 
> > not allow this heuristic to be defeated.
> 
> To all previously said words I can only add that cgroup v2 allows to limit
> the amount of cgroups in the sub-tree:
> 1a926e0bbab8 ("cgroup: implement hierarchy limits").
> 

So the solution to 

for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done

evading all oom kills for your mem cgroup is to limit the number of 
cgroups that can be created by the user?  With a unified cgroup hierarchy, 
that doesn't work well if I wanted to actually constrain these individual 
processes to different resource limits like cpu usage.  In fact, the user 
may not know it is effectively evading the oom killer entirely because it 
has constrained the cpu of individual processes because its a side-effect 
of this heuristic.

You chose not to respond to my reiteration of userspace having absolutely 
no control over victim selection with the new heuristic without setting 
all processes to be oom disabled via /proc/pid/oom_score_adj.  If I have a 
very important job that is running on a system that is really supposed to 
use 80% of memory, I need to be able to specify that it should not be oom 
killed based on user goals.  Setting all processes to be oom disabled in 
the important mem cgroup to avoid being oom killed unless absolutely 
necessary in a system oom condition is not a robust solution: (1) the mem 
cgroup livelocks if it reaches its own mem cgroup limit and (2) the system 
panic()'s if these preferred mem cgroups are the only consumers left on 
the system.  With overcommit, both of these possibilities exist in the 
wild and the problem is only a result of the implementation detail of this 
patchset.

For these reasons: unfair comparison of root mem cgroup usage to bias 
against that mem cgroup from oom kill in system oom conditions, the 
ability of users to completely evade the oom killer by attaching all 
processes to child cgroups either purposefully or unpurposefully, and the 
inability of userspace to effectively control oom victim selection:

Nacked-by: David Rientjes <rientjes@xxxxxxxxxx>

> > This is racy because mem_cgroup_select_oom_victim() found an eligible 
> > oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible 
> > process but mem_cgroup_scan_task(oc->chosen_memcg) did not.  It means if a 
> > process cannot be killed because of oom_unkillable_task(), the only 
> > eligible processes moved or exited, or the /proc/pid/oom_score_adj of the 
> > eligible processes changed, we end up falling back to the complete 
> > tasklist scan.  It would be better for oom_evaluate_memcg() to consider 
> > oom_unkillable_task() and also retry in the case where 
> > oom_kill_memcg_victim() returns NULL.
> 
> I agree with you here. The fallback to the existing mechanism is implemented
> to be safe for sure, especially in a case of a global OOM. When we'll get
> more confidence in cgroup-aware OOM killer reliability, we can change this
> behavior. Personally, I would prefer to get rid of looking at all tasks just
> to find a pre-existing OOM victim, but it might be quite tricky to implement.
> 

I'm not sure what this has to do with confidence in this patchset's 
reliability?  The race obviously exists: mem_cgroup_select_oom_victim() 
found an eligible process in oc->chosen_memcg but it was either ineligible 
later because of oom_unkillable_task(), it moved, or it exited.  It's a 
race.  For users who opt-in to this new heuristic, they should not be 
concerned with a process exiting and thus killing a completely unexpected 
process from an unexpected memcg when it should be possible to retry and 
select the correct victim.

It's much better to document and state to the user what they are opting-in 
to and clearly define how a victim is chosen with the new heuristic and 
then implement that so it works correctly.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html