On Tue, 10 Oct 2017, Roman Gushchin wrote: > > We don't need a better approximation, we need a fair comparison. The > > heuristic that this patchset is implementing is based on the usage of > > individual mem cgroups. For the root mem cgroup to be considered > > eligible, we need to understand its usage. That usage is _not_ what is > > implemented by this patchset, which is the largest rss of a single > > attached process. This, in fact, is not an "approximation" at all. In > > the example of 10000 processes attached with 80MB rss each, the usage of > > the root mem cgroup is _not_ 80MB. > > It's hard to imagine a "healthy" setup with 10000 process in the root > memory cgroup, and even if we kill 1 process we will still have 9999 > remaining process. I agree with you at some point, but it's not > a real world example. > It's an example that illustrates the problem with the unfair comparison between the root mem cgroup and leaf mem cgroups. It's unfair to compare [largest rss of a single process attached to a cgroup] to [anon + unevictable + unreclaimable slab usage of a cgroup]. It's not an approximation, as previously stated: the usage of the root mem cgroup is not 100MB if there are 10 such processes attached to the root mem cgroup, it's off by orders of magnitude. For the root mem cgroup to be treated equally as a leaf mem cgroup as this patchset proposes, it must have a fair comparison. That can be done by accounting memory to the root mem cgroup in the same way it is to leaf mem cgroups. But let's move the discussion forward to fix it. To avoid necessarily accounting memory to the root mem cgroup, have we considered if it is even necessary to address the root mem cgroup? For the users who opt-in to this heuristic, would it be possible to discount the root mem cgroup from the heuristic entirely so that oom kills originate from leaf mem cgroups? Or, perhaps better, oom kill from non-memory.oom_group cgroups only if the victim rss is greater than an eligible victim rss attached to the root mem cgroup? > > For these reasons: unfair comparison of root mem cgroup usage to bias > > against that mem cgroup from oom kill in system oom conditions, the > > ability of users to completely evade the oom killer by attaching all > > processes to child cgroups either purposefully or unpurposefully, and the > > inability of userspace to effectively control oom victim selection: > > > > Nacked-by: David Rientjes <rientjes@xxxxxxxxxx> > > So, if we'll sum the oom_score of tasks belonging to the root memory cgroup, > will it fix the problem? > > It might have some drawbacks as well (especially around oom_score_adj), > but it's doable, if we'll ignore tasks which are not owners of their's mm struct. > You would be required to discount oom_score_adj because the heuristic doesn't account for oom_score_adj when comparing the anon + unevictable + unreclaimable slab of leaf mem cgroups. This wouldn't result in the correct victim selection in real-world scenarios where processes attached to the root mem cgroup are vital to the system and not part of any user job, i.e. they are important system daemons and the "activity manager" responsible for orchestrating the cgroup hierarchy. It's also still unfair because it now compares [sum of rss of processes attached to a cgroup] to [anon + unevictable + unreclaimable slab usage of a cgroup]. RSS isn't going to be a solution, regardless if its one process or all processes, if it's being compared to more types of memory in leaf cgroups. If we really don't want root mem cgroup accounting so this is a fair comparison, I think the heuristic needs to special case the root mem cgroup either by discounting root oom kills if there are eligible oom kills from leaf cgroups (the user would be opting-in to this behavior) or comparing the badness of a victim from a leaf cgroup to the badness of a victim from the root cgroup when deciding which to kill and allow the user to protect root mem cgroup processes with oom_score_adj. That aside, all of this has only addressed one of the three concerns with the patchset. I believe the solution to avoid allowing users to circumvent oom kill is to account usage up the hierarchy as you have done in the past. Cgroup hierarchies can be managed by the user so they can create their own subcontainers, this is nothing new, and I would hope that you wouldn't limit your feature to only a very specific set of usecases. That may be your solution for the root mem cgroup itself: if the hierarchical usage of all top-level mem cgroups is known, it's possible to find the root mem cgroup usage by subtraction, you are using stats that are global vmstats in your heuristic. Accounting usage up the hierarchy avoids the first two concerns with the patchset. It allows you to implicitly understand the usage of the root mem cgroup itself, and does not allow users to circumvent oom kill by creating subcontainers, either purposefully or not. The third concern, userspace influence, can allow users to attack leaf mem cgroups deeper in the tree if it is using more memory than expected, but the hierarchical usage is lower at the top-level. That is the only objection that I have seen to using hierarchical usage: there may be a single cgroup deeper in the tree that avoids oom kill because another hierarchy has a higher usage. This can trivially be addressed either by oom priorities or an adjustment, just like oom_score_adj, on cgroup usage. -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html