On Wed, Oct 11, 2017 at 01:21:47PM -0700, David Rientjes wrote: > On Tue, 10 Oct 2017, Roman Gushchin wrote: > > > > We don't need a better approximation, we need a fair comparison. The > > > heuristic that this patchset is implementing is based on the usage of > > > individual mem cgroups. For the root mem cgroup to be considered > > > eligible, we need to understand its usage. That usage is _not_ what is > > > implemented by this patchset, which is the largest rss of a single > > > attached process. This, in fact, is not an "approximation" at all. In > > > the example of 10000 processes attached with 80MB rss each, the usage of > > > the root mem cgroup is _not_ 80MB. > > > > It's hard to imagine a "healthy" setup with 10000 process in the root > > memory cgroup, and even if we kill 1 process we will still have 9999 > > remaining process. I agree with you at some point, but it's not > > a real world example. > > > > It's an example that illustrates the problem with the unfair comparison > between the root mem cgroup and leaf mem cgroups. It's unfair to compare > [largest rss of a single process attached to a cgroup] to > [anon + unevictable + unreclaimable slab usage of a cgroup]. It's not an > approximation, as previously stated: the usage of the root mem cgroup is > not 100MB if there are 10 such processes attached to the root mem cgroup, > it's off by orders of magnitude. > > For the root mem cgroup to be treated equally as a leaf mem cgroup as this > patchset proposes, it must have a fair comparison. That can be done by > accounting memory to the root mem cgroup in the same way it is to leaf mem > cgroups. > > But let's move the discussion forward to fix it. To avoid necessarily > accounting memory to the root mem cgroup, have we considered if it is even > necessary to address the root mem cgroup? For the users who opt-in to > this heuristic, would it be possible to discount the root mem cgroup from > the heuristic entirely so that oom kills originate from leaf mem cgroups? > Or, perhaps better, oom kill from non-memory.oom_group cgroups only if > the victim rss is greater than an eligible victim rss attached to the root > mem cgroup? David, I'm not pretending for implementing the best possible accounting for the root memory cgroup, and I'm sure there is a place for further enhancement. But if it's not leading to some obviously stupid victim selection (like ignoring leaking task, which consumes most of the memory), I don't see why it should be treated as a blocker for the whole patchset. I also doubt that any of us has these examples, and the best way to get them is to get some real usage feedback. Ignoring oom_score_adj, subtracting leaf usage sum from system usage etc, these all are perfect ideas which can be implemented on top of this patchset. > > > > For these reasons: unfair comparison of root mem cgroup usage to bias > > > against that mem cgroup from oom kill in system oom conditions, the > > > ability of users to completely evade the oom killer by attaching all > > > processes to child cgroups either purposefully or unpurposefully, and the > > > inability of userspace to effectively control oom victim selection: > > > > > > Nacked-by: David Rientjes <rientjes@xxxxxxxxxx> > > > > So, if we'll sum the oom_score of tasks belonging to the root memory cgroup, > > will it fix the problem? > > > > It might have some drawbacks as well (especially around oom_score_adj), > > but it's doable, if we'll ignore tasks which are not owners of their's mm struct. > > > > You would be required to discount oom_score_adj because the heuristic > doesn't account for oom_score_adj when comparing the anon + unevictable + > unreclaimable slab of leaf mem cgroups. This wouldn't result in the > correct victim selection in real-world scenarios where processes attached > to the root mem cgroup are vital to the system and not part of any user > job, i.e. they are important system daemons and the "activity manager" > responsible for orchestrating the cgroup hierarchy. > > It's also still unfair because it now compares > [sum of rss of processes attached to a cgroup] to > [anon + unevictable + unreclaimable slab usage of a cgroup]. RSS isn't > going to be a solution, regardless if its one process or all processes, if > it's being compared to more types of memory in leaf cgroups. > > If we really don't want root mem cgroup accounting so this is a fair > comparison, I think the heuristic needs to special case the root mem > cgroup either by discounting root oom kills if there are eligible oom > kills from leaf cgroups (the user would be opting-in to this behavior) or > comparing the badness of a victim from a leaf cgroup to the badness of a > victim from the root cgroup when deciding which to kill and allow the user > to protect root mem cgroup processes with oom_score_adj. > > That aside, all of this has only addressed one of the three concerns with > the patchset. > > I believe the solution to avoid allowing users to circumvent oom kill is > to account usage up the hierarchy as you have done in the past. Cgroup > hierarchies can be managed by the user so they can create their own > subcontainers, this is nothing new, and I would hope that you wouldn't > limit your feature to only a very specific set of usecases. That may be > your solution for the root mem cgroup itself: if the hierarchical usage of > all top-level mem cgroups is known, it's possible to find the root mem > cgroup usage by subtraction, you are using stats that are global vmstats > in your heuristic. > > Accounting usage up the hierarchy avoids the first two concerns with the > patchset. It allows you to implicitly understand the usage of the root > mem cgroup itself, and does not allow users to circumvent oom kill by > creating subcontainers, either purposefully or not. The third concern, > userspace influence, can allow users to attack leaf mem cgroups deeper in > the tree if it is using more memory than expected, but the hierarchical > usage is lower at the top-level. That is the only objection that I have > seen to using hierarchical usage: there may be a single cgroup deeper in > the tree that avoids oom kill because another hierarchy has a higher > usage. This can trivially be addressed either by oom priorities or an > adjustment, just like oom_score_adj, on cgroup usage. As I've said, I barely understand how the exact implementation of root memory cgroup accounting is considered a blocker for the whole feature. The same is true for oom priorities: it's something that can and should be implemented on top of the basic semantics, introduced by this patchset. So, the only real question is the way how we find a victim memcg in the subtree: by performing independent election on each level or by searching tree-wide. We all had many discussion around, and as you remember, initially I was supporting the first option. But then Michal provided a very strong argument: if you have 3 similar workloads in A, B and C, but for non-memory-related reasons (e.g. cpu time sharing) you have to join A and B into a group D: /\ D C / \ A B it's strange to penalize A and B for it. It looks to me that you're talking about the similar case, but you consider this hierarchy useful. So, overall, it seems to be depending on exact configuration. I have to add, that if you can enable memory.oom_group, your problem doesn't exist. The selected approach is easy extendable into hierarchical direction: as I've said before, we can introduce a new value of memory.oom_group, which will enable cumulative accounting without mass killing. And, tbh, I don't see how oom_priorities will resolve an opposite problem if we'd take the hierarchical approach. Thanks! -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html