On Wed, 1 Aug 2018, Roman Gushchin wrote: > Ok, I think that what we'll do here: > 1) drop the current cgroup-aware OOM killer implementation from the mm tree > 2) land memory.oom.group to the mm tree (your ack will be appreciated) > 3) discuss and, hopefully, agree on memory.oom.policy interface > 4) land memory.oom.policy > Yes, I'm fine proceeding this way, there's a clear separation between the policy and mechanism and they can be introduced independent of each other. As I said in my patchset, we can also introduce policies independent of each other and I have no objection to your design that addresses your specific usecase, with your own policy decisions, with the added caveat that we do so in a way that respects other usecases. Specifically, I would ask that the following be respected: - Subtrees delegated to users can still operate as they do today with per-process selection (largest, or influenced by oom_score_adj) so their victim selection is not changed out from under them. This requires the entire hierarchy is not locked into a specific policy, and also that a subtree is not locked in a specific policy. In other words, if an oom condition occurs in a user-controlled subtree they have the ability to get the same selection criteria as they do today. - Policies are implemented in a way that has an extensible API so that we do not unnecessarily limit or prohibit ourselves from making changes in the future or from extending the functionality by introducing other policy choices that are needed in the future. I hope that I'm not being unrealistic in assuming that you're fine with these since it can still preserve your goals. > Basically, with oom.group separated everything we need is another > boolean knob, which means that the memcg should be evaluated together. In a cgroup-aware oom killer world, yes, we need the ability to specify that the usage of the entire subtree should be compared as a single entity with other cgroups. That is necessary for user subtrees but may not be necessary for top-level cgroups depending on how you structure your unified cgroup hierarchy. So it needs to be configurable, as you suggest, and you are correct it can be different than oom.group. That's not the only thing we need though, as I'm sure you were expecting me to say :) We need the ability to preserve existing behavior, i.e. process based and not cgroup aware, for subtrees so that our users who have clear expectations and tune their oom_score_adj accordingly based on how the oom killer has always chosen processes for oom kill do not suddenly regress. So we need to define the policy for a subtree that is oom, and I suggest we do that as a characteristic of the cgroup that is oom ("process" vs "cgroup", and process would be the default to preserve what currently happens in a user subtree). Now, as users who rely on process selection are well aware, we have oom_score_adj to influence the decision of which process to oom kill. If our oom subtree is cgroup aware, we should have the ability to likewise influence that decision. For example, we have high priority applications that run at the top-level that use a lot of memory and strictly oom killing them in all scenarios because they use a lot of memory isn't appropriate. We need to be able to adjust the comparison of a cgroup (or subtree) when compared to other cgroups. I've also suggested, but did not implement in my patchset because I was trying to define the API and find common ground first, that we have a need for priority based selection. In other words, define the priority of a subtree regardless of cgroup usage. So with these four things, we have - an "oom.policy" tunable to define "cgroup" or "process" for that subtree (and plans for "priority" in the future), - your "oom.evaluate_as_group" tunable to account the usage of the subtree as the cgroup's own usage for comparison with others, - an "oom.adj" to adjust the usage of the cgroup (local or subtree) to protect important applications and bias against unimportant applications. This adds several tunables, which I didn't like, so I tried to overload oom.policy and oom.evaluate_as_group. When I referred to separating out the subtree usage accounting into a separate tunable, that is what I have referenced above. So when a cgroup is oom, oom.policy defines the selection. The cgroup here could be root for when the system is oom. If "process", nothing else matters, we iterate and find the largest process (modulo oom_score_adj) and kill it. We then look at oom.group and determine if additional processes should be oom killed. If "cgroup", we determine the local usage of each cgroup in the subtree. If oom.evaluate_as_group is enabled for a cgroup, we add the usage from each cgroup in the subtree to that cgroup. We then add oom.adj, which can be positive or negative, for the cgroup's overall score. Each cgroup then has a score that can be compared fairly to one another and the oom kill can occur. We then look at oom.group and determine if additional processes should be oom killed. With plans for an oom.policy of "priority", I would define that priority in oom.adj. Here, oom.evaluate_as_group can still be useful, which is great. If smaller priorities means higher preference for oom kill, we compare the oom.adj of all direct children and iterate the smallest. If oom.evaluate_as_group is set, the smallest oom.adj from the subtree is used. This is how I envisioned the functionality of the cgroup aware oom killer when I wrote my patchset and would be happy to hear your input or suggestions on it.