On Mon, Aug 06, 2018 at 02:34:06PM -0700, David Rientjes wrote: > On Wed, 1 Aug 2018, Roman Gushchin wrote: > > > Ok, I think that what we'll do here: > > 1) drop the current cgroup-aware OOM killer implementation from the mm tree > > 2) land memory.oom.group to the mm tree (your ack will be appreciated) > > 3) discuss and, hopefully, agree on memory.oom.policy interface > > 4) land memory.oom.policy > > > > Yes, I'm fine proceeding this way, there's a clear separation between the > policy and mechanism and they can be introduced independent of each other. > As I said in my patchset, we can also introduce policies independent of > each other and I have no objection to your design that addresses your > specific usecase, with your own policy decisions, with the added caveat > that we do so in a way that respects other usecases. > > Specifically, I would ask that the following be respected: > > - Subtrees delegated to users can still operate as they do today with > per-process selection (largest, or influenced by oom_score_adj) so > their victim selection is not changed out from under them. This > requires the entire hierarchy is not locked into a specific policy, > and also that a subtree is not locked in a specific policy. In other > words, if an oom condition occurs in a user-controlled subtree they > have the ability to get the same selection criteria as they do today. > > - Policies are implemented in a way that has an extensible API so that > we do not unnecessarily limit or prohibit ourselves from making changes > in the future or from extending the functionality by introducing other > policy choices that are needed in the future. > > I hope that I'm not being unrealistic in assuming that you're fine with > these since it can still preserve your goals. > > > Basically, with oom.group separated everything we need is another > > boolean knob, which means that the memcg should be evaluated together. > > In a cgroup-aware oom killer world, yes, we need the ability to specify > that the usage of the entire subtree should be compared as a single > entity with other cgroups. That is necessary for user subtrees but may > not be necessary for top-level cgroups depending on how you structure your > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > and you are correct it can be different than oom.group. > > That's not the only thing we need though, as I'm sure you were expecting > me to say :) > > We need the ability to preserve existing behavior, i.e. process based and > not cgroup aware, for subtrees so that our users who have clear > expectations and tune their oom_score_adj accordingly based on how the oom > killer has always chosen processes for oom kill do not suddenly regress. Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing this case? This basically means that if memcg is selected as target, the process inside will be selected using traditional per-process approach. > So we need to define the policy for a subtree that is oom, and I suggest > we do that as a characteristic of the cgroup that is oom ("process" vs > "cgroup", and process would be the default to preserve what currently > happens in a user subtree). I'm not entirely convinced here. I do agree, that some sub-tree may have a well tuned oom_score_adj, and it's preferable to keep the current behavior. At the same time I don't like the idea to look at the policy of the OOMing cgroup. Why exceeding of one limit should be handled different to exceeding of another? This seems to be a property of workload, not a limit. > > Now, as users who rely on process selection are well aware, we have > oom_score_adj to influence the decision of which process to oom kill. If > our oom subtree is cgroup aware, we should have the ability to likewise > influence that decision. For example, we have high priority applications > that run at the top-level that use a lot of memory and strictly oom > killing them in all scenarios because they use a lot of memory isn't > appropriate. We need to be able to adjust the comparison of a cgroup (or > subtree) when compared to other cgroups. > > I've also suggested, but did not implement in my patchset because I was > trying to define the API and find common ground first, that we have a need > for priority based selection. In other words, define the priority of a > subtree regardless of cgroup usage. > > So with these four things, we have > > - an "oom.policy" tunable to define "cgroup" or "process" for that > subtree (and plans for "priority" in the future), > > - your "oom.evaluate_as_group" tunable to account the usage of the > subtree as the cgroup's own usage for comparison with others, > > - an "oom.adj" to adjust the usage of the cgroup (local or subtree) > to protect important applications and bias against unimportant > applications. > > This adds several tunables, which I didn't like, so I tried to overload > oom.policy and oom.evaluate_as_group. When I referred to separating out > the subtree usage accounting into a separate tunable, that is what I have > referenced above. IMO, merging multiple tunables into one doesn't make it saner. The real question how to make a reasonable interface with fever tunables. The reason behind introducing all these knobs is to provide a generic solution to define OOM handling rules, but then the question raises if the kernel is the best place for it. I really doubt that an interface with so many knobs has any chances to be merged. IMO, there should be a compromise between the simplicity (basically, the number of tunables and possible values) and functionality of the interface. You nacked my previous version, and unfortunately I don't have anything better so far. Thanks!