On Sat, 20 Jan 2018, Tejun Heo wrote:

> > Hearing no response, I'll implement this as a separate tunable in a v2
> > series assuming there are no better ideas proposed before next week. One
> > of the nice things about a separate tunable is that an admin can control
> > the overall policy and they can delegate the mechanism (killall vs one
> > process) to a user subtree. I agree with your earlier point that killall
> > vs one process is a property of the workload and is better defined
> > separately.
>
> If I understood your arguments correctly, the reasons that you thought
> your selection policy changes must go together with Roman's victim
> action were two-fold.
>
> 1. You didn't want a separate knob for group oom behavior and wanted
>    it to be combined with selection policy. I'm glad that you now
>    recognize that this would be the wrong design choice.
>

The memory.oom_action (or mechanism) file that I've proposed is different
from memory.oom_group: we want to provide a non-binary tunable to specify
what action the oom killer should take. That could be to kill all
processes in the subtree (similar to memory.oom_group), to kill only
processes in the local cgroup, or a different mechanism entirely.

I could propose the patchset backwards, if necessary, because
memory.oom_group is currently built upon a broken selection heuristic.
In other words, I could propose a memory.oom_action that specifies two
different mechanisms, both of which would be useful regardless of any
change to the selection function. However, since the mechanism is built
on top of the cgroup aware oom killer's policy, we can't merge it
currently without the broken logic.

> 2. The current selection policy may be exploited by a delegatee and
>    strictly hierarchical selection should be available. We can debate
>    the pros and cons of different heuristics; however, to me, the
>    following are clear.
>
> * Strictly hierarchical approach can't replace the current policy.
>   It doesn't work well for a lot of use cases.
>

-ECONFUSED. I haven't proposed any strict hierarchical approach here;
it's configurable by the user.

> * OOM victim selection policy has always been subject to changes
>   and improvements.
>

That's fine, but the selection policy introduced by any cgroup aware oom
killer is being specified now and is pretty clear cut: compare the usage
of cgroups equally based on certain criteria and choose the largest. I
don't see how that could be changed or improved once the end user starts
using it; the heuristic has become a policy.

A single cgroup aware policy doesn't work: if you don't have localized,
per-cgroup control, you have Michal's /admins and /students example; if
you don't have hierarchical, subtree control, you have my example of
users intentionally or unintentionally evading the selection logic by how
they arrange their cgroups. The policy needs to be defined for subtrees.
There hasn't been any objection to that, so introducing functionality
that adds a completely unnecessary and broken mount option, and that
forces users to configure cgroups in a manner that no longer makes sense
with subtree control, doesn't seem helpful.

> I don't see any blocker here. The issue you're raising can and should
> be handled separately.
>

It can't, because the current patchset locks the system into a single
selection criterion that is unnecessary, and the mount option would
become a no-op once the per-subtree policy becomes configurable by the
user as part of the hierarchy itself.
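To make the evasion argument above concrete, here is a toy userspace
sketch (not kernel code; the hierarchy, the usage numbers, and every name
in it are made up for illustration) contrasting a flat per-cgroup
comparison with one that compares subtrees at the top level:

/*
 * Hypothetical hierarchy, loosely based on the /admins and /students
 * example from this thread:
 *
 *                 root
 *                /    \
 *        /students    /admins
 *        /   |   \         \
 *      job1 job2 job3     backup
 *       2G   2G   2G        5G
 *
 * Comparing leaf cgroups individually picks "backup" (5G) even though
 * /students uses 6G in aggregate, so a user who splits a job across
 * enough child cgroups never looks like the largest victim.  Comparing
 * the root's children by subtree usage picks /students.
 */
#include <stdio.h>

struct cg {
    const char *name;
    long usage_mb;  /* memory charged directly to this cgroup */
    int parent;     /* index into tree[], -1 for the root */
};

static struct cg tree[] = {
    { "root",     0,    -1 },
    { "students", 0,     0 },
    { "admins",   0,     0 },
    { "job1",     2048,  1 },
    { "job2",     2048,  1 },
    { "job3",     2048,  1 },
    { "backup",   5120,  2 },
};
#define NR ((int)(sizeof(tree) / sizeof(tree[0])))

/* usage of a cgroup plus all of its descendants */
static long subtree_usage(int idx)
{
    long sum = tree[idx].usage_mb;

    for (int i = 0; i < NR; i++)
        if (tree[i].parent == idx)
            sum += subtree_usage(i);
    return sum;
}

static int is_leaf(int idx)
{
    for (int i = 0; i < NR; i++)
        if (tree[i].parent == idx)
            return 0;
    return 1;
}

int main(void)
{
    int flat = -1, hier = -1;

    /* policy A: compare every leaf cgroup on its own */
    for (int i = 0; i < NR; i++)
        if (is_leaf(i) && (flat < 0 || tree[i].usage_mb > tree[flat].usage_mb))
            flat = i;

    /* policy B: compare the root's children by aggregate subtree usage */
    for (int i = 0; i < NR; i++)
        if (tree[i].parent == 0 &&
            (hier < 0 || subtree_usage(i) > subtree_usage(hier)))
            hier = i;

    printf("flat comparison picks:    %s (%ld MB)\n",
           tree[flat].name, tree[flat].usage_mb);
    printf("subtree comparison picks: /%s (%ld MB)\n",
           tree[hier].name, subtree_usage(hier));
    return 0;
}

Per-subtree configurability would let an admin choose which of those
comparisons applies where in the hierarchy, instead of baking one of them
in behind a mount option.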
> Here, whether a workload can survive being killed piece-wise or not is
> an inherent property of the workload and a pretty binary one at that.
> I'm not necessarily against changing it to take string inputs but
> don't see rationales for doing so yet.
>

We don't need the extra level in the cgroup hierarchy that
memory.oom_group, as proposed, requires and that serves no other purpose.
It's perfectly valid for subtrees to run user executors, whereas a
"killall" mechanism is valid for some workloads and not others. We do
not need ancestor cgroups locking that decision into place.
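On the delegation point, a minimal sketch of what it could look like for
a job launcher, running inside its own delegated subtree, to mark only
the workloads that should die as a unit; the cgroup paths and the
launcher are hypothetical, and memory.oom_group is only the interface
name from the patches under discussion:

/*
 * Sketch only: paths are invented, and memory.oom_group is the proposed
 * per-cgroup knob from this thread, not a settled interface.
 */
#include <stdio.h>

static int set_oom_group(const char *cgroup, int enable)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path), "%s/memory.oom_group", cgroup);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%d\n", enable);
    return fclose(f);
}

int main(void)
{
    /* batch job: killing part of it leaves useless survivors */
    set_oom_group("/sys/fs/cgroup/user/jobs/batch-1", 1);

    /* executor: individual workers can be killed and respawned */
    set_oom_group("/sys/fs/cgroup/user/jobs/executor", 0);
    return 0;
}

The property attaches to the workload's own cgroup, so no ancestor level
has to exist purely to hold it.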