On Wed, 31 Jan 2018, Michal Hocko wrote:

> > > > >     root
> > > > >    /  |  \
> > > > >   A   B   C
> > > > >  / \     / \
> > > > > D   E   F   G
> > > > >
> > > > > Assume A: cgroup, B: oom_group=1, C: tree, G: oom_group=1
> > > >
> > > > At each level of the hierarchy, memory.oom_policy compares immediate
> > > > children, it's the only way that an admin can lock in a specific oom
> > > > policy like "tree" and then delegate the subtree to the user.  If
> > > > you've configured it as above, comparing A and C should be the same
> > > > based on the cumulative usage of their child mem cgroups.
>
> It seems I am still not clear with my question. What kind of difference
> does policy=cgroup vs. none make on A? Also what kind of difference
> does it make when a leaf node has cgroup policy?
>

If A has an oom policy of "cgroup", it compares the local usage of D vs
E; "tree" would behave the same since neither descendant has child
cgroups.  If A has an oom policy of "none", it compares the processes
attached to D and E and respects /proc/pid/oom_score_adj.  This allows
opting in to and out of cgroup aware selection not only for the whole
system but also per subtree.

> > Hmm, I'm not sure why we would limit memory.oom_group to any policy.
> > Even if we are selecting a process, even without selecting cgroups as
> > victims, killing a process may still render an entire cgroup useless
> > and it makes sense to kill all processes in that cgroup.  If an
> > unlucky process is selected with today's heuristic of oom_badness()
> > or with a "none" policy with my patchset, I don't see why we can't
> > enable the user to kill all other processes in the cgroup.  It may
> > not make sense for some trees, but I think it could be useful for
> > others.
>
> My intuition screams here. I will think about this some more but I
> would be really curious about any sensible usecase when you want to
> sacrifice the whole gang just because one process, compared to other
> processes or cgroups, is too large. Do you see how you are mixing
> entities here?
>

It's a property of the workload that has nothing to do with selection.
Regardless of how a victim is selected, we need a victim.  That victim
may be able to tolerate the loss of the process, which may not even be
the largest memory hogging process because of /proc/pid/oom_score_adj
(periodic cleanups, logging, and stat collection are what I'm most
familiar with).  It may also be vital to the workload, in which case
it's better to kill the entire job; it's highly dependent on what the
job is.  There's a general usecase for memory.oom_group behavior
without any selection changes: we've had a killall tunable for years
and it is used by many customers for the same reason.  There's no
reason for it to be coupled, it can exist independent of any cgroup
aware selection.

> I do not understand. Get back to our example. Are you saying that G
> with none will enforce the none policy to C and root? If yes then this
> doesn't make any sense because you are not really able to delegate the
> oom policy down the tree at all. It would effectively make tree policy
> pointless.
>

The oom policy of G is pointless: it has no child cgroups.  It can be
"none", "cgroup", or "tree"; it doesn't matter because G is not the
root of a subtree.  (The oom policy of the root mem cgroup is likewise
irrelevant if there are no other cgroups, same thing.)  If G is oom, it
kills its largest process, or everything if memory.oom_group is set,
which in your example it is.
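To make that comparison concrete, here is a minimal userspace model of
the selection semantics described above.  This is only a sketch, not
code from the patchset: select_victim(), subtree_usage(), and this
struct memcg are made up for illustration, and the usage numbers are
arbitrary.

#include <stddef.h>
#include <stdio.h>

enum oom_policy { POLICY_NONE, POLICY_CGROUP, POLICY_TREE };

struct memcg {
        const char *name;
        enum oom_policy policy;
        unsigned long local_usage;      /* MB charged to this cgroup only */
        struct memcg *children[4];
        size_t nr_children;
};

/* A child with "tree" set is evaluated by its subtree's cumulative usage. */
static unsigned long subtree_usage(const struct memcg *cg)
{
        unsigned long sum = cg->local_usage;
        size_t i;

        for (i = 0; i < cg->nr_children; i++)
                sum += subtree_usage(cg->children[i]);
        return sum;
}

/*
 * Pick a victim memcg among the children of the memcg that is oom.
 * Returns NULL for policy "none", i.e. fall back to per-process
 * selection that respects /proc/pid/oom_score_adj.
 */
static struct memcg *select_victim(const struct memcg *oom_root)
{
        struct memcg *victim = NULL;
        unsigned long max = 0;
        size_t i;

        if (oom_root->policy == POLICY_NONE)
                return NULL;

        for (i = 0; i < oom_root->nr_children; i++) {
                struct memcg *child = oom_root->children[i];
                unsigned long usage = child->policy == POLICY_TREE ?
                                      subtree_usage(child) :
                                      child->local_usage;

                if (usage > max) {
                        max = usage;
                        victim = child;
                }
        }
        return victim;
}

int main(void)
{
        struct memcg d = { "D", POLICY_NONE, 300, { NULL }, 0 };
        struct memcg e = { "E", POLICY_NONE, 500, { NULL }, 0 };
        struct memcg a = { "A", POLICY_CGROUP, 0, { &d, &e }, 2 };
        struct memcg *victim = select_victim(&a);

        /* With "cgroup" on A, E (500) is compared against D (300). */
        printf("victim: %s\n", victim ? victim->name : "(per-process)");
        return 0;
}

Flip A's policy to POLICY_NONE in main() and the model falls back to
per-process selection, which is exactly the per-subtree opt out
described above.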
> I am skipping the rest of the following text because it is picking
> on details and the whole design is not clear to me. So could you start
> over documenting the semantics and requirements. Ideally by describing:
> - how does the policy on the root of the OOM hierarchy control the
>   selection policy

If "none", there's no difference from Linus's tree right now.  If
"cgroup", it enables cgroup aware selection: it compares all cgroups on
the system wrt local usage, unless a cgroup has "tree" set, in which
case its usage is hierarchical.

> - how does the per-memcg policy act during the tree walk - for both
>   intermediate nodes and leaves

The oom policy is determined by the mem cgroup that is under oom, i.e.
the root of the subtree that is oom, and its policy dictates how to
select a victim mem cgroup.

> - how does the oom killer act based on the selected memcg

That's the point about memory.oom_group: once a cgroup has been
selected (if cgroup aware behavior is enabled for the oom subtree,
which could be the root), a memory hogging process attached to that
subtree is killed, or everything is killed if memory.oom_group is
enabled.

> - how do you compare tasks with memcgs
>

You don't.  I think the misunderstanding is about what happens if the
root of a subtree is "cgroup", for example, and a descendant has "none"
enabled.  The root is under oom, so it is comparing cgroups :)  "None"
is only effective when that subtree root is itself oom, where process
usage is considered.

The point is that all the functionality available in -mm is still
available: just dictate "cgroup" everywhere, and make it a decision
that can change per subtree, if necessary, without any mount option
that would become obsolete.  Then, make memory.oom_group possible
without any specific selection policy, since it's useful on its own.

Let me give you a concrete example based on your earlier /admins,
/teachers, /students example.  We oversubscribe the /students subtree
for the case where /admins and /teachers aren't using the memory.  We
say 100 students can use 1GB each, but the limit of /students is
actually 200GB.  100 students using 1GB each won't cause a system oom;
we control that with the limits of /admins and /teachers.  But we allow
using memory that isn't in use by /admins and /teachers if it's there,
opening overconsumers up to the possibility of oom kill.  (This is a
real world example with batch job scheduling, it's anything but
hypothetical.)

/students/michal logs in, and he has complete control over his subtree.
He's going to start several jobs, all in their own cgroups, with usage
well over 1GB, but if he's oom killed he wants the right thing oom
killed.

Obviously this completely breaks down with the -mm functionality: if he
has 10 jobs using 512MB each, another student using more than 1GB who
isn't using cgroups is going to be oom killed instead, even though
michal is using 5GB.  We've discussed that ad nauseam, and it is why I
introduced "tree".

But now look at the API.  /students/michal is using child cgroups, but
which selection policy is in effect?  Will it kill the most memory
hogging process in his subtree, or the most memory hogging process from
the most memory hogging cgroup?  It's an important distinction, because
how he constructs his hierarchy depends directly on it: if he is locked
into one selection logic, the least important job *must* be in the
highest consuming cgroup; otherwise, his /proc/pid/oom_score_adj is
respected.  He *must* query the mount option.
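To make the /students/michal case concrete, here is a rough sketch of
how that subtree could be configured with the proposed per-subtree
interface rather than a mount option.  The paths and the "batch" job
cgroup are made up, and memory.oom_policy and memory.oom_group are the
knobs being discussed in this thread, so treat this as an illustration
of intent rather than a tested configuration.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch only: hypothetical paths, interface files proposed in this thread. */
static int cg_write(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
                perror(path);
                return -1;
        }
        if (write(fd, val, strlen(val)) < 0)
                perror(path);
        return close(fd);
}

int main(void)
{
        /* /students compares its child cgroups, one per student */
        cg_write("/sys/fs/cgroup/students/memory.oom_policy", "cgroup");
        /* michal's usage is evaluated hierarchically, so ten 512MB jobs
         * count as 5GB against students who don't use child cgroups */
        cg_write("/sys/fs/cgroup/students/michal/memory.oom_policy", "tree");
        /* if this particular job is chosen, kill it in its entirety */
        cg_write("/sys/fs/cgroup/students/michal/batch/memory.oom_group", "1");
        return 0;
}

Whoever manages /students (or michal himself, for his own subtree)
expresses this directly in the hierarchy; nothing depends on a
system-wide mount option.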
But now let's say that memory.oom_policy is merged, giving him the
control to do per-process, per-cgroup, or per-subtree oom killing based
on how he defines it.  The mount option doesn't mean anything anymore;
in fact, it can mean the complete opposite of what actually happens.
That's the direct objection to the mount option.

Since I have systems with thousands of cgroups in hundreds of cgroups
and over 100 workgroups that define, sometimes very creatively, how to
select oom victims, I'm an advocate for an extensible interface that is
useful for general purpose use, doesn't remove any functionality, and
doesn't have contradicting specifications.