On Mon, Oct 02, 2017 at 02:24:34PM +0200, Michal Hocko wrote:
> On Sun 01-10-17 16:29:48, Shakeel Butt wrote:
> > >
> > > Going back to Michal's example, say the user configured the following:
> > >
> > >          root
> > >         /    \
> > >        A      D
> > >       / \
> > >      B   C
> > >
> > > A global OOM event happens and we find this:
> > > - A > D
> > > - B, C, D are oomgroups
> > >
> > > What the user is telling us is that B, C, and D are compound memory
> > > consumers. They cannot be divided into their task parts from a memory
> > > point of view.
> > >
> > > However, the user doesn't say the same for A: the A subtree summarizes
> > > and controls aggregate consumption of B and C, but without groupoom
> > > set on A, the user says that A is in fact divisible into independent
> > > memory consumers B and C.
> > >
> > > If we don't have to kill all of A, but we'd have to kill all of D,
> > > does it make sense to compare the two?
> > >
> >
> > I think Tim has given a very clear explanation of why comparing A & D
> > makes perfect sense. However, I think the above example, a single-user
> > system where a user has designed and created the whole hierarchy and
> > then attaches different jobs/applications to different nodes in this
> > hierarchy, is also a valid scenario.
>
> Yes, and nobody is disputing that, really. I guess the main disconnect
> here is that different people want to have more detailed control over
> the victim selection, while the patchset tries to handle the most
> simplistic scenario, where no userspace control over the selection is
> required. And I would claim that this will be the vast majority of
> setups, and we should address it first.
>
> A more fine-grained control needs some more thinking to come up with a
> sensible and long-term sustainable API. Just look back at the
> oom_score_adj story and see how it ended up unusable in the end (well,
> apart from the never/always kill corner cases). Let's not repeat that
> again now.
>
> I strongly believe that we can come up with something - be it priority
> based, BPF based or module based selection. But let's start simple with
> the most basic scenario first, with the most sensible semantics
> implemented.

Totally agree.

> I believe the latest version (v9) looks sensible from the semantic point
> of view and we should focus on making it into a mergeable shape.

The only thing is that after some additional thinking I no longer believe
that implicit propagation of oom_group is a good idea.

Let me explain: assume we have memcg A with memory.max and
memory.oom_group set, and nested memcg A/B with memory.max set.
Let's imagine we have an OOM event in A/B. What is the expected system
behavior? We have an OOM scoped to A/B, and any action should also be
scoped to A/B. We really shouldn't touch processes which do not belong
to A/B. That means we should either kill the biggest process in A/B, or
all processes in A/B. It's natural to make A/B/memory.oom_group
responsible for this decision. It's strange to make it depend on
A/memory.oom_group, IMO. That really makes no sense, and it makes the
oom_group knob really hard to describe.

Also, after some off-list discussion, we've realized that the
memory.oom_group knob should be delegatable. The workload should have
control over it to express dependencies between its processes.

Thanks!
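
For concreteness, here is a minimal sketch of the A / A/B layout described
above, set up through the cgroup2 filesystem. It is purely illustrative and
not part of the patchset: it assumes cgroup2 is mounted at /sys/fs/cgroup,
root privileges, an available memory controller, and the memory.oom_group
name used by the series under discussion; the limit values are made up.

/*
 * Illustrative only: build memcg A (memory.max + memory.oom_group set)
 * with nested memcg A/B (its own memory.max).  An OOM hitting
 * A/B/memory.max is scoped to A/B, so whether the whole group in A/B is
 * killed should be decided by A/B/memory.oom_group, not by A's setting.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static int write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* Make the memory controller available to A and A/B. */
	write_file("/sys/fs/cgroup/cgroup.subtree_control", "+memory");

	/* A limits the aggregate consumption and has oom_group set. */
	mkdir("/sys/fs/cgroup/A", 0755);
	write_file("/sys/fs/cgroup/A/cgroup.subtree_control", "+memory");
	write_file("/sys/fs/cgroup/A/memory.max", "200M");
	write_file("/sys/fs/cgroup/A/memory.oom_group", "1");

	/* A/B has its own limit and its own oom_group decision. */
	mkdir("/sys/fs/cgroup/A/B", 0755);
	write_file("/sys/fs/cgroup/A/B/memory.max", "100M");
	write_file("/sys/fs/cgroup/A/B/memory.oom_group", "0");

	return 0;
}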