On Tue, Oct 24, 2017 at 07:55:58PM +0200, Michal Hocko wrote:
> On Tue 24-10-17 13:23:30, Johannes Weiner wrote:
> > On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote:
> [...]
> > > What would prevent a runaway in case the only process in the memcg is
> > > oom unkillable then?
> >
> > In such a scenario, the page fault handler would busy-loop right now.
> >
> > Disabling oom kills is a privileged operation with dire consequences
> > if used incorrectly. You can panic the kernel with it. Why should the
> > cgroup OOM killer implement protective semantics around this setting?
> > Breaching the limit in such a setup is entirely acceptable.
> >
> > Really, I think it's an enormous mistake to start modeling semantics
> > based on the most contrived and non-sensical edge case configurations.
> > Start the discussion with what is sane and what most users should
> > optimally experience, and keep the cornercases simple.
>
> I am not really seeing your concern about the semantic. The most
> important property of the hard limit is to protect from runaways and
> stop them if they happen. Users can use the softer variant (high limit)
> if they are not afraid of those scenarios. It is not so insane to
> imagine that a master task (which I can easily imagine would be oom
> disabled) has a leak and runaway as a result.

Then you're screwed either way. Where do you return -ENOMEM in a page
fault path that cannot OOM kill anything? Your choice is between
maintaining the hard limit semantics or going into an infinite loop.

I fail to see how this setup has any impact on the semantics we pick
here. And even if it were real, it's really not what most users do.

> We are not talking only about the page fault path. There are other
> allocation paths to consume a lot of memory and spill over and break
> the isolation restriction. So it makes much more sense to me to fail
> the allocation in such a situation rather than allow the runaway to
> continue. Just consider that such a situation shouldn't happen in
> the first place because there should always be an eligible task to
> kill - who would own all the memory otherwise?

Okay, then let's just stick to the current behavior.
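
For anyone following along, the options argued over in this thread can be put side by side in a toy userspace model. This is only a sketch for illustration, not kernel code: ToyCgroup, Policy and try_charge are hypothetical names, the "OOM kill" simply drops the pages of killable tasks, and the retry cap exists only so that the looping case terminates in a demo.

```python
import errno
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    """What to do when the limit is hit and nothing can be killed (toy model)."""
    FAIL = "fail"      # hand -ENOMEM back to the caller
    BREACH = "breach"  # let the charge go through and exceed the hard limit
    RETRY = "retry"    # keep retrying, as a faulting task would


@dataclass
class ToyCgroup:
    limit: int               # hard limit, in pages
    usage: int = 0           # pages currently charged
    killable_pages: int = 0  # pages owned by tasks the OOM killer may kill


def try_charge(cg: ToyCgroup, pages: int, policy: Policy, max_retries: int = 3) -> int:
    """Charge `pages` to `cg`. Returns 0 on success or -errno.ENOMEM.

    Raises RuntimeError when the RETRY policy makes no progress, standing in
    for the busy loop described in the thread.
    """
    attempts = 0
    while True:
        if cg.usage + pages <= cg.limit:
            cg.usage += pages
            return 0
        if cg.killable_pages > 0:
            # An eligible victim exists: "kill" it, free its pages, retry.
            cg.usage -= min(cg.usage, cg.killable_pages)
            cg.killable_pages = 0
            continue
        # Over the limit and every task is OOM-disabled.
        if policy is Policy.FAIL:
            return -errno.ENOMEM
        if policy is Policy.BREACH:
            cg.usage += pages
            return 0
        attempts += 1
        if attempts >= max_retries:
            raise RuntimeError("no progress: charge would loop forever")


if __name__ == "__main__":
    full = ToyCgroup(limit=10, usage=10)
    print(try_charge(full, 1, Policy.FAIL), full.usage)
    print(try_charge(full, 1, Policy.BREACH), full.usage)
```

With an eligible victim present, all three policies behave the same, which is the point made at the end of the quoted text: the policies only diverge in the configuration where nothing in the group can be killed.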