Re: an argument for keeping oom_control in cgroups v2

Roman Gushchin <roman.gushchin@xxxxxxxxx> · Tue, 23 Aug 2022 09:10:37 -0700

On Tue, Aug 23, 2022 at 07:06:01AM +0200, Michal Hocko wrote:
> On Mon 22-08-22 17:22:53, Tejun Heo wrote:
> > (cc'ing memcg folks for visibility)
> > 
> > On Mon, Aug 22, 2022 at 08:04:02AM -0400, Chris Frey wrote:
> > > In cgroups v1 we had:
> > > 
> > > 	memory.soft_limit_in_bytes
> > > 	memory.limit_in_bytes
> > > 	memory.memsw.limit_in_bytes
> > > 	memory.oom_control
> > > 
> > > Using these features, we could achieve:
> > > 
> > > 	- cause programs that were memory hungry to suffer performance, but
> > > 	  not stop (soft limit)
> 
> There is memory.high, with much more sensible semantics and
> implementation, to achieve a similar thing.
> 
> > > 	- cause programs to swap before the system actually ran out of memory
> > > 	  (limit)
> 
> Not sure what this is supposed to mean.
> 
> > > 	- cause programs to be OOM-killed if they used too much swap
> > > 	  (memsw.limit...)
> 
> 
> There is an explicit swap limit. It is true that the semantics are
> different, but do you have an example where you cannot really achieve
> what you need with the swap limit?
> 
> > > 
> > > 	- cause programs to halt instead of get killed (oom_control)
> > > 
> > > That last feature is something I haven't seen duplicated in the settings
> > > for cgroups v2.  In terms of handling a truly non-malicious memory hungry
> > > program, it is a feature that has no equal, because the user may require
> > > time to free up memory elsewhere before allocating more to the program,
> > > and he may not want the performance degradation, nor the loss of work,
> > > that comes from the other options.
> 
> Yes, this functionality is not available in v2 anymore. One reason is
> that the implementation had to be considerably reduced to only block on
> OOM for user space triggered page faults, see 3812c8c8f395 ("mm: memcg:
> do not trap chargers with full callstack on OOM"). The primary reason
> is, as Tejun indicated, that we cannot simply block a random kernel code
> path and wait for userspace, because that is a potential DoS on the rest
> of the system and on unrelated workloads, which trivially breaks
> workload separation.
> 
> This means that many other kernel paths which can cause a memcg OOM
> cannot be blocked, and so the feature is severely crippled. To support
> this feature we would essentially need, for every allocation (charging)
> kernel path, a safe place to wait for userspace where no locks are held
> and the allocation failure has not yet been observed, and that is not
> feasible.

Btw, it's fairly easy to emulate the oom_control behaviour using cgroups v2:
a userspace agent can listen for the high/max events in memory.events and use
the cgroup v2 freezer to stop the workload and handle the OOM in v1
oom_control style. The agent can run at high/real-time priority, so I guess
the behavior will actually be quite close to the v1 experience. Much safer
though.
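
To make the idea concrete, here is a rough sketch of such an agent in Python.
It is untested against a real workload and only illustrates the approach: it
assumes the cgroup has the memory controller enabled, that the kernel is
recent enough to provide cgroup.freeze, and that poll() with POLLPRI is an
acceptable way to wait for memory.events changes (inotify would work as
well). The function names and the on_pressure callback are made up for this
example.

```python
import select
from pathlib import Path


def parse_events(text):
    """Parse the flat "key value" lines of memory.events into a dict."""
    counters = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key and value.strip().isdigit():
            counters[key] = int(value)
    return counters


def new_pressure(prev, cur, keys=("high", "max")):
    """Return True if any watched counter grew since the previous read."""
    return any(cur.get(k, 0) > prev.get(k, 0) for k in keys)


def set_frozen(cgroup, frozen):
    """Freeze or thaw the whole cgroup through cgroup.freeze."""
    (Path(cgroup) / "cgroup.freeze").write_text("1" if frozen else "0")


def watch(cgroup, on_pressure):
    """Freeze the cgroup on a new high/max event, then call on_pressure.

    on_pressure(cgroup, counters) is a hypothetical hook where the admin
    logic lives: free memory elsewhere, raise memory.max, then thaw with
    set_frozen(cgroup, False).
    """
    with open(Path(cgroup) / "memory.events") as f:
        prev = parse_events(f.read())
        poller = select.poll()
        # Assumption: changes to the file are reported as POLLPRI.
        poller.register(f, select.POLLPRI)
        while True:
            poller.poll()
            f.seek(0)
            cur = parse_events(f.read())
            if new_pressure(prev, cur):
                set_frozen(cgroup, True)
                on_pressure(cgroup, cur)
            prev = cur


# Example invocation (the cgroup path is hypothetical):
#   watch("/sys/fs/cgroup/myjob", lambda cg, ev: print("frozen:", cg, ev))
```

Note that the freeze happens in the agent, in process context, so no kernel
charge path is ever left waiting on userspace. The workload can still be
OOM-killed if it hits memory.max before the agent reacts, which is why
setting memory.high below memory.max and running the agent at high priority
matters.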

Thanks!