Re: an argument for keeping oom_control in cgroups v2

Michal Hocko <mhocko@xxxxxxxx> · Tue, 23 Aug 2022 07:06:01 +0200

On Mon 22-08-22 17:22:53, Tejun Heo wrote:
> (cc'ing memcg folks for visibility)
> 
> On Mon, Aug 22, 2022 at 08:04:02AM -0400, Chris Frey wrote:
> > In cgroups v1 we had:
> > 
> > 	memory.soft_limit_in_bytes
> > 	memory.limit_in_bytes
> > 	memory.memsw.limit_in_bytes
> > 	memory.oom_control
> > 
> > Using these features, we could achieve:
> > 
> > 	- cause programs that were memory hungry to suffer performance, but
> > 	  not stop (soft limit)

In v2, memory.high achieves a similar effect, with much more sensible
semantics and implementation.
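
For illustration, a minimal sketch of setting that threshold from
Python. It assumes a cgroup v2 hierarchy and takes the cgroup directory
as a parameter; the helper name set_memory_high is made up for this
example and is not part of any kernel interface. Only the file name
memory.high is the actual v2 knob.

```python
import os


def set_memory_high(cgroup_dir: str, limit_bytes: int) -> str:
    """Write a throttling threshold to <cgroup_dir>/memory.high.

    Hypothetical helper: cgroup_dir is whatever cgroup v2 directory the
    caller already manages; nothing here discovers or creates it.
    """
    if limit_bytes < 0:
        raise ValueError("limit_bytes must be non-negative")
    path = os.path.join(cgroup_dir, "memory.high")
    # Above this threshold the kernel throttles the group and puts it
    # under heavy reclaim pressure; exceeding memory.high by itself
    # does not invoke the OOM killer.
    with open(path, "w") as f:
        f.write(str(limit_bytes))
    return path
```

Usage would look like set_memory_high("/sys/fs/cgroup/myjob", 512 * 1024 * 1024),
where /sys/fs/cgroup/myjob is an assumed path, not one taken from this
thread.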

> > 	- cause programs to swap before the system actually ran out of memory
> > 	  (limit)

Not sure what this is supposed to mean.

> > 	- cause programs to be OOM-killed if they used too much swap
> > 	  (memsw.limit...)

There is an explicit swap limit (memory.swap.max). It is true that the
semantics are different, but do you have an example where you cannot
achieve what you need with the swap limit?
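
To make the difference concrete: v1 memory.memsw.limit_in_bytes caps
memory plus swap together, while the v2 swap limit (memory.swap.max)
caps swap alone, next to the hard memory limit in memory.max. A rough
sketch of translating a v1 pair into v2 values follows. The function
name is hypothetical, and the subtraction is only an approximation of
the v1 behaviour, not an exact equivalent.

```python
def v1_to_v2_limits(limit_in_bytes: int, memsw_limit_in_bytes: int) -> dict:
    """Map v1 (limit, memsw limit) to approximate v2 file contents.

    Hypothetical helper for illustration. v1 memsw counts memory and
    swap combined; v2 accounts swap separately, so the swap budget is
    taken as the difference between the two v1 values.
    """
    if limit_in_bytes < 0 or memsw_limit_in_bytes < 0:
        raise ValueError("limits must be non-negative")
    if memsw_limit_in_bytes < limit_in_bytes:
        raise ValueError("memsw limit cannot be below the memory limit")
    return {
        # Hard limit on memory usage of the group.
        "memory.max": str(limit_in_bytes),
        # Swap-only limit; v1 had no direct counterpart for this.
        "memory.swap.max": str(memsw_limit_in_bytes - limit_in_bytes),
    }
```

The returned mapping is keyed by file name, so a caller could write
each value into the matching file of a cgroup v2 directory.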

> > 
> > 	- cause programs to halt instead of get killed (oom_control)
> > 
> > That last feature is something I haven't seen duplicated in the settings
> > for cgroups v2.  In terms of handling a truly non-malicious memory hungry
> > program, it is a feature that has no equal, because the user may require
> > time to free up memory elsewhere before allocating more to the program,
> > and he may not want the performance degradation, nor the loss of work,
> > that comes from the other options.

Yes, this functionality is no longer available in v2. One reason is
that the implementation had to be considerably reduced, so that it only
blocks on OOM for page faults triggered from user space; see
3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on
OOM"). The primary reason is, as Tejun indicated, that we cannot simply
block an arbitrary kernel code path and wait for userspace, because
that is a potential DoS on the rest of the system and on unrelated
workloads, which trivially breaks workload separation.

This means that many other kernel paths which can cause a memcg OOM
cannot be blocked, and so the feature is severely crippled. To support
this feature we would essentially need, for every allocation (charging)
kernel path, a safe place to wait for userspace: one where no locks are
held and yet the allocation failure has not been observed. That is not
feasible.

Hope this helps clarify things.
-- 
Michal Hocko
SUSE Labs