On Tue, Aug 23, 2022 at 07:06:01AM +0200, Michal Hocko wrote:
> On Mon 22-08-22 17:22:53, Tejun Heo wrote:
> > (cc'ing memcg folks for visibility)
> >
> > On Mon, Aug 22, 2022 at 08:04:02AM -0400, Chris Frey wrote:
> > > In cgroups v1 we had:
> > >
> > >   memory.soft_limit_in_bytes
> > >   memory.limit_in_bytes
> > >   memory.memsw.limit_in_bytes
> > >   memory.oom_control
> > >
> > > Using these features, we could achieve:
> > >
> > >   - cause programs that were memory hungry to suffer performance, but
> > >     not stop (soft limit)
>
> There is memory.high with a much more sensible semantic and
> implementation to achieve a similar thing.
>
> > >   - cause programs to swap before the system actually ran out of memory
> > >     (limit)
>
> Not sure what this is supposed to mean.
>
> > >   - cause programs to be OOM-killed if they used too much swap
> > >     (memsw.limit...)
> >
> There is an explicit swap limit. It is true that the semantic is
> different but do you have an example where you cannot really achieve
> what you need by the swap limit?
>
> > >
> > >   - cause programs to halt instead of get killed (oom_control)
> > >
> > > That last feature is something I haven't seen duplicated in the settings
> > > for cgroups v2. In terms of handling a truly non-malicious memory hungry
> > > program, it is a feature that has no equal, because the user may require
> > > time to free up memory elsewhere before allocating more to the program,
> > > and he may not want the performance degradation, nor the loss of work,
> > > that comes from the other options.
>
> Yes this functionality is not available in v2 anymore. One reason is
> that the implementation had to be considerably reduced to only block on
> OOM for user space triggered page faults 3812c8c8f395 ("mm: memcg: do
> not trap chargers with full callstack on OOM"). The primary reason is,
> as Tejun indicated, that we cannot simply block a random kernel code
> path and wait for userspace because that is a potential DoS on the rest
> of the system and unrelated workloads which is a trivial breakage of
> workload separation.
>
> This means that many other kernel paths which can cause memcg OOM cannot
> be blocked and so the feature is severely crippled. In order to allow for
> this feature we would essentially need a safe place to wait for the
> userspace for any allocation (charging) kernel path where no locks are
> held yet allocation failure is not observed and that is not feasible.

Btw, it's fairly easy to emulate the oom_control behaviour using cgroups v2:
a userspace agent can listen to memory.high/max events and use the cgroup v2
freezer to stop the workload and handle the oom in v1 oom_control style.
An agent can have a high/real-time priority, so I guess the behavior will be
actually quite close to the v1 experience. Much safer though.

Thanks!
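
For illustration only, here is a rough sketch of what such an agent could
look like. It assumes cgroup v2 is mounted and that the agent may read the
workload's memory.events and write its cgroup.freeze. The function names
(read_events, set_frozen, check_once, watch) are made up for this example,
and the polling loop is a simplification: memory.events generates a file
modified event on change, so a real agent would more likely block on that
instead of sleeping. memory.high should probably sit below memory.max,
so that "high" events arrive before the kernel OOM killer gets involved.

```python
import time
from pathlib import Path


def read_events(cgroup: Path) -> dict:
    """Parse memory.events ("key value" per line) into a dict of counters."""
    events = {}
    for line in (cgroup / "memory.events").read_text().splitlines():
        key, _, value = line.partition(" ")
        if value:
            events[key] = int(value)
    return events


def set_frozen(cgroup: Path, frozen: bool) -> None:
    """Freeze (1) or thaw (0) every task in the cgroup via cgroup.freeze."""
    (cgroup / "cgroup.freeze").write_text("1" if frozen else "0")


def check_once(cgroup: Path, last: dict, keys=("high", "max")):
    """Freeze the cgroup if any watched counter grew since `last`.

    Returns (current counters, whether the cgroup was frozen by this call).
    """
    now = read_events(cgroup)
    if any(now.get(k, 0) > last.get(k, 0) for k in keys):
        set_frozen(cgroup, True)
        return now, True
    return now, False


def watch(cgroup: Path, interval: float = 0.1) -> dict:
    """Poll until a high/max event shows up, freeze the workload, return.

    The caller (or an operator) then frees memory or raises the limits and
    calls set_frozen(cgroup, False) to let the workload continue.
    """
    last = read_events(cgroup)
    while True:
        last, froze = check_once(cgroup, last)
        if froze:
            return last
        time.sleep(interval)
```

Usage would be along the lines of watch(Path("/sys/fs/cgroup/workload")),
where the path is a placeholder for the actual cgroup, followed by
set_frozen(..., False) once memory has been made available.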