On Mon, Jul 30, 2018 at 11:01:00AM -0700, Roman Gushchin wrote:
> For some workloads an intervention from the OOM killer
> can be painful. Killing a random task can bring
> the workload into an inconsistent state.
>
> Historically, there are two common solutions for this
> problem:
> 1) enabling panic_on_oom,
> 2) using a userspace daemon to monitor OOMs and kill
>    all outstanding processes.
>
> Both approaches have their downsides:
> rebooting on each OOM is an obvious waste of capacity,
> and handling all in userspace is tricky and requires
> a userspace agent, which will monitor all cgroups
> for OOMs.
>
> In most cases an in-kernel after-OOM cleaning-up
> mechanism can eliminate the necessity of enabling
> panic_on_oom. Also, it can simplify the cgroup
> management for userspace applications.
>
> This commit introduces a new knob for cgroup v2 memory
> controller: memory.oom.group. The knob determines
> whether the cgroup should be treated as a single
> unit by the OOM killer. If set, the cgroup and its
> descendants are killed together or not at all.
>
> To determine which cgroup has to be killed, we do
> traverse the cgroup hierarchy from the victim task's
> cgroup up to the OOMing cgroup (or root) and looking
> for the highest-level cgroup with memory.oom.group set.
>
> Tasks with the OOM protection (oom_score_adj set to -1000)
> are treated as an exception and are never killed.
>
> This patch doesn't change the OOM victim selection algorithm.
>
> Signed-off-by: Roman Gushchin <guro@xxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> Cc: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> Cc: Tejun Heo <tj@xxxxxxxxxx>

The semantics make sense to me and the code is straight-forward. With
Michal's other feedback incorporated, please feel free to add:

Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
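
For readers skimming the thread, the lookup described in the changelog
(walk from the victim's cgroup up to the OOMing cgroup or root, keep the
highest-level cgroup with memory.oom.group set, skip tasks with
oom_score_adj of -1000) can be modelled roughly as below. This is a toy
userspace sketch in Python, not the code from the patch: the Cgroup and
Task classes and the find_oom_group() and tasks_to_kill() helpers are
hypothetical names made up for illustration, and the real kernel logic
may treat edge cases such as the root cgroup differently.

```python
from dataclasses import dataclass
from typing import List, Optional

OOM_SCORE_ADJ_MIN = -1000  # tasks with this oom_score_adj are OOM-protected


@dataclass
class Cgroup:
    """Hypothetical stand-in for a memory cgroup."""
    name: str
    parent: Optional["Cgroup"] = None
    oom_group: bool = False  # models the memory.oom.group knob


@dataclass
class Task:
    """Hypothetical stand-in for a process."""
    pid: int
    cgroup: Cgroup
    oom_score_adj: int = 0


def find_oom_group(victim_cg: Cgroup,
                   oom_domain: Optional[Cgroup]) -> Optional[Cgroup]:
    """Walk upwards from the victim's cgroup to the OOMing cgroup
    (or to the root when oom_domain is None) and return the
    highest-level cgroup that has oom_group set, if any."""
    found = None
    cg = victim_cg
    while cg is not None:
        if cg.oom_group:
            found = cg  # a later (higher) match overrides a lower one
        if cg is oom_domain:
            break  # do not look above the OOMing cgroup
        cg = cg.parent
    return found


def in_subtree(cg: Optional[Cgroup], ancestor: Cgroup) -> bool:
    """True if cg is ancestor itself or one of its descendants."""
    while cg is not None:
        if cg is ancestor:
            return True
        cg = cg.parent
    return False


def tasks_to_kill(victim: Task, oom_domain: Optional[Cgroup],
                  all_tasks: List[Task]) -> List[Task]:
    """Return the tasks killed for this OOM event in the toy model.
    The victim is assumed to have been picked by the unchanged
    selection algorithm."""
    group = find_oom_group(victim.cgroup, oom_domain)
    if group is None:
        return [victim]  # no group semantics requested: kill only the victim
    return [t for t in all_tasks
            if in_subtree(t.cgroup, group)
            and t.oom_score_adj != OOM_SCORE_ADJ_MIN]
```

In this model, setting oom_group on both a parent and a child cgroup
results in the parent's whole subtree being killed when the OOM is
scoped at or above the parent, while an OOM scoped at the child stops
the walk there and only the child's subtree is affected.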