Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

David Rientjes <rientjes@xxxxxxxxxx> · Thu, 5 Oct 2017 14:53:02 -0700 (PDT)

On Thu, 5 Oct 2017, Johannes Weiner wrote:

> > It is, because it can quite clearly be a DoSand was prevented with 
> > Roman's earlier design of iterating usage up the hierarchy and comparing 
> > siblings based on that criteria.  I know exactly why he chose that 
> > implementation detail early on, and it was to prevent cases such as this 
> > and to not let userspace hide from the oom killer.
> 
> This doesn't address how it's different from a single process
> following the same pattern right now.
> 

Are you referring to a single process being rewritten into N different 
subprocesses that do the same work as the single process but is separated 
in this manner to avoid having a large rss for any single process to avoid  
being oom killed?

This is solved by a cgroup-aware oom killer because these subprocesses 
should not be able to escape their own chargable entity.  It's exactly the 
usecase that Roman is addressing, correct?  My suggestion is to continue 
to iterate the usage up the hierarchy so that users can't easily defeat 
this by creating N subcontainers instead.

> > Let's resolve that global oom is a real condition and getting into that 
> > situation is not a userspace problem.  It's the result of overcommiting 
> > the system, and is used in the enterprise to address business goals.  If 
> > the above is true, and its up to memcg to prevent global oom in the first 
> > place, then this entire patchset is absolutely pointless.  Limit userspace 
> > to 95% of memory and when usage is approaching that limit, let userspace 
> > attached to the root memcg iterate the hierarchy itself and kill from the 
> > largest consumer.
> > 
> > This patchset exists because overcommit is real, exactly the same as 
> > overcommit within memcg hierarchies is real.  99% of the time we don't run 
> > into global oom because people aren't using their limits so it just works 
> > out.  1% of the time we run into global oom and we need a decision to made 
> > based for forward progress.  Using Michal's earlier example of admins and 
> > students, a student can easily use all of his limit and also, with v10 of 
> > this patchset, 99% of the time avoid being oom killed just by forking N 
> > processes over N cgroups.  It's going to oom kill an admin every single 
> > time.
> 
> We overcommit too, but our workloads organize themselves based on
> managing their resources, not based on evading the OOM killer. I'd
> wager that's true for many if not most users.
> 

No workloads are based on evading the oom killer, we are specifically 
trying to avoid that with oom priorities.  They have the power over 
increasing their own priority to be preferred to kill, but not decreasing 
their oom priority that was set by an activity manager.  This is exactly 
the same as how /proc/pid/oom_score_adj works.  With a cgroup-aware oom 
killer, which we'd love, nothing can possibly evade the oom killer if 
there are oom priorities.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html