Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree

Johannes Weiner <hannes@xxxxxxxxxxx> · Wed, 27 Nov 2013 22:13:13 -0500

On Wed, Nov 27, 2013 at 06:38:31PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
> 
> > > The task that is bypassing the memcg charge to the root memcg may not be 
> > > the process that is chosen by the oom killer, and it's possible the amount 
> > > of memory freed by killing the victim is less than the amount of memory 
> > > bypassed.
> > 
> > That's true, though unlikely.
> > 
> 
> Well, the "goto bypass" allows it and it's trivial to cause by 
> manipulating /proc/pid/oom_score_adj values to prefer processes with very 
> little rss.  It will just continue looping and killing processes as they 
> are forked and never cause the memcg to free memory below its limit.  At 
> least the "goto nomem" allows us to free some memory instead of leaking to 
> the root memcg.

Yes, that's the better way of doing it, I'll send the patch.  Thanks.
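
To make the accounting argument above concrete, here is a toy user-space
model, not kernel code. The function name, the policy strings and all
numbers are made up for illustration; the only thing it assumes is that a
"bypass" accounts the failed charge to the root memcg while a "nomem"
simply fails it.

```python
def run_charges(policy, limit=100, usage=100, charge=10, victim_rss=2, rounds=4):
    """Toy model: return (memcg_usage, root_usage) after `rounds` charge attempts.

    Each round an OOM victim with a small rss is killed, then the charge is
    retried. `policy` is the hypothetical fallback when the retry still does
    not fit: "bypass" accounts the charge to root, "nomem" drops it.
    """
    root = 0
    for _ in range(rounds):
        # Killing the victim frees only its (small) rss from the memcg.
        usage -= min(victim_rss, usage)
        if usage + charge <= limit:
            usage += charge          # charge fits in the memcg after all
        elif policy == "bypass":
            root += charge           # charge leaks to the root memcg
        # policy == "nomem": the charge fails and nothing is accounted
    return usage, root


if __name__ == "__main__":
    print(run_charges("bypass"))
    print(run_charges("nomem"))
```

With these made-up numbers, four kills free 8 pages from the memcg while
the bypass policy has accounted 40 pages to root; the nomem policy
accounts none.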

> > > Were you targeting these to 3.13 instead?  If so, it would have already 
> > > appeared in 3.13-rc1 anyway.  Is it still a work in progress?
> > 
> > I don't know how to answer this question.
> > 
> 
> It appears as though this work is being developed in Linus's tree rather 
> than -mm, so I'm asking if we should consider backing some of it out for 
> 3.14 instead.

The changes fix a deadlock problem.  Are they creating problems that
are worse than deadlocks, that would justify their revert?

> > > Should we be checking mem_cgroup_margin() here to ensure 
> > > task_in_memcg_oom() is still accurate and we haven't raced by freeing 
> > > memory?
> > 
> > We would have invoked the OOM killer long before this point prior to
> > my patches.  There is a line we draw and from that point on we start
> > killing things.  I tried to explain multiple times now that there is
> > no race-free OOM killing and I'm tired of it.  Convince me otherwise
> > or stop repeating this nonsense.
> > 
> 
> In our internal kernel we call mem_cgroup_margin() with the order of the 
> charge immediately prior to sending the SIGKILL to see if it's still 
> needed even after selecting the victim.  It makes the race smaller.
> 
> It's obvious that after the SIGKILL is sent, either from the kernel or 
> from userspace, memory might subsequently be freed or another process 
> might exit before the process killed could even wake up.  There's nothing 
> we can do about that since we don't have psychic abilities.  I think we 
> should try to reduce the chance for unnecessary oom killing as much as 
> possible, however.

Since we can't physically draw a perfect line, we should strive for a
reasonable and intuitive line.  After that it's rapidly diminishing
returns.  Killing something after that much reclaim effort without
success is a completely reasonable and intuitive line to draw.  It's
also the line that was drawn a long time ago and we're not
breaking this because of a micro-optimization.
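
For reference, the recheck David describes (look at the margin for the
order of the charge right before sending the kill) can be sketched as a
toy user-space model. This is not the kernel's mem_cgroup_margin(); the
class and function names below are hypothetical and the counters are
plain page counts.

```python
from dataclasses import dataclass


@dataclass
class ToyMemcg:
    """Hypothetical stand-in for a memcg; counters are in pages."""
    limit: int
    usage: int


def toy_margin(cg: ToyMemcg) -> int:
    """Pages that can still be charged before the limit is hit."""
    return max(cg.limit - cg.usage, 0)


def kill_still_needed(cg: ToyMemcg, order: int) -> bool:
    """Recheck just before the kill: does a 2**order page charge still not fit?"""
    return toy_margin(cg) < (1 << order)


if __name__ == "__main__":
    cg = ToyMemcg(limit=1024, usage=1024)
    print(kill_still_needed(cg, 0))   # memcg is full
    cg.usage -= 8                     # someone freed 8 pages in the meantime
    print(kill_still_needed(cg, 0))   # an order-0 charge now fits
    print(kill_still_needed(cg, 4))   # an order-4 charge (16 pages) does not
```

The check only narrows the window: memory can still be freed between the
recheck and the moment the victim actually exits.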
