Re: [patch] memcg: add oom killer delay

David Rientjes <rientjes@xxxxxxxxxx> · Wed, 23 Feb 2011 16:51:24 -0800 (PST)

On Wed, 23 Feb 2011, Andrew Morton wrote:

> Your patch still stinks!
> 
> If userspace can't handle a disabled oom-killer then userspace
> shouldn't have disabled the oom-killer.
> 

I agree, but userspace may not always be perfect, especially at large
scale.  We, in kernel land, can easily choose to ignore that, but it's only
a problem because we're providing an interface where the memcg will
livelock without userspace intervention.  The global oom killer doesn't
have this problem, and for years it has even radically panicked the machine
instead of livelocking EVEN THOUGH other threads, those that are
OOM_DISABLE, may be getting work done.

This is a memcg-specific issue because memory.oom_control has opened up
the possibility of a livelock that userspace may have no way of correcting
on its own, especially when it may be oom itself.  The natural conclusion
is that you should never set memory.oom_control unless you can guarantee a
perfect userspace implementation that will never be unresponsive.  At our
scale, we can't make that guarantee, so memory.oom_control is not helpful
at all.

If that's the case, then what else do we have at our disposal, other than
memory.oom_delay_millisecs, that gives userspace a chance to increase a
hard limit or kill a job of lower priority?  The only alternative is
setting memory thresholds and hoping userspace will schedule and respond
before the memcg is completely oom.

> How do we fix this properly?
> 
> A little birdie tells me that the offending userspace oom handler is
> running in a separate memcg and is not itself running out of memory. 

It depends on how you configure your memory controllers, but even if it is
running in a separate memcg, how can you conclude that it isn't oom at the
same time?

> The problem is that the userspace oom handler is also taking peeks into
> processes which are in the stressed memcg and is getting stuck on
> mmap_sem in the procfs reads.  Correct?
> 

That's outside the scope of this feature and is a separate discussion; 
this patch specifically addresses an issue where a userspace job scheduler 
wants to take action when a memcg is oom before deferring to the kernel 
and happens to become unresponsive for whatever reason.
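
To make the three cases concrete, here is a toy model of the behavior
being discussed.  It is only a sketch, not the patch: the function name,
its arguments, and the assumption that a delay of 0 means "no delay" are
all hypothetical, chosen purely for illustration.

```python
from typing import Optional


def oom_outcome(kill_disabled: bool, delay_ms: int,
                response_ms: Optional[int]) -> str:
    """Toy model (hypothetical, not kernel code) of who resolves a memcg oom.

    kill_disabled: the kernel oom killer is disabled via memory.oom_control.
    delay_ms:      value of memory.oom_delay_millisecs (assumed: 0 = no delay).
    response_ms:   time userspace needs to act, or None if it never responds.
    """
    if kill_disabled:
        # The kernel never steps in, so only userspace can fix the memcg.
        return "userspace" if response_ms is not None else "livelock"
    if response_ms is not None and response_ms < delay_ms:
        # Userspace raised the limit or killed a job inside the window.
        return "userspace"
    # Window expired (or there was none): defer to the kernel oom killer.
    return "kernel"


print(oom_outcome(True, 0, None))      # oom killer disabled, handler stuck
print(oom_outcome(False, 5000, 200))   # handler responds within the delay
print(oom_outcome(False, 5000, None))  # handler stuck, kernel takes over
```

The first call is the memory.oom_control case with an unresponsive
handler; the last one is the same stuck handler with a delay configured
instead.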

> It seems to me that such a userspace oom handler is correctly designed,
> and that we should be looking into the reasons why it is unreliable,
> and fixing them.  Please tell us about this?
> 

The problem isn't specific to any one cause or implementation.  We know
that userspace programs have bugs: they can stall forever in D-state, they
can be oom themselves, they can get stuck waiting on a lock, and so on.
