On Wed, 23 Feb 2011, Andrew Morton wrote:

> Your patch still stinks!
>
> If userspace can't handle a disabled oom-killer then userspace
> shouldn't have disabled the oom-killer.
>

I agree, but userspace may not always be perfect especially on large
scale; we, in kernel land, can easily choose to ignore that but it's only
a problem because we're providing an interface where the memcg will
livelock without userspace intervention.  The global oom killer doesn't
have this problem and for years it has even radically panicked the machine
instead of livelocking EVEN THOUGH other threads, those that are
OOM_DISABLE, may be getting work done.

This is a memcg-specific issue because memory.oom_control has opened the
possibility up to livelock that userspace may have no way of correcting on
its own especially when it may be oom itself.  The natural conclusion is
that you should never set memory.oom_control unless you can guarantee a
perfect userspace implementation that will never be unresponsive.  At our
scale, we can't make that guarantee so memory.oom_control is not helpful
at all.

If that's the case, then what else do we have at our disposal other than
memory.oom_delay_millisecs that allows us to increase a hard limit or kill
a job of lower priority other than setting memory thresholds and hoping
userspace will schedule and respond before the memcg is completely oom?

> How do we fix this properly?
>
> A little birdie tells me that the offending userspace oom handler is
> running in a separate memcg and is not itself running out of memory.

It depends on how you configure your memory controllers, but even if it is
running in a separate memcg how can you make the conclusion it isn't oom
in parallel?

> The problem is that the userspace oom handler is also taking peeks into
> processes which are in the stressed memcg and is getting stuck on
> mmap_sem in the procfs reads.  Correct?
>

That's outside the scope of this feature and is a separate discussion;
this patch specifically addresses an issue where a userspace job scheduler
wants to take action when a memcg is oom before deferring to the kernel
and happens to become unresponsive for whatever reason.

> It seems to me that such a userspace oom handler is correctly designed,
> and that we should be looking into the reasons why it is unreliable,
> and fixing them.  Please tell us about this?
>

The problem isn't specific to any one cause or implementation, we know
that userspace programs have bugs, they can stall forever in D-state, they
can be oom themselves, they get stuck waiting on a lock, etc etc.
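
To make the intended policy concrete, here is a rough sketch of the
decision: give userspace a bounded window to react, then defer to the
kernel oom killer. It is only an illustration written as a userspace
helper, not the patch itself, and it has the same weakness discussed
above: a watchdog like this can stall just like any other userspace
program. The function names and the delay parameter are hypothetical. The
only things assumed from the existing interface are that reading
memory.oom_control yields "oom_kill_disable" and "under_oom" lines and
that writing 0 to it re-enables the kernel oom killer.

```python
def parse_oom_control(text):
    """Parse the 'key value' lines read from a memcg's memory.oom_control."""
    state = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            state[parts[0]] = int(parts[1])
    return state


def should_defer_to_kernel(state, oom_since, now, delay_ms):
    """Return True once the memcg has sat oom, with the kernel oom killer
    disabled, for at least delay_ms milliseconds.

    oom_since and now are timestamps in seconds (hypothetical bookkeeping
    kept by whoever is watching the memcg).
    """
    if not state.get("under_oom") or not state.get("oom_kill_disable"):
        return False
    return (now - oom_since) * 1000.0 >= delay_ms


def reenable_oom_killer(oom_control_path):
    """Write 0 to memory.oom_control so the kernel oom killer runs again."""
    with open(oom_control_path, "w") as f:
        f.write("0\n")
```

A watcher would call parse_oom_control() on the file contents, record the
time at which under_oom first reads 1, and call reenable_oom_killer() once
should_defer_to_kernel() returns True.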