On Mon, 7 Mar 2011, Andrew Morton wrote:

> > It could be, if users assign the handler to a different memcg; otherwise,
> > it's guaranteed.
>
> Putting the handler into the same container would be rather daft.
>
> If userspace is going to elect to take over a kernel function then it
> should be able to perform that function reliably.  We don't have hacks
> in the kernel to stop runaway SCHED_FIFO tasks, either.  If the oom
> handler has put itself into a memcg and then has permitted that memcg
> to go oom then userspace is busted.
>

We have a container specifically for daemons like this and have struggled
for years to accurately predict how much memory it needs and what to do
when it is oom.  The problem, in this case, is that when it's oom it's too
late: the memcg is livelocked, so no memory limits on the system have a
chance of getting increased and nothing in oom memcgs is guaranteed to
ever make forward progress again.

That's why I keep bringing up the point that this patch is not a bugfix:
it's an extension of a feature (memory.oom_control) to allow userspace a
period of time to respond to memcgs reaching their hard limit before
killing something.  For our container with vital system daemons, this is
absolutely mandatory if something consumes a large amount of memory and
needs to be restarted; we want the logic in userspace to determine what to
do without killing vital tasks or panicking.  We want to use the oom
killer only as a last resort and that can effectively be done with this
patch and not with memory.oom_control (and I think this is why Kame acked
it).

> My issue with this patch is that it extends the userspace API.  This
> means we're committed to maintaining that interface *and its behaviour*
> for evermore.  But the oom-killer and memcg are both areas of intense
> development and the former has a habit of getting ripped out and
> rewritten.
> Committing ourselves to maintaining an extension to the
> userspace interface is a big thing, especially as that extension is
> somewhat tied to internal implementation details and is most definitely
> tied to short-term inadequacies in userspace and in the kernel
> implementation.
>

The same could have been said for memory.oom_control to disable the oom
killer entirely, which now seems to be solidified as the only way to
influence oom killer behavior from the kernel, and now we're locked into
that limitation because we don't want dual interfaces.

I think this patch would have been received much better prior to
memory.oom_control since it allows for the same behavior with an infinite
timeout.  memory.oom_control is not an option for us since we can't
guarantee that any userspace daemon at our scale will ever be responsive
100% of the time.

I don't think the idea of a userspace grace period when a memcg is oom is
that abstract, though.  I think applications should have the opportunity
to free some of their own memory first when oom instead of abruptly
killing something and restarting it.

So, in the end, we may have to carry this patch internally forever but I
think as memcg becomes more popular we'll have a higher demand for such a
grace period.
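
For readers unfamiliar with the userspace side of this discussion, here is
a minimal sketch in Python of a handler that waits for an oom notification
and then attempts recovery.  It assumes the cgroup v1 interface, where an
eventfd is registered by writing "<event_fd> <fd of memory.oom_control>"
to cgroup.event_control.  The helper names are hypothetical and not part
of the patch, and the grace period itself would be enforced by the kernel
under the proposed extension, not by this code; the sketch only measures
whether the recovery action finished in time.

```python
import os
import select
import time


def registration_line(event_fd, control_fd):
    """Build the string written to cgroup.event_control (cgroup v1) to
    register event_fd for notifications on the file open as control_fd."""
    return f"{event_fd} {control_fd}"


def wait_for_notification(fd, timeout_s):
    """Block until fd is readable or timeout_s elapses (None = forever).
    Returns True if a notification arrived and was consumed."""
    readable, _, _ = select.select([fd], [], [], timeout_s)
    if not readable:
        return False
    os.read(fd, 8)  # an eventfd delivers an 8-byte counter
    return True


def respond_within_grace(fd, grace_s, recover, wait_s=None):
    """Wait for an oom notification, run recover(), and report whether it
    completed within grace_s seconds.  recover is a hypothetical callback
    that would raise a limit or free memory."""
    if not wait_for_notification(fd, wait_s):
        return "no-event"
    start = time.monotonic()
    recover()
    elapsed = time.monotonic() - start
    return "recovered" if elapsed <= grace_s else "too-late"
```

In a real handler, fd would be an eventfd registered against
memory.oom_control of the memcg being watched, and the handler itself
would need to run outside that memcg so that it can still make progress
when the memcg is oom.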