On Tue, 7 Jan 2014, Andrew Morton wrote:

> I just spent a happy half hour reliving this thread and ended up
> deciding I agreed with everyone!  It appears that many more emails are
> needed so I think I'll drop
> http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch
> for now.
>
> The claim that
> mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch
> will impact existing userspace seems a bit dubious to me.
>

I'm not sure why this was dropped, since it's vitally needed for any
sane userspace oom handler to be effective.

Without the patch, a userspace oom handler waiting on memory.oom_control
will be triggered when any process with a pending SIGKILL or in the
exit() path simply needs access to memory reserves to make forward
progress.  The kernel oom killer itself is preempted since nothing is
actionable other than giving current access to memory reserves by
setting the TIF_MEMDIE bit.  Userspace does not have the privilege to
set this bit itself, so in such cases there is absolutely nothing
actionable for the userspace oom handler.  The problem is that the
userspace oom handler doesn't know that.

It would be ludicrous to require that a userspace oom handler wait for
some arbitrary amount of time to determine whether a notification is
actionable; what is a sane amount of time to wait?  Should we reliably
expect that multiple oom notifications will be sent over a period of
time if we are in a situation where current doesn't require memory
reserves to make forward progress?  How long should the userspace oom
handler store this state to determine how many times it has woken up?

Userspace oom handling implementations are fragile enough as it is; they
should be made as trivial as possible to ensure they can do what is
needed to make memory available, have the smallest memory footprint
possible, and be as reliable as possible.
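
For concreteness, here is a minimal sketch of the notification path being
discussed, assuming the cgroup v1 memcg interface described in
Documentation/cgroups/memory.txt (an eventfd registered through
cgroup.event_control).  It is written in Python only for brevity; the
function names and the cgroup directory passed in are hypothetical, and
os.eventfd() needs Python 3.10 or later on Linux.  Note that the handler
only ever reads a counter from the eventfd: nothing in it says whether the
kernel is merely about to give current access to memory reserves.

```python
import os
import struct


def registration_line(event_fd: int, control_fd: int, args: str = "") -> str:
    """Build the string written to cgroup.event_control.

    Format per memory.txt: "<event_fd> <fd of control file> [<args>]".
    """
    line = f"{event_fd} {control_fd}"
    return f"{line} {args}" if args else line


def decode_counter(data: bytes) -> int:
    """An eventfd read yields one 8-byte unsigned counter in host byte order."""
    return struct.unpack("=Q", data)[0]


def register_event(cgroup_dir: str, control_file: str, args: str = ""):
    """Register an eventfd on a memcg control file (hypothetical helper).

    Returns (event_fd, control_fd); the caller closes both when done.
    """
    event_fd = os.eventfd(0)  # Python 3.10+, Linux only
    control_fd = -1
    try:
        control_fd = os.open(os.path.join(cgroup_dir, control_file), os.O_RDONLY)
        ec_fd = os.open(os.path.join(cgroup_dir, "cgroup.event_control"), os.O_WRONLY)
        try:
            os.write(ec_fd, registration_line(event_fd, control_fd, args).encode())
        finally:
            os.close(ec_fd)
    except OSError:
        # Do not leak descriptors if the cgroup files are missing or unwritable.
        if control_fd >= 0:
            os.close(control_fd)
        os.close(event_fd)
        raise
    return event_fd, control_fd


def wait_for_event(event_fd: int) -> int:
    """Block until the kernel signals the eventfd; return the event count."""
    return decode_counter(os.read(event_fd, 8))
```

A handler would call register_event(<memcg directory>, "memory.oom_control"),
block in wait_for_event(), and then raise the limit or kill a process.  The
same helper with "memory.pressure_level" and args "critical" registers for
critical vmpressure events instead.
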
Requiring them to determine when a notification is actionable is
troublesome.  Furthermore, Section 10 of Documentation/cgroups/memory.txt
does not imply that any of this checking needs to be done, and it lists
possible actions that a userspace oom handler can take upon being
notified, such as raising a limit or killing a process itself.  This is
what userspace _expects_ to do when notified.

Giving current access to memory reserves so that it may make forward
progress is something only the kernel can do, and it is a part of both
the VM and memcg implementations to allow forward progress to be made.
It is not something userspace is involved in.

Additionally, you're not losing any functionality by merging the patch:
if you really want to know simply when the limit has been reached,
rather than when something is actionable as stated by the memcg
documentation, you can do so with memory thresholds or
VMPRESSURE_CRITICAL.

Google relies on this behavior so that userspace oom handlers can be
implemented to respond to oom conditions and not cause unnecessary oom
killing.  We'd like to know why you refuse to provide such an interface
in a responsible and reliable way.

Thanks.