Hello, Michal.

On Thu, Jun 04, 2015 at 11:30:31AM +0200, Michal Hocko wrote:
> > Hmmm? In -mm, if __alloc_pages_may_oom() fails trylock, it never calls
> > out_of_memory().
>
> Sure but the oom_lock might be free already. out_of_memory doesn't wait
> for the victim to finish. It just does schedule_timeout_killable.

That doesn't matter because the detection and TIF_MEMDIE assertion are
atomic w.r.t. oom_lock, and TIF_MEMDIE essentially extends the locking by
preventing further OOM kills. Am I missing something?

> > The main difference here is that the alloc path does the whole thing
> > synchronously and thus the OOM detection and killing can be put in the
> > same critical section, which isn't the case for the memcg OOM handling.
>
> This is true but there is still a time window between the last
> allocation attempt and out_of_memory when the OOM victim might have
> exited and another task would be selected.

Please see above.

> > > This is not the only reason. In-kernel memcg oom handling needs it
> > > as well. See 3812c8c8f395 ("mm: memcg: do not trap chargers with
> > > full callstack on OOM"). In fact it was the in-kernel case which has
> > > triggered this change. We simply cannot wait for oom with the stack and
> > > all the state the charge is called from.
> >
> > Why should this be any different from OOM handling from the page
> > allocator tho?
>
> Yes, the global OOM is prone to deadlock. This has been discussed a lot
> and we still do not have a good answer for that. The primary problem
> is that small allocations do not fail and retry indefinitely, so an OOM
> victim might be blocked on a lock held by a task which is the allocator.
> This is less likely and harder to trigger with standard loads than in a
> memcg environment though.

Deadlocks from infallible allocations getting interlocked are a different
problem. The OOM killer can't really get around those by itself, but
that's not what I'm talking about, and at the same time they're a lot
less likely. I'm talking about an OOM victim trapped in a deadlock
failing to release memory because someone else is waiting for that memory
to be released while blocking the victim. Sure, the two issues are
related, but once you solve things getting blocked on a single OOM
victim, it becomes a lot less of an issue.

> There have been suggestions to add an OOM timeout and ignore the
> previous OOM victim after the timeout expires and select a new
> victim. This sounds attractive but this approach has its own problems
> (http://marc.info/?l=linux-mm&m=141686814824684&w=2).

Here are the issues the message lists:

(1) you can needlessly panic the machine because no other processes are
    eligible for oom kill after declaring that the first oom kill victim
    cannot make progress,

This is extremely unlikely unless most processes in the system are
involved in the same deadlock. All processes have SIGKILL pending but
nobody can exit? In such cases, panic prolly isn't such a bad idea. I
mean, where would you go from there?

(2) it can lead to unnecessary oom killing if the oom kill victim can
    exit but hasn't been scheduled or is in the process of exiting,

It's a matter of having a reasonable timeout. OOM killing isn't an exact
operation to begin with, and if an OOM victim fails to release memory in,
say, 10s or whatever, finding another target is the right thing to do.
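Roughly the kind of thing I have in mind - purely a sketch, the names
below (OOM_VICTIM_TIMEOUT, oom_victim_since, oom_victim_timed_out(),
oom_defer_to_victim()) don't exist anywhere and are made up for
illustration:

  /*
   * Hypothetical sketch - none of these names exist in the tree.
   * Remember when the current victim was given TIF_MEMDIE and stop
   * deferring to it once the deadline passes, so the next OOM scan can
   * pick another target instead of waiting on it forever.
   */
  #define OOM_VICTIM_TIMEOUT      (10 * HZ)       /* "10s or whatever" */

  static unsigned long oom_victim_since;  /* jiffies when victim was marked */

  static bool oom_victim_timed_out(void)
  {
          return oom_victim_since &&
                 time_after(jiffies, oom_victim_since + OOM_VICTIM_TIMEOUT);
  }

  /* called from the victim-scanning loop for each candidate @task */
  static bool oom_defer_to_victim(struct task_struct *task)
  {
          if (!test_tsk_thread_flag(task, TIF_MEMDIE))
                  return false;
          /* keep deferring only while the victim is within its deadline */
          return !oom_victim_timed_out();
  }

The exact deadline doesn't matter much; the point is that the scan stops
treating a TIF_MEMDIE task as a reason to abort once that task has
clearly stopped making progress.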
(3) you can easily turn the oom killer into a serial oom killer since
    there's no guarantee the next process that is chosen won't be
    affected by the same problem, and

And how is that worse than deadlocking? The OOM killer is a mechanism to
prevent the system from complete lockup at the cost of essentially
randomly butchering its workload. The nasty userland memcg OOM hack
aside, by the time OOM killing has engaged, the system is already at the
end of its rope.

(4) this doesn't fix the problem if an oom disabled process is wedged
    trying to allocate memory while holding a mutex that others are
    waiting on

*All* others in the system are waiting on this particular OOM-disabled
process and nobody can release any memory? Yeah, panic then.

The arguments in that message aren't really against adding timeouts but
rather for the wholesale removal of OOM killing. That's an awesome goal,
but it's way too far-fetched at the moment.

> I am convinced that a more appropriate solution for this is to not
> pretend that small allocations never fail and start failing them after
> the OOM killer is not able to make any progress (GFP_NOFS allocations
> would be the first candidate and the easiest one to trigger deadlocks
> via i_mutex). Johannes was also suggesting an OOM memory reserve which
> would be used for OOM contexts.

I don't follow why you reached such a conclusion. The arguments don't
really make sense to me. Once you accept that the OOM killer is a
sledgehammer rather than a surgical blade, the direction to take seems
pretty obvious to me, and it *can't* be a precision mechanism - no matter
what, it's killing a random process with SIGKILL.

> Also the OOM killer can be improved to shrink some of the victim's memory
> before killing it (e.g. drop private clean pages and their page tables).

And why would we go to that level of sophistication? Just wait a while
and kill more until it gets unwedged. That will achieve most of the
effects of being a lot more sophisticated with a lot less complexity, and
again those minute differences don't matter here.

> > Gees... I dislike this approach even more. Grabbing the oom lock and
> > doing everything synchronously with timeout will be far simpler and
> > easier to follow.
>
> It might sound easier but it has its own problems...

I'm still failing to see what the problems are. A rough sketch of what I
mean is appended below.

Thanks.

-- 
tejun
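Here's the sketch referred to above - again purely illustrative.
mem_cgroup_out_of_memory() and oom_lock are the existing ones, but
oom_victim_exists() and oom_victim_timed_out() are made-up helpers (the
latter from the sketch earlier in this mail), and this isn't claiming to
be how the charge path actually looks:

  /*
   * Hypothetical sketch - not the actual memcg charge path.  Do the OOM
   * handling synchronously from the charge context: serialize against
   * the global OOM killer via oom_lock, defer to an existing victim only
   * until its deadline expires, and otherwise kill right here instead of
   * punting to the userspace return path.
   */
  static void mem_cgroup_oom_synchronous(struct mem_cgroup *memcg,
                                         gfp_t gfp_mask, int order)
  {
          if (!mutex_trylock(&oom_lock)) {
                  /* somebody else is handling OOM - back off, let the charge retry */
                  schedule_timeout_killable(1);
                  return;
          }

          if (!oom_victim_exists() || oom_victim_timed_out())
                  mem_cgroup_out_of_memory(memcg, gfp_mask, order);

          mutex_unlock(&oom_lock);

          /* give the victim, old or new, a chance to exit before the retry */
          schedule_timeout_killable(1);
  }

Detection, victim selection and the kill all happen in one critical
section, and the timeout is the only extra piece needed to avoid getting
wedged behind a victim which never exits.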