On Fri 05-06-15 04:29:36, Tejun Heo wrote:
> Hello, Michal.
>
> On Thu, Jun 04, 2015 at 11:30:31AM +0200, Michal Hocko wrote:
> > > Hmmm?  In -mm, if __alloc_page_may_oom() fails trylock, it never calls
> > > out_of_memory().
> >
> > Sure but the oom_lock might be free already. out_of_memory doesn't wait
> > for the victim to finish. It just does schedule_timeout_killable.
>
> That doesn't matter because the detection and TIF_MEMDIE assertion are
> atomic w.r.t. oom_lock and TIF_MEMDIE essentially extends the locking
> by preventing further OOM kills.  Am I missing something?

This is true but releasing TIF_MEMDIE is not atomic wrt. the allocation
path. So the oom victim could have released memory and dropped
TIF_MEMDIE but the allocation path hasn't noticed that because it has
already passed
	/*
	 * Go through the zonelist yet one more time, keep very high watermark
	 * here, this is only to catch a parallel oom killing, we must fail if
	 * we're still under heavy pressure.
	 */
	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);

and it goes on to kill another task because there is no TIF_MEMDIE
anymore.

> > > The main difference here is that the alloc path does the whole thing
> > > synchronously and thus the OOM detection and killing can be put in the
> > > same critical section which isn't the case for the memcg OOM handling.
> >
> > This is true but there is still a time window between the last
> > allocation attempt and out_of_memory when the OOM victim might have
> > exited and another task would be selected.
>
> Please see above.
>
> > > > This is not the only reason. In-kernel memcg oom handling needs it
> > > > as well. See 3812c8c8f395 ("mm: memcg: do not trap chargers with
> > > > full callstack on OOM"). In fact it was the in-kernel case which has
> > > > triggered this change. We simply cannot wait for oom with the stack and
> > > > all the state the charge is called from.
> > > >
> > > Why should this be any different from OOM handling from page allocator
> > > tho?
> >
> > Yes the global OOM is prone to deadlock. This has been discussed a lot
> > and we still do not have a good answer for that. The primary problem
> > is that small allocations do not fail and retry indefinitely so an OOM
> > victim might be blocked on a lock held by a task which is the allocator.
> > This is less likely and harder to trigger with standard loads than in
> > memcg environment though.
>
> Deadlocks from infallible allocations getting interlocked are
> different.  OOM killer can't really get around that by itself but I'm
> not talking about those deadlocks but at the same time they're a lot
> less likely.  It's about OOM victim trapped in a deadlock failing to
> release memory because someone else is waiting for that memory to be
> released while blocking the victim.

I thought those would be in the allocator context - which was the
example I've provided. What kind of context do you have in mind?

> Sure, the two issues are related
> but once you solve things getting blocked on single OOM victim, it
> becomes a lot less of an issue.
>
> > There have been suggestions to add an OOM timeout and ignore the
> > previous OOM victim after the timeout expires and select a new
> > victim. This sounds attractive but this approach has its own problems
> > (http://marc.info/?l=linux-mm&m=141686814824684&w=2).
>
> Here are the issues the message lists

Let's focus on discussing those points in reply to Johannes' email.
AFAIU your notes are very much in line with his.
--
Michal Hocko
SUSE Labs
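
For illustration only, the window between the last high-watermark
allocation attempt and out_of_memory() can be modelled in user space.
This is a toy sketch, not kernel code: struct sim, last_chance_alloc(),
victim_exit(), oom_kill() and run() are all hypothetical names, the
page counts are made up, and a single bool stands in for TIF_MEMDIE.
It only replays the two orderings discussed above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy user-space model. None of these names are kernel APIs. */
struct sim {
	int free_pages;      /* pages currently free */
	int high_wmark;      /* last-chance attempt needs at least this many */
	bool victim_memdie;  /* stands in for TIF_MEMDIE on the current victim */
	int kills;           /* OOM kills issued so far */
};

/* Stands in for the final get_page_from_freelist() with ALLOC_WMARK_HIGH. */
static bool last_chance_alloc(const struct sim *s)
{
	return s->free_pages >= s->high_wmark;
}

/* The victim exits: its memory comes back and the flag is dropped. */
static void victim_exit(struct sim *s, int pages)
{
	assert(pages >= 0);
	s->free_pages += pages;
	s->victim_memdie = false;
}

/* Stands in for out_of_memory(): do nothing while a victim is marked. */
static void oom_kill(struct sim *s)
{
	if (s->victim_memdie)
		return;
	s->victim_memdie = true;
	s->kills++;
}

/*
 * Replay the allocator slow path once, starting with one victim already
 * killed and no free memory. The victim exits either before the
 * last-chance attempt or in the window right after it. Returns the total
 * number of kills.
 */
static int run(bool victim_exits_in_window)
{
	struct sim s = {
		.free_pages = 0,
		.high_wmark = 8,
		.victim_memdie = true,
		.kills = 1,
	};

	if (!victim_exits_in_window)
		victim_exit(&s, 64);  /* freed memory is seen by the attempt */

	if (last_chance_alloc(&s))
		return s.kills;       /* allocation succeeds, no further kill */

	if (victim_exits_in_window)
		victim_exit(&s, 64);  /* memory freed, but nobody re-checks */

	oom_kill(&s);                 /* no marked victim left: kills again */
	return s.kills;
}
```

In this model run(false) ends with a single kill, while run(true) ends
with two even though 64 pages are free by the time the second victim is
selected.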