On Tue 06-01-15 15:27:27, Greg Thelen wrote:
> On Tue, Jan 06 2015, Michal Hocko wrote:
>
> > - As it turned out recently GFP_KERNEL mimicking GFP_NOFAIL for !costly
> >   allocation is sometimes kicking us back because we are basically
> >   creating invisible lock dependencies which might livelock the whole
> >   system under OOM conditions.
> >   That leads to attempts to add more hacks into the OOM killer
> >   which is tricky enough as is. Changing the current state is
> >   quite risky because we do not really know how many places in the
> >   kernel silently depend on this behavior. As per Johannes' attempt
> >   (http://marc.info/?l=linux-mm&m=141932770811346) it is clear that
> >   we are not yet there! I do not have very good ideas how to deal with
> >   this unfortunately...
>
> We've internally been fighting similar deadlocks between memcg kmem
> accounting and memcg oom killer. I wouldn't call it a very good idea,
> because it falls in the realm of further complicating the oom killer,
> but what about introducing an async oom killer which runs outside of the
> context of the current task.

I am not sure I understand you properly. We have something similar for
memcg in upstream. It still runs in the context of the task which has
tripped over the OOM, but it happens down in the page fault path where
no locks are held. This has fixed the similar lock dependency problem in
memcg charges, which can happen on top of any locks, but it is still not
enough, see below.

> An async killer won't hold any locks so it
> won't block the intended oom victim from terminating. After queuing a
> deferred oom kill the allocating thread would then be able to dip into
> memory reserves to satisfy its too-small-to-fail allocation.

What would prevent current from consuming all the memory reserves if the
victim doesn't die early enough (e.g. it isn't scheduled or it spends a
lot of time on an unrelated lock)? Each "current" which blocks the oom
victim would have to get access to the reserves.
There might be really a lot of them...

I think that we shouldn't give anybody but the OOM victim access to the
reserves because there is a good chance that the victim will not use too
much of them (unless there is a bug somewhere where the victim allocates
an unbounded amount of memory without bailing out on
fatal_signal_pending).

I am pretty sure that we can extend lockdep to report when an OOM victim
is going to block on a lock which is held by a task which is allocating
with an almost-never-fail gfp mask (there is already GFP_FS tracking
implemented AFAIR). But that wouldn't solve the problem, because it
would turn into, as Dave pointed out, a "whack a mole" game.

Instead we shouldn't pretend that GFP_KERNEL is basically GFP_NOFAIL.
The question is how to get there without too many regressions IMHO. Or
maybe we should simply bite the bullet, not be cowards, and deal with
bugs as they come. If something really cannot deal with the failure it
should say so with a proper flag.
-- 
Michal Hocko
SUSE Labs