On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote: > On Wed 18-02-15 21:48:59, Dave Chinner wrote: > > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote: > > > On Wed 18-02-15 09:54:30, Dave Chinner wrote: > [...] > > Also, this reads as an excuse for the OOM killer being broken and > > not fixing it. Keep in mind that we tell the memory alloc/reclaim > > subsystem that *we hold locks* when we call into it. That's what > > GFP_NOFS originally meant, and it's what it still means today in an > > XFS context. > > Sure, and OOM killer will not be invoked in NOFS context. See > __alloc_pages_may_oom and __GFP_FS check in there. So I do not see where > is the OOM killer broken. > > The crucial problem we are dealing with is not GFP_NOFAIL triggering the > OOM killer but a lock dependency introduced by the following sequence: > > taskA taskB taskC > lock(A) alloc() > alloc(gfp | __GFP_NOFAIL) lock(A) out_of_memory > # looping for ever if we select_bad_process > # cannot make any progress victim = taskB You don't even need taskC here. taskA could invoke the OOM killer with lock(A) held, and taskB getting selected as the victim while trying to acquire lock(A). It'll get the signal and TIF_MEMDIE and then wait for lock(A) while taskA is waiting for it to exit. But it doesn't matter who is doing the OOM killing - if the allocating task with the lock/state is waiting for the OOM victim to free memory, and the victim is waiting for same the lock/state, we have a deadlock. > There is no way OOM killer can tell taskB is blocked and that there is > dependency between A and B (without lockdep). That is why I call NOFAIL > under a lock as dangerous and a bug. You keep ignoring that it's also one of the main usecases of this flag. The caller has state that it can't unwind and thus needs the allocation to succeed. Chances are somebody else can get blocked up on that same state. And when that somebody else is the first choice of the OOM killer, we're screwed. This is exactly why I'm proposing that the OOM killer should not wait indefinitely for its first choice to exit, but ultimately move on and try other tasks. There is no other way to resolve this deadlock. Preferrably, we'd get rid of all nofail allocations and replace them with preallocated reserves. But this is not going to happen anytime soon, so what other option do we have than resolving this on the OOM killer side? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>