Re: How to handle TIF_MEMDIE stalls?

Johannes Weiner <hannes@xxxxxxxxxxx> · Thu, 19 Feb 2015 06:01:24 -0500

On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote:
> On Wed 18-02-15 21:48:59, Dave Chinner wrote:
> > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> > > On Wed 18-02-15 09:54:30, Dave Chinner wrote:
> [...]
> > Also, this reads as an excuse for the OOM killer being broken and
> > not fixing it.  Keep in mind that we tell the memory alloc/reclaim
> > subsystem that *we hold locks* when we call into it. That's what
> > GFP_NOFS originally meant, and it's what it still means today in an
> > XFS context.
> 
> Sure, and OOM killer will not be invoked in NOFS context. See
> __alloc_pages_may_oom and __GFP_FS check in there. So I do not see where
> is the OOM killer broken.
> 
> The crucial problem we are dealing with is not GFP_NOFAIL triggering the
> OOM killer but a lock dependency introduced by the following sequence:
> 
> 	taskA			taskB			taskC
> lock(A)							alloc()
> alloc(gfp | __GFP_NOFAIL)	lock(A)			  out_of_memory
> # looping for ever if we				    select_bad_process
> # cannot make any progress				      victim = taskB

You don't even need taskC here.  taskA could invoke the OOM killer
with lock(A) held, and taskB getting selected as the victim while
trying to acquire lock(A).  It'll get the signal and TIF_MEMDIE and
then wait for lock(A) while taskA is waiting for it to exit.

But it doesn't matter who is doing the OOM killing - if the allocating
task with the lock/state is waiting for the OOM victim to free memory,
and the victim is waiting for same the lock/state, we have a deadlock.

> There is no way OOM killer can tell taskB is blocked and that there is
> dependency between A and B (without lockdep). That is why I call NOFAIL
> under a lock as dangerous and a bug.

You keep ignoring that it's also one of the main usecases of this
flag.  The caller has state that it can't unwind and thus needs the
allocation to succeed.  Chances are somebody else can get blocked up
on that same state.  And when that somebody else is the first choice
of the OOM killer, we're screwed.

This is exactly why I'm proposing that the OOM killer should not wait
indefinitely for its first choice to exit, but ultimately move on and
try other tasks.  There is no other way to resolve this deadlock.

Preferrably, we'd get rid of all nofail allocations and replace them
with preallocated reserves.  But this is not going to happen anytime
soon, so what other option do we have than resolving this on the OOM
killer side?

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs