Re: [patch 05/18] oom: give current access to memory reserves if it has been killed

David Rientjes <rientjes@xxxxxxxxxx> · Tue, 8 Jun 2010 17:14:42 -0700 (PDT)

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > It's possible to livelock the page allocator if a thread has mm->mmap_sem
> 
> What is the state of this thread?  Trying to allocate memory, I assume.  
> 

Right, which I agree is a bad scenario to be in, but it does happen
(and we have a workaround at Google that identifies these particular cases
and kills the holder of the writelock on mm->mmap_sem).  One thread holds
a readlock on mm->mmap_sem while trying to allocate memory.  The oom
killer has already killed a task, so it becomes a no-op to prevent
needless task killing while waiting for that task to exit, but the killed
task can't exit because it requires a writelock on the same semaphore.
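
To make the dependency explicit, here is a rough user-space sketch of the
wait-for cycle described above.  It is only a model: the actor names and
model_wait_cycle() are made up for illustration and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical actors in the scenario; not kernel identifiers. */
enum model_actor {
	ACTOR_READER,      /* holds mmap_sem for read, stuck allocating */
	ACTOR_OOM_KILLER,  /* no-op until the killed task has exited */
	ACTOR_VICTIM,      /* killed task, needs mmap_sem for write to exit */
	ACTOR_COUNT
};

/*
 * waits_on[i] is the index of the actor that actor i is waiting on, or -1
 * if it can make progress on its own.  Returns true if following the chain
 * from 'start' leads back to 'start', i.e. nobody in the chain can proceed.
 */
bool model_wait_cycle(const int *waits_on, size_t n, size_t start)
{
	size_t cur = start;

	assert(waits_on != NULL && start < n);
	for (size_t step = 0; step < n; step++) {
		int next = waits_on[cur];

		if (next < 0 || (size_t)next >= n)
			return false;	/* chain ends at a runnable actor */
		cur = (size_t)next;
		if (cur == start)
			return true;	/* back where we started: livelock */
	}
	return false;
}
```

With reader -> oom killer -> victim -> reader the function reports a cycle.
If the reader no longer has to wait on the oom killer (entry set to -1),
the chain is broken.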

> > and fails to make forward progress because the oom killer selects another
> > thread sharing the same ->mm to kill that cannot exit until the semaphore
> > is dropped.
> > 
> > The oom killer will not kill multiple tasks at the same time; each oom
> > killed task must exit before another task may be killed.
> 
> This sounds like a quite risky design.  The possibility that we'll
> cause other dead/livelocks similar to this one seems pretty high.  It
> applies to all sleeping locks in the entire kernel, doesn't it?
> 

It applies to any writelock that is taken during the exitpath of an oom 
killed task if a thread holding a readlock is trying to allocate memory 
itself.  This is how it's always been done at least within the past few 
years and we haven't had a problem other than with mm->mmap_sem.  At one 
point we used an oom killer timeout to kill other tasks after a period of 
time had elapsed, but that hasn't been required since we've been killing 
the thread holding the writelock on mm->mmap_sem.

> >  Thus, if one
> > thread is holding mm->mmap_sem and cannot allocate memory, all threads
> > sharing the same ->mm are blocked from exiting as well.  In the oom kill
> > case, that means the thread holding mm->mmap_sem will never free
> > additional memory since it cannot get access to memory reserves and the
> > thread that depends on it with access to memory reserves cannot exit
> > because it cannot acquire the semaphore.  Thus, the page allocator
> > livelocks.
> > 
> > When the oom killer is called and current happens to have a pending
> > SIGKILL, this patch automatically gives it access to memory reserves and
> > returns.  Upon returning to the page allocator, its allocation will
> > hopefully succeed so it can quickly exit and free its memory.  If not, the
> > page allocator will fail the allocation if it is not __GFP_NOFAIL.
> 
> You said "hopefully".
> 

"hopefully" in this case means that the allocation had better succeed;
otherwise we've depleted all memory reserves and we're deadlocked.  It
doesn't mean that this is a speculative change that may or may not work.

> Does it actually work?  Any real-world testing results?  If so, they'd
> be a useful addition to the changelog.
> 

It certainly does, and prevents needlessly killing another task when we know
current is exiting.  The nice thing about that is that we don't need to do
anything like checking if a child should be sacrificed or if current is
OOM_DISABLE: we already know it's dying so it should simply get access to 
memory reserves either to return and handle its pending SIGKILL or 
continue down the exitpath.
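
As a rough illustration of that ordering, here is a small user-space
model.  It is a sketch only: struct model_task, model_select_victim() and
model_out_of_memory() are hypothetical names, the victim selection is a
trivial stand-in, and none of this is the patch itself.  The point is that
the pending-SIGKILL check on current comes before any victim selection or
OOM_DISABLE test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a task; the fields are not kernel fields. */
struct model_task {
	bool sigkill_pending;	/* stands in for a pending SIGKILL */
	bool reserve_access;	/* stands in for access to memory reserves */
	bool oom_disabled;	/* stands in for OOM_DISABLE */
};

enum model_oom_result {
	OOM_CURRENT_GIVEN_RESERVES,	/* current was already dying */
	OOM_VICTIM_KILLED,		/* another task was selected */
	OOM_NO_VICTIM			/* nothing killable was found */
};

/* Pick the first killable task; a placeholder for real victim selection. */
static struct model_task *model_select_victim(struct model_task *tasks,
					      size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!tasks[i].oom_disabled && !tasks[i].sigkill_pending)
			return &tasks[i];
	return NULL;
}

enum model_oom_result model_out_of_memory(struct model_task *current,
					  struct model_task *tasks, size_t n)
{
	assert(current != NULL);

	/* current is already dying: give it reserves and kill nothing else */
	if (current->sigkill_pending) {
		current->reserve_access = true;
		return OOM_CURRENT_GIVEN_RESERVES;
	}

	struct model_task *victim = model_select_victim(tasks, n);

	if (victim == NULL)
		return OOM_NO_VICTIM;
	victim->sigkill_pending = true;
	victim->reserve_access = true;
	return OOM_VICTIM_KILLED;
}
```

In this model a dying current gets reserve access even when oom_disabled
is set, and no other task is touched.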
