On Tue, 8 Jun 2010, Andrew Morton wrote:

> > It's possible to livelock the page allocator if a thread has mm->mmap_sem
>
> What is the state of this thread?  Trying to allocate memory, I assume.
>

Right, which I agree is a bad scenario to be in but indeed does happen
(and we have a workaround at Google that identifies these particular
cases and kills the holder of the writelock on mm->mmap_sem).  We have
one thread holding a readlock on mm->mmap_sem while trying to allocate
memory so the oom killer becomes a no-op to prevent needless task
killing while waiting for the killed task to exit, but that killed task
can't exit because it requires a writelock on the same semaphore.

> > and fails to make forward progress because the oom killer selects another
> > thread sharing the same ->mm to kill that cannot exit until the semaphore
> > is dropped.
> >
> > The oom killer will not kill multiple tasks at the same time; each oom
> > killed task must exit before another task may be killed.
>
> This sounds like a quite risky design.  The possibility that we'll
> cause other dead/livelocks similar to this one seems pretty high.  It
> applies to all sleeping locks in the entire kernel, doesn't it?
>

It applies to any writelock that is taken during the exitpath of an oom
killed task if a thread holding a readlock is trying to allocate memory
itself.  This is how it's always been done at least within the past few
years and we haven't had a problem other than with mm->mmap_sem.  At one
point we used an oom killer timeout to kill other tasks after a period
of time had elapsed, but that hasn't been required since we've been
killing the thread holding the writelock on mm->mmap_sem.

> > Thus, if one
> > thread is holding mm->mmap_sem and cannot allocate memory, all threads
> > sharing the same ->mm are blocked from exiting as well.
> > In the oom kill case, that means the thread holding mm->mmap_sem will
> > never free additional memory since it cannot get access to memory
> > reserves and the thread that depends on it with access to memory
> > reserves cannot exit because it cannot acquire the semaphore.  Thus,
> > the page allocator livelocks.
> >
> > When the oom killer is called and current happens to have a pending
> > SIGKILL, this patch automatically gives it access to memory reserves and
> > returns.  Upon returning to the page allocator, its allocation will
> > hopefully succeed so it can quickly exit and free its memory.  If not, the
> > page allocator will fail the allocation if it is not __GFP_NOFAIL.
>
> You said "hopefully".
>

"hopefully" in this case means that the allocation had better succeed or
we've depleted all memory reserves and we're deadlocked; it doesn't mean
that this is a speculative change that may or may not work.

> Does it actually work?  Any real-world testing results?  If so, they'd
> be a useful addition to the changelog.
>

It certainly does, and it prevents needlessly killing another task when
we know current is exiting.  The nice thing about that is that we don't
need to do anything like checking if a child should be sacrificed or if
current is OOM_DISABLE: we already know it's dying so it should simply
get access to memory reserves either to return and handle its pending
SIGKILL or continue down the exitpath.
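
For illustration only, the decision described above can be modelled in
userspace C roughly as follows.  This is a sketch under stated
assumptions, not the patch itself: struct task_model, enum oom_action
and oom_decide() are hypothetical names invented for this example, the
mapping to fatal_signal_pending() and TIF_MEMDIE in the comments is an
assumption about how the kernel tracks this state, and the order of the
two checks is likewise assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the task state the check looks at. */
struct task_model {
	bool sigkill_pending;	/* assumed analogue of fatal_signal_pending() */
	bool has_reserves;	/* assumed analogue of the TIF_MEMDIE flag */
};

/* Possible outcomes of one oom killer invocation in this model. */
enum oom_action {
	OOM_GRANT_RESERVES,	/* current is already dying: let it allocate */
	OOM_WAIT_FOR_VICTIM,	/* earlier victim has not exited yet: no-op */
	OOM_SELECT_VICTIM	/* fall through to normal victim selection */
};

/*
 * Model of the behaviour discussed in the mail: a task that already has
 * a pending SIGKILL gets access to memory reserves and no new victim is
 * chosen.  Otherwise only one victim may be in flight at a time.
 */
static enum oom_action oom_decide(struct task_model *current_task,
				  bool victim_exiting)
{
	assert(current_task != NULL);

	if (current_task->sigkill_pending) {
		current_task->has_reserves = true;
		return OOM_GRANT_RESERVES;
	}
	if (victim_exiting)
		return OOM_WAIT_FOR_VICTIM;
	return OOM_SELECT_VICTIM;
}
```

In this model, the OOM_WAIT_FOR_VICTIM branch is the no-op that leads to
the livelock when the victim needs the writelock on mm->mmap_sem, while
the OOM_GRANT_RESERVES branch is the shortcut for a caller that is
itself already being killed.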