On Mon, 29 Mar 2010, Oleg Nesterov wrote:

> Can't comment, I do not understand these subtleties.
>
> But I'd like to note that fatal_signal_pending() can be true when the
> process wasn't killed, but another thread does exit_group/exec.
>

I'm not sure there's a difference between a process that was oom killed
and received a SIGKILL that way and one where exit_group(2) was used, so
I don't think we need to test for
(p->signal->flags & SIGNAL_GROUP_EXIT) here.

We do need to guarantee that exiting tasks can always get memory, which
is the responsibility of setting TIF_MEMDIE.  The only thing this patch
does is defer calling the oom killer when a task has a pending SIGKILL
and then fail the allocation when it would otherwise repeat.  Rather than
take on the considerable risk of now failing GFP_KERNEL allocations under
PAGE_ALLOC_COSTLY_ORDER, which is typically never done, it may make more
sense to retry the allocation with TIF_MEMDIE on the second iteration: in
essence, automatically selecting current for oom kill, regardless of
other oom killed tasks, if it already has a pending SIGKILL.



oom: give current access to memory reserves if it has been killed

It's possible to livelock the page allocator if a thread holds
mm->mmap_sem and fails to make forward progress because the oom killer
selects another thread sharing the same ->mm to kill, and that thread
cannot exit until the semaphore is dropped.

The oom killer will not kill multiple tasks at the same time; each oom
killed task must exit before another task may be killed.  Thus, if one
thread is holding mm->mmap_sem and cannot allocate memory, all threads
sharing the same ->mm are blocked from exiting as well.  In the oom kill
case, that means the thread holding mm->mmap_sem will never free
additional memory since it cannot get access to memory reserves, and the
thread that depends on it, which does have access to memory reserves,
cannot exit because it cannot acquire the semaphore.  Thus, the page
allocator livelocks.

When the oom killer is called and current happens to have a pending
SIGKILL, this patch automatically selects it for kill so that it has
access to memory reserves and the better timeslice.  Upon returning to
the page allocator, its allocation will hopefully succeed so it can
quickly exit and free its memory.

Cc: Mel Gorman <mel@xxxxxxxxx>
Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>
---
 mm/oom_kill.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -681,6 +681,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	}
 
 	/*
+	 * If current has a pending SIGKILL, then automatically select it.  The
+	 * goal is to allow it to allocate so that it may quickly exit and free
+	 * its memory.
+	 */
+	if (fatal_signal_pending(current)) {
+		__oom_kill_task(current);
+		return;
+	}
+
+	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
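
For anyone who wants to reason about the new early return outside the
kernel, here is a minimal userspace sketch of the decision it adds.  It
is only a model, not kernel code: struct task_model, enum oom_action and
model_out_of_memory() are hypothetical names, and the sketch assumes that
selecting a task for oom kill marks it with TIF_MEMDIE, as described
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the task state the new check looks at. */
struct task_model {
	bool sigkill_pending;	/* models fatal_signal_pending(task) */
	bool memdie;		/* models TIF_MEMDIE (access to memory reserves) */
};

/* Hypothetical result type: what the modelled oom path decided to do. */
enum oom_action {
	OOM_SELECTED_CURRENT,	/* current was selected, no other victim chosen */
	OOM_SCAN_FOR_VICTIM,	/* fall through to the normal victim selection */
};

/*
 * Models only the check added by the patch: a task that already has a
 * pending SIGKILL is selected immediately so that it can allocate from
 * the reserves, exit, and free its memory.
 */
enum oom_action model_out_of_memory(struct task_model *cur)
{
	assert(cur != NULL);

	if (cur->sigkill_pending) {
		cur->memdie = true;	/* assumed effect of selecting current */
		return OOM_SELECTED_CURRENT;
	}

	return OOM_SCAN_FOR_VICTIM;
}
```

In this model, a task with sigkill_pending set comes back with memdie set
and no other victim is considered; a task without a pending SIGKILL is
left untouched and the usual selection runs.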