Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

David Rientjes <rientjes@xxxxxxxxxx> · Mon, 21 Sep 2015 16:27:31 -0700 (PDT)

On Fri, 18 Sep 2015, Christoph Lameter wrote:

> Subject: Allow multiple kills from the OOM killer
> 
> The OOM killer currently aborts if it finds a process that already has
> access to the reserve memory pool for exit processing. This is done so that
> the reserves are not overcommitted, but it also means that only one
> process can be oom killed at a time. That process may be stuck
> in D state.
> 
> Signed-off-by: Christoph Lameter <cl@xxxxxxxxx>
> 
> Index: linux/mm/oom_kill.c
> ===================================================================
> --- linux.orig/mm/oom_kill.c	2015-09-18 11:58:52.963946782 -0500
> +++ linux/mm/oom_kill.c	2015-09-18 11:59:42.010684778 -0500
> @@ -264,10 +264,9 @@ enum oom_scan_t oom_scan_process_thread(
>  	 * This task already has access to memory reserves and is being killed.
>  	 * Don't allow any other task to have access to the reserves.
>  	 */
> -	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> -		if (oc->order != -1)
> -			return OOM_SCAN_ABORT;
> -	}
> +	if (test_tsk_thread_flag(task, TIF_MEMDIE))
> +		return OOM_SCAN_CONTINUE;
> +
>  	if (!task->mm)
>  		return OOM_SCAN_CONTINUE;
> 

If this guaranteed that the newly chosen process would exit, it would be 
fine.  Unfortunately, no such guarantee is possible.  If a thread is 
holding a contended mutex that the victim(s) require, none of them can 
exit, and this serial oom killer keeps selecting new victims.  If that 
thread is OOM_DISABLE, it is never selected itself, so the oom killer could 
eventually run out of killable processes and panic the system.
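
To make that failure mode concrete, here is a toy userspace model (not 
kernel code) of a serial oom killer that skips tasks already marked as 
victims and picks the next eligible one.  The names `toy_task`, 
`toy_result` and `toy_run()` are hypothetical and exist only for this 
illustration; the real selection logic in mm/oom_kill.c is considerably 
more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a task; not the kernel's task_struct. */
struct toy_task {
	bool oom_disable;	/* models an OOM_DISABLE task: never selected */
	bool memdie;		/* models TIF_MEMDIE: already chosen as a victim */
	bool needs_lock;	/* victim must take the contended mutex to exit */
};

enum toy_result { TOY_RECOVERED, TOY_PANIC };

/*
 * Repeatedly selects victims the way the patched scan would: tasks that
 * are already marked memdie are skipped instead of aborting the scan.
 * Assumption of the model: the mutex holder is oom_disable and never
 * drops the mutex, so a victim with needs_lock set never exits.
 * *kills counts the victims chosen before the run ends.
 */
static enum toy_result toy_run(struct toy_task *tasks, size_t n, size_t *kills)
{
	assert(tasks != NULL || n == 0);
	assert(kills != NULL);

	*kills = 0;
	for (;;) {
		struct toy_task *victim = NULL;

		for (size_t i = 0; i < n; i++) {
			if (tasks[i].oom_disable || tasks[i].memdie)
				continue;
			victim = &tasks[i];
			break;
		}
		/* No killable task left: the model treats this as a panic. */
		if (!victim)
			return TOY_PANIC;

		victim->memdie = true;
		(*kills)++;
		/* A victim that does not need the mutex exits and frees memory. */
		if (!victim->needs_lock)
			return TOY_RECOVERED;
	}
}
```

With one OOM_DISABLE lock holder and two tasks that both need the lock, 
the model marks both tasks as victims and then ends in TOY_PANIC, which 
is the outcome I am worried about.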

The solution that we have merged internally is described at 
http://marc.info/?l=linux-kernel&m=144010444913702 -- we give access to 
memory reserves to processes that find, in the oom killer, that a victim's 
exit has stalled, so that they may allocate.  It comes with a test module 
that takes a contended mutex and ensures that forward progress is made as 
long as memory reserves are not depleted.  We can't actually guarantee that 
memory reserves won't be depleted, but we (1) hope that nobody is actually 
allocating a lot of memory before dropping a mutex and (2) want to avoid 
the alternative, which is a system livelock.

This will address situations such as

	allocator			oom victim
	---------			----------
	mutex_lock(lock)
	alloc_pages(GFP_KERNEL)
					mutex_lock(lock)
					mutex_unlock(lock)
					handle SIGKILL

Without a solution such as mine, this results in a livelock: the 
GFP_KERNEL allocation stalls forever waiting for the oom victim to exit, 
while the victim cannot exit until it acquires the mutex that the 
allocator holds.  My approach also works if the allocator is OOM_DISABLE.

This won't handle other situations where the victim gets wedged in D state 
and nobody is allocating memory, but the case above is by far the more 
common occurrence that we have dealt with.
