Re: [PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Sat, 19 Sep 2015 23:33:07 +0900

Michal Hocko wrote:
> This has been posted in various forms many times over past years. I
> still do not think this is a right approach of dealing with the problem.

I do not think "GFP_NOFS can fail" patch is a right approach because
that patch easily causes messages like below.

  Buffer I/O error on dev sda1, logical block 34661831, lost async page write
  XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
  XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  XFS: possible memory allocation deadlock in kmem_zone_alloc (mode:0x8250)

Adding __GFP_NOFAIL will hide these messages but OOM stall remains anyway.

I believe choosing more OOM victims is the only way which can solve OOM stalls.

> You can quickly deplete memory reserves this way without making further
> progress (I am afraid you can even trigger this from userspace without
> having big privileges) so even administrator will have no way to
> intervene.

I think that use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the underlying
cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended for terminating the
OOM victim task as soon as possible, but it turned out that it will not
work if there is invisible lock dependency. Therefore, why not to give up
"there should be only up to 1 TIF_MEMDIE task" rule?

What this patch (and many others posted in various forms many times over
past years) does is to give up "there should be only up to 1 TIF_MEMDIE
task" rule. I think that we need to tolerate more than 1 TIF_MEMDIE tasks
and somehow manage in a way memory reserves will not deplete.

In my proposal which favors all fatal_signal_pending() tasks evenly
( http://lkml.kernel.org/r/201509102318.GHG18789.OHMSLFJOQFOtFV@xxxxxxxxxxxxxxxxxxx )
suggests that the OOM victim task unlikely needs all of memory reserves.
In other words, the OOM victim task can likely make forward progress
if some amount of memory reserves are allowed (compared to normal tasks
waiting for memory).

So, I think that getting rid of "ALLOC_NO_WATERMARKS via TIF_MEMDIE" rule
and replace test_thread_flag(TIF_MEMDIE) with fatal_signal_pending(current)
will handle many cases if fatal_signal_pending() tasks are allowed to access
some amount of memory reserves. And my proposal which chooses next OOM
victim upon timeout will handle the remaining cases without depleting
memory reserves.

If you still want to keep "there should be only up to 1 TIF_MEMDIE task"
rule, what alternative do you have? (I do not like panic_on_oom_timeout
because it is more data-lossy approach than choosing next OOM victim.)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>