Michal Hocko wrote:
> On Tue 20-02-18 22:32:56, Tetsuo Handa wrote:
> > From c3b6616238fcd65d5a0fdabcb4577c7e6f40d35e Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> > Date: Tue, 20 Feb 2018 11:07:23 +0900
> > Subject: [PATCH] mm,page_alloc: wait for oom_lock rather than back off
> >
> > This patch fixes a bug which is essentially the same as the one fixed by
> > commit 400e22499dd92613 ("mm: don't warn about allocations which stall for
> > too long").
> >
> > Currently __alloc_pages_may_oom() uses mutex_trylock(&oom_lock), based on
> > the assumption that the owner of oom_lock is making progress for us. But it
> > is possible to trigger an OOM lockup when many threads concurrently call
> > __alloc_pages_slowpath(), because all CPU resources are wasted on pointless
> > direct reclaim efforts. That is, schedule_timeout_uninterruptible(1) in
> > __alloc_pages_may_oom() does not always give enough CPU time to the owner
> > of oom_lock.
> >
> > It is also possible that the owner of oom_lock is preempted by other
> > threads. Preemption makes the OOM situation much worse. But the page
> > allocator is not responsible for spending CPU resources on anything other
> > than memory allocation requests. Wasting CPU resources on memory allocation
> > requests without allowing the owner of oom_lock to make forward progress is
> > a page allocator bug.
> >
> > Therefore, this patch changes __alloc_pages_may_oom() to wait for oom_lock,
> > in order to guarantee that no thread waiting for the owner of oom_lock to
> > make forward progress consumes CPU resources on pointless direct reclaim
> > efforts.
>
> So instead we will have many tasks sleeping on the lock, preventing the OOM
> reaper from making any forward progress. This is not a solution without
> further steps. Also I would like to see a real-life workload that would
> benefit from this.

Of course I will propose follow-up patches.
We already discussed that it is safe to use ALLOC_WMARK_MIN for the last-second allocation attempt with oom_lock held, and ALLOC_OOM for an OOM victim's last-second allocation attempt with oom_lock held. We don't need to serialize the whole of __oom_reap_task_mm() using oom_lock; we only need to serialize the setting of MMF_OOM_SKIP using oom_lock. (We would not need oom_lock serialization for setting MMF_OOM_SKIP at all if everyone could agree on doing the last-second allocation attempt with oom_lock held after confirming that there is no !MMF_OOM_SKIP mm, but we could not agree on that.)

Even more, we could try direct OOM reaping rather than schedule_timeout_killable(1) if blocking the OOM reaper kernel thread is a problem, for we should be able to run __oom_reap_task_mm() concurrently; after all, we already allow exit_mmap() and __oom_reap_task_mm() to run concurrently.

We know that printk() in an OOM situation where a lot of threads are almost busy-looping is a nightmare. The fact that printk() with oom_lock held can start utilizing the CPU resources saved by this patch (and suffer less preemption during printk(), so printk() completes faster) is already a benefit.