On Sat 24-02-18 17:00:51, Tetsuo Handa wrote:
> >From d922dd170c2bed01a775e8cca0871098aecc253d Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> Date: Sat, 24 Feb 2018 16:49:21 +0900
> Subject: [PATCH v2] mm,page_alloc: wait for oom_lock than back off
>
> This patch fixes a bug which is essentially same with a bug fixed by
> commit 400e22499dd92613 ("mm: don't warn about allocations which stall for
> too long").
>
> Currently __alloc_pages_may_oom() is using mutex_trylock(&oom_lock) based
> on an assumption that the owner of oom_lock is making progress for us. But
> it is possible to trigger OOM lockup when many threads concurrently called
> __alloc_pages_slowpath() because all CPU resources are wasted for pointless
> direct reclaim efforts. That is, schedule_timeout_uninterruptible(1) in
> __alloc_pages_may_oom() does not always give enough CPU resource to the
> owner of the oom_lock.
>
> It is possible that the owner of oom_lock is preempted by other threads.
> Preemption makes the OOM situation much worse. But the page allocator is
> not responsible about wasting CPU resource for something other than memory
> allocation request. Wasting CPU resource for memory allocation request
> without allowing the owner of oom_lock to make forward progress is a page
> allocator's bug.
>
> Therefore, this patch changes to wait for oom_lock in order to guarantee
> that no thread waiting for the owner of oom_lock to make forward progress
> will not consume CPU resources for pointless direct reclaim efforts.
>
> We know printk() from OOM situation where a lot of threads are doing almost
> busy-looping is a nightmare. As a side effect of this patch, printk() with
> oom_lock held can start utilizing CPU resources saved by this patch (and
> reduce preemption during printk(), making printk() complete faster).
>
> By changing !mutex_trylock(&oom_lock) with mutex_lock_killable(&oom_lock),
> it is possible that many threads prevent the OOM reaper from making forward
> progress. Thus, this patch removes mutex_lock(&oom_lock) from the OOM
> reaper.
>
> Also, since nobody uses oom_lock serialization when setting MMF_OOM_SKIP
> and we don't try last second allocation attempt after confirming that there
> is no !MMF_OOM_SKIP OOM victim, the possibility of needlessly selecting
> more OOM victims will be increased if we continue using ALLOC_WMARK_HIGH.
> Thus, this patch changes to use ALLOC_MARK_MIN.
>
> Also, since we don't want to sleep with oom_lock held so that we can allow
> threads waiting at mutex_lock_killable(&oom_lock) to try last second
> allocation attempt (because the OOM reaper starts reclaiming memory without
> waiting for oom_lock) and start selecting next OOM victim if necessary,
> this patch changes the location of the short sleep from inside of oom_lock
> to outside of oom_lock.

This patch does three different things mangled into one, all with a
patch description that talks a lot but doesn't really explain those
changes. Moreover, you are effectively tuning for an artificial test
case that overloads the page allocator, and you add a central lock
where many tasks would block. I have already tried to explain that this
is not a universal win, and you should have a real-life example where
this really helps. While I do agree that removing the oom_lock from
__oom_reap_task_mm is a sensible thing, changing the last allocation
attempt to ALLOC_WMARK_MIN is not all that straightforward and it would
require a much more detailed explanation.
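
Just to spell out what the allocator-side change means in practice,
here is a simplified before/after of the locking in
__alloc_pages_may_oom(), condensed from the diff below rather than the
exact code:

	/* current: back off and let the oom_lock owner make progress */
	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;	/* caller goes back to direct reclaim */
	}

	/* proposed: every allocating task queues up on oom_lock */
	if (mutex_lock_killable(&oom_lock)) {
		*did_some_progress = 1;
		return NULL;	/* only taken when killed while waiting */
	}

With the latter, every task that reaches the OOM path sleeps on a
single global mutex rather than backing off, which is exactly the kind
of central contention point I mean above.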
So the patch in its current form is not mergeable IMHO.

> Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> ---
>  mm/oom_kill.c   | 35 +++--------------------------------
>  mm/page_alloc.c | 30 ++++++++++++++++++------------
>  2 files changed, 21 insertions(+), 44 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8219001..802214f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -491,22 +491,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	struct vm_area_struct *vma;
>  	bool ret = true;
>
> -	/*
> -	 * We have to make sure to not race with the victim exit path
> -	 * and cause premature new oom victim selection:
> -	 * __oom_reap_task_mm		exit_mm
> -	 *   mmget_not_zero
> -	 *				  mmput
> -	 *				    atomic_dec_and_test
> -	 *				  exit_oom_victim
> -	 *				[...]
> -	 *				out_of_memory
> -	 *				  select_bad_process
> -	 *				    # no TIF_MEMDIE task selects new victim
> -	 *  unmap_page_range # frees some memory
> -	 */
> -	mutex_lock(&oom_lock);
> -
>  	if (!down_read_trylock(&mm->mmap_sem)) {
>  		ret = false;
>  		trace_skip_task_reaping(tsk->pid);
> @@ -581,7 +565,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>
>  	trace_finish_task_reaping(tsk->pid);
>  unlock_oom:
> -	mutex_unlock(&oom_lock);
>  	return ret;
>  }
>
> @@ -1078,7 +1061,6 @@ bool out_of_memory(struct oom_control *oc)
>  {
>  	unsigned long freed = 0;
>  	enum oom_constraint constraint = CONSTRAINT_NONE;
> -	bool delay = false; /* if set, delay next allocation attempt */
>
>  	if (oom_killer_disabled)
>  		return false;
> @@ -1128,10 +1110,8 @@ bool out_of_memory(struct oom_control *oc)
>  		return true;
>  	}
>
> -	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
> -		delay = true;
> +	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc))
>  		goto out;
> -	}
>
>  	select_bad_process(oc);
>  	/* Found nothing?!?! Either we hang forever, or we panic. */
> @@ -1139,20 +1119,10 @@ bool out_of_memory(struct oom_control *oc)
>  		dump_header(oc, NULL);
>  		panic("Out of memory and no killable processes...\n");
>  	}
> -	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
> +	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
>  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
>  				 "Memory cgroup out of memory");
> -		delay = true;
> -	}
> -
>  out:
> -	/*
> -	 * Give the killed process a good chance to exit before trying
> -	 * to allocate memory again.
> -	 */
> -	if (delay)
> -		schedule_timeout_killable(1);
> -
>  	return !!oc->chosen_task;
>  }
>
> @@ -1178,4 +1148,5 @@ void pagefault_out_of_memory(void)
>  		return;
>  	out_of_memory(&oc);
>  	mutex_unlock(&oom_lock);
> +	schedule_timeout_killable(1);
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2836bc9..23d1769 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3438,26 +3438,26 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>
>  	*did_some_progress = 0;
>
> -	/*
> -	 * Acquire the oom lock. If that fails, somebody else is
> -	 * making progress for us.
> -	 */
> -	if (!mutex_trylock(&oom_lock)) {
> +	if (mutex_lock_killable(&oom_lock)) {
>  		*did_some_progress = 1;
> -		schedule_timeout_uninterruptible(1);
>  		return NULL;
>  	}
>
>  	/*
> -	 * Go through the zonelist yet one more time, keep very high watermark
> -	 * here, this is only to catch a parallel oom killing, we must fail if
> -	 * we're still under heavy pressure. But make sure that this reclaim
> -	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> -	 * allocation which will never fail due to oom_lock already held.
> +	 * This allocation attempt must not depend on __GFP_DIRECT_RECLAIM &&
> +	 * !__GFP_NORETRY allocation which will never fail due to oom_lock
> +	 * already held.
> +	 *
> +	 * Since neither the OOM reaper nor exit_mmap() waits for oom_lock when
> +	 * setting MMF_OOM_SKIP on the OOM victim's mm, we might needlessly
> +	 * select more OOM victims if we use ALLOC_WMARK_HIGH here. But since
> +	 * this allocation attempt does not sleep, we will not fail to invoke
> +	 * the OOM killer even if we choose ALLOC_WMARK_MIN here. Thus, we use
> +	 * ALLOC_WMARK_MIN here.
>  	 */
>  	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
>  				      ~__GFP_DIRECT_RECLAIM, order,
> -				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
> +				      ALLOC_WMARK_MIN | ALLOC_CPUSET, ac);
>  	if (page)
>  		goto out;
>
> @@ -4205,6 +4205,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	/* Retry as long as the OOM killer is making progress */
>  	if (did_some_progress) {
>  		no_progress_loops = 0;
> +		/*
> +		 * This schedule_timeout_*() serves as a guaranteed sleep for
> +		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
> +		 */
> +		if (!tsk_is_oom_victim(current))
> +			schedule_timeout_uninterruptible(1);
>  		goto retry;
>  	}
>
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs