Michal Hocko wrote: > On Fri 01-06-18 00:23:57, Tetsuo Handa wrote: > > On 2018/05/31 19:44, Michal Hocko wrote: > > > On Thu 31-05-18 19:10:48, Tetsuo Handa wrote: > > >> On 2018/05/30 8:07, Andrew Morton wrote: > > >>> On Tue, 29 May 2018 09:17:41 +0200 Michal Hocko <mhocko@xxxxxxxxxx> wrote: > > >>> > > >>>>> I suggest applying > > >>>>> this patch first, and then fix "mm, oom: cgroup-aware OOM killer" patch. > > >>>> > > >>>> Well, I hope the whole pile gets merged in the upcoming merge window > > >>>> rather than stall even more. > > >>> > > >>> I'm more inclined to drop it all. David has identified significant > > >>> shortcomings and I'm not seeing a way of addressing those shortcomings > > >>> in a backward-compatible fashion. Therefore there is no way forward > > >>> at present. > > >>> > > >> > > >> Can we apply my patch as-is first? > > > > > > No. As already explained before. Sprinkling new sleeps without a strong > > > reason is not acceptable. The issue you are seeing is pretty artificial > > > and as such doesn're really warrant an immediate fix. We should rather > > > go with a well thought trhough fix. In other words we should simply drop > > > the sleep inside the oom_lock for starter unless it causes some really > > > unexpected behavior change. > > > > > > > The OOM killer did not require schedule_timeout_killable(1) to return > > as long as the OOM victim can call __mmput(). But now the OOM killer > > requires schedule_timeout_killable(1) to return in order to allow the > > OOM victim to call __oom_reap_task_mm(). Thus, this is a regression. > > > > Artificial cannot become the reason to postpone my patch. If we don't care > > artificialness/maliciousness, we won't need to care Spectre/Meltdown bugs. > > > > I'm not sprinkling new sleeps. I'm just merging existing sleeps (i.e. > > mutex_trylock() case and !mutex_trylock() case) and updating the outdated > > comments. > > Sigh. So what exactly is wrong with going simple and do > http://lkml.kernel.org/r/20180528124313.GC27180@xxxxxxxxxxxxxx ? > Because (1) You are trying to apply this fix after Roman's patchset which Andrew Morton is more inclined to drop. (2) You are making this fix difficult to backport because this patch depends on Roman's patchset. (3) You are not fixing the bug in Roman's patchset. (4) You are not updating the outdated comment in my patch and Roman's patchset. (5) You are not evaluating the side effect of not sleeping outside of the OOM path, despite you said > If we _really_ need to touch should_reclaim_retry then > it should be done in a separate patch with some numbers/tracing > data backing that story. and I consider that "whether moving the short sleep to should_reclaim_retry() has negative impact" and "whether eliminating the short sleep has negative impact" should be evaluated in a separate patch. but I will tolerate below patch if you can accept below patch "as-is" (because it explicitly notes what actually happens and there might be unexpected side effect of not sleeping outside of the OOM path). >From 4f8dbcd2a7edde2dddf4b1bd364bb2fa930b0819 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> Date: Fri, 1 Jun 2018 09:35:01 +0900 Subject: [PATCH v2] mm, oom: remove sleep from under oom_lock Tetsuo has pointed out that since commit 27ae357fa82be5ab ("mm, oom: fix concurrent munlock and oom reaper unmap, v3") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock. Therefore the original heuristic to sleep for a short time in out_of_memory() doesn't serve the original purpose. Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful. Hammering the system with a few dozens of processes can block the task holding the oom_lock `forever', and the system gets completely locked-up because neither the OOM victims nor the OOM reaper can make any progress. Drop the short sleep from out_of_memory() when we hold the lock (without ever examining the side effect of not sleeping without the lock). But keep the short sleep when the mutex_trylock() fails in order to throttle the concurrent OOM paths a bit. This should be solved in a more reasonable way (e.g. sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve. This is a quick fixup to remove a stale code. Reported-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> Signed-off-by: Michal Hocko <mhocko@xxxxxxxx> --- mm/oom_kill.c | 58 ++++++++++++++++++++++++---------------------------------- 1 file changed, 24 insertions(+), 34 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 565e7da..7940c53 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -357,7 +357,8 @@ int oom_evaluate_task(struct task_struct *task, void *arg) /* * Simple selection loop. We choose the process with the highest number of - * 'points'. In case scan was aborted, oc->chosen_task is set to -1. + * 'points'. In case scan was aborted, oc->chosen_task is set to + * INFLIGHT_VICTIM. */ static void select_bad_process(struct oom_control *oc) { @@ -499,6 +500,21 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm) static struct task_struct *oom_reaper_list; static DEFINE_SPINLOCK(oom_reaper_lock); +/* + * We have to make sure not to cause premature new oom victim selection. + * + * __alloc_pages_may_oom() oom_reap_task_mm()/exit_mmap() + * mutex_trylock(&oom_lock) + * get_page_from_freelist(ALLOC_WMARK_HIGH) # fails + * unmap_page_range() # frees some memory + * set_bit(MMF_OOM_SKIP) + * out_of_memory() + * select_bad_process() + * test_bit(MMF_OOM_SKIP) # selects new oom victim + * mutex_unlock(&oom_lock) + * + * Therefore, the callers hold oom_lock when calling this function. + */ void __oom_reap_task_mm(struct mm_struct *mm) { struct vm_area_struct *vma; @@ -543,20 +559,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) { bool ret = true; - /* - * We have to make sure to not race with the victim exit path - * and cause premature new oom victim selection: - * oom_reap_task_mm exit_mm - * mmget_not_zero - * mmput - * atomic_dec_and_test - * exit_oom_victim - * [...] - * out_of_memory - * select_bad_process - * # no TIF_MEMDIE task selects new victim - * unmap_page_range # frees some memory - */ mutex_lock(&oom_lock); if (!down_read_trylock(&mm->mmap_sem)) { @@ -1099,7 +1101,6 @@ bool out_of_memory(struct oom_control *oc) { unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; - bool delay = false; /* if set, delay next allocation attempt */ if (oom_killer_disabled) return false; @@ -1149,32 +1150,21 @@ bool out_of_memory(struct oom_control *oc) return true; } - if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) { - delay = true; - goto out; - } + if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) + return true; select_bad_process(oc); /* Found nothing?!?! Either we hang forever, or we panic. */ - if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { + if (!oc->chosen_task) { + if (is_sysrq_oom(oc) || is_memcg_oom(oc)) + return false; dump_header(oc, NULL); panic("Out of memory and no killable processes...\n"); } - if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) { + if (oc->chosen_task != INFLIGHT_VICTIM) oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : "Memory cgroup out of memory"); - delay = true; - } - -out: - /* - * Give the killed process a good chance to exit before trying - * to allocate memory again. - */ - if (delay) - schedule_timeout_killable(1); - - return !!oc->chosen_task; + return true; } /* -- 1.8.3.1