On Wed 26-07-17 18:29:12, Andrea Arcangeli wrote:
> On Wed, Jul 26, 2017 at 07:45:57AM +0200, Michal Hocko wrote:
> > On Tue 25-07-17 21:19:52, Andrea Arcangeli wrote:
> > > On Tue, Jul 25, 2017 at 06:04:00PM +0200, Michal Hocko wrote:
> > > > -	down_write(&mm->mmap_sem);
> > > > +	if (tsk_is_oom_victim(current))
> > > > +		down_write(&mm->mmap_sem);
> > > >  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > >  	tlb_finish_mmu(&tlb, 0, -1);
> > > > 
> > > > @@ -3012,7 +3014,8 @@ void exit_mmap(struct mm_struct *mm)
> > > >  	}
> > > >  	mm->mmap = NULL;
> > > >  	vm_unacct_memory(nr_accounted);
> > > > -	up_write(&mm->mmap_sem);
> > > > +	if (tsk_is_oom_victim(current))
> > > > +		up_write(&mm->mmap_sem);
> > > 
> > > How is this possibly safe? mark_oom_victim can run while exit_mmap is
> > > running.
> > 
> > I believe it cannot. We always call mark_oom_victim (on !current) with
> > task_lock held and check task->mm != NULL and we call do_exit->mmput after
> > mm is set to NULL under the same lock.
> 
> Holding the mmap_sem for writing and setting mm->mmap to NULL to
> filter which tasks already released the mmap_sem for writing post
> free_pgtables still look unnecessary to solve this.
> 
> Using MMF_OOM_SKIP as the flag had the side effect of oom_badness()
> skipping it, but we can use the same tsk_is_oom_victim instead and rely
> on the locking in mark_oom_victim you pointed out above instead of the
> test_and_set_bit of my patch, because current->mm is already NULL at
> that point.
> 
> In light of the above, a remaining race is: because current->mm is NULL
> by the time mmput is called, how can you start oom_reap_task on a
> process with current->mm NULL that called the last mmput and is blocked
> in exit_aio?

Because we have that mm available. See tsk->signal->oom_mm in
oom_reap_task.

> It looks like no false positive can get fixed until this
> is solved first because
> 
> Isn't this enough? If this is enough it avoids other modifications to
> the exit_mmap runtime that look unnecessary: mm->mmap = NULL replaced
> by MMF_OOM_SKIP that has to be set anyway by __mmput later and one
> unnecessary branch to call the up_write.
> [...]
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f19efcf75418..bdab595ce25c 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2993,6 +2993,23 @@ void exit_mmap(struct mm_struct *mm)
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);
>  
> +	set_bit(MMF_OOM_SKIP, &mm->flags);
> +	if (tsk_is_oom_victim(current)) {
> +		/*
> +		 * Wait for oom_reap_task() to stop working on this
> +		 * mm. Because MMF_OOM_SKIP is already set before
> +		 * calling down_read(), oom_reap_task() will not run
> +		 * on this "mm" post up_write().
> +		 *
> +		 * tsk_is_oom_victim() cannot be set from under us
> +		 * either because current->mm is already set to NULL
> +		 * under task_lock before calling mmput and oom_mm is
> +		 * set not NULL by the OOM killer only if current->mm
> +		 * is found not NULL while holding the task_lock.
> +		 */
> +		down_write(&mm->mmap_sem);
> +		up_write(&mm->mmap_sem);
> +	}
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb, 0, -1);

Yes this will work and it won't depend on the oom_lock. But isn't it
just more ugly than simply doing

	if (tsk_is_oom_victim(current)) {
		down_write(&mm->mmap_sem);
		locked = true;
	}
	free_pgtables(...)
	[...]
	if (locked)
		up_write(&mm->mmap_sem);

In general I do not like empty locked sections much, to be honest.
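Both variants rely on tsk_is_oom_victim(current) being stable by the
time exit_mmap runs. To spell that ordering out, here is a stripped-down
userspace model of it (pthread_mutex standing in for task_lock, plain
pointers for tsk->mm and signal->oom_mm; this is obviously not the
kernel code, just the invariant):

/*
 * Toy userspace model of the argument above. exiter() plays the
 * exit_mm()->mmput() path, oom_marker() the mark_oom_victim() caller.
 * Build with: cc -pthread model.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct task {
	pthread_mutex_t alloc_lock;	/* "task_lock" */
	void *mm;			/* "tsk->mm" */
	void *oom_mm;			/* "tsk->signal->oom_mm" */
	bool oom_victim;		/* "tsk_is_oom_victim()" */
};

/* exit_mm(): clear ->mm under the lock, only then drop the mm */
static void *exiter(void *arg)
{
	struct task *tsk = arg;

	pthread_mutex_lock(&tsk->alloc_lock);
	tsk->mm = NULL;
	pthread_mutex_unlock(&tsk->alloc_lock);

	/*
	 * "mmput -> exit_mmap" would run here. Whatever value of
	 * oom_victim we observe now can never change again, because
	 * the marker below refuses to mark once ->mm is NULL.
	 */
	printf("exit_mmap sees oom_victim=%d\n", tsk->oom_victim);
	return NULL;
}

/* OOM killer side: only marks a victim while ->mm is still set */
static void *oom_marker(void *arg)
{
	struct task *tsk = arg;

	pthread_mutex_lock(&tsk->alloc_lock);
	if (tsk->mm) {
		tsk->oom_mm = tsk->mm;	/* "pin" the mm for the reaper */
		tsk->oom_victim = true;
	}
	pthread_mutex_unlock(&tsk->alloc_lock);
	return NULL;
}

int main(void)
{
	static int fake_mm;
	static struct task tsk = {
		.alloc_lock = PTHREAD_MUTEX_INITIALIZER,
		.mm = &fake_mm,
	};
	pthread_t a, b;

	pthread_create(&a, NULL, oom_marker, &tsk);
	pthread_create(&b, NULL, exiter, &tsk);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

Whichever critical section wins, exit_mmap either already sees the
victim mark set or the mark can never appear later, so the conditional
down_write cannot be missed.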
Now with the conditional locking my patch looks as follows. It should be
pretty much equivalent to your patch. Would that be acceptable for you
or do you think there is a strong reason to go with yours?
---
>From 2198654be88d11efb1f372e8579761f65e219206 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@xxxxxxxx>
Date: Thu, 27 Jul 2017 08:48:15 +0200
Subject: [PATCH] mm, oom: allow oom reaper to race with exit_mmap

David has noticed that the oom killer might kill additional tasks while
the exiting oom victim hasn't terminated yet, because the oom_reaper
marks the current victim MMF_OOM_SKIP too early when mm->mm_users has
dropped down to 0. The race is as follows:

oom_reap_task				do_exit
					  exit_mm
  __oom_reap_task_mm
					    mmput
					      __mmput
    mmget_not_zero # fails
					        exit_mmap # frees memory
  set_bit(MMF_OOM_SKIP)

The victim is still visible to the OOM killer until it is unhashed.

Currently we try to reduce the risk of this race by taking oom_lock and
waiting for the out_of_memory sleep while holding the lock, to give the
victim some time to exit. This is quite a suboptimal approach because
there is no guarantee the victim (especially a large one) will manage
to unmap its address space and free enough memory to the particular oom
domain which needs the memory (e.g. a specific NUMA node).

Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
to run in parallel with other unmappers (hence the mmap_sem for read).

The only tricky part is to exclude page table teardown and all
operations which modify the address space in the __mmput path. exit_mmap
doesn't expect any other users so it doesn't use any locking. Nothing
really forbids us from using mmap_sem for write, though. In fact we are
already relying on this lock earlier in the __mmput path to synchronize
with ksm and khugepaged.

Take the exclusive mmap_sem when calling free_pgtables and destroying
vmas to sync with __oom_reap_task_mm which takes the lock for read. All
other operations can safely race with the parallel unmap.

Changes since v1
- bail on null mm->mmap early as per David Rientjes
- take exclusive mmap_sem in exit_mmap only for oom victims to reduce
  the lock overhead

Reported-by: David Rientjes <rientjes@xxxxxxxxxx>
Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
---
 mm/mmap.c     | 16 ++++++++++++++++
 mm/oom_kill.c | 47 ++++++++---------------------------------------
 2 files changed, 24 insertions(+), 39 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 24e9261bdcc0..822e8860b9d2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -44,6 +44,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/moduleparam.h>
 #include <linux/pkeys.h>
+#include <linux/oom.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2967,6 +2968,7 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	bool locked = false;
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
@@ -2993,6 +2995,17 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
+	/*
+	 * oom reaper might race with exit_mmap so make sure we won't free
+	 * page tables or unmap VMAs under its feet
+	 * Please note that mark_oom_victim is always called under task_lock
+	 * with tsk->mm != NULL checked on !current tasks which synchronizes
+	 * with exit_mm and so we cannot race here.
+	 */
+	if (tsk_is_oom_victim(current)) {
+		down_write(&mm->mmap_sem);
+		locked = true;
+	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
@@ -3005,7 +3018,10 @@ void exit_mmap(struct mm_struct *mm)
 		nr_accounted += vma_pages(vma);
 		vma = remove_vma(vma);
 	}
+	mm->mmap = NULL;
 	vm_unacct_memory(nr_accounted);
+	if (locked)
+		up_write(&mm->mmap_sem);
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..b1c96e1910f2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -470,40 +470,15 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	bool ret = true;
-
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		return false;
 	}
 
-	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
-		up_read(&mm->mmap_sem);
-		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
-	}
+	/* There is nothing to reap so bail out without signs in the log */
+	if (!mm->mmap)
+		goto unlock;
 
 	trace_start_task_reaping(tsk->pid);
 
@@ -540,18 +515,12 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
-	up_read(&mm->mmap_sem);
 
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
 	trace_finish_task_reaping(tsk->pid);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
+unlock:
+	up_read(&mm->mmap_sem);
+
+	return true;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
-- 
2.13.2

-- 
Michal Hocko
SUSE Labs
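For reference, this is how the two sides line up once the patch is
applied, condensed by hand from the hunks above (not a standalone
compilable unit, just the interlock between exit_mmap and the reaper):

/* exit_mmap() tail, with the patch applied (condensed) */
	if (tsk_is_oom_victim(current)) {
		/* keep the oom_reaper out while page tables are torn down */
		down_write(&mm->mmap_sem);
		locked = true;
	}
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb, 0, -1);
	...
	mm->mmap = NULL;	/* for a victim this is published under the write lock */
	vm_unacct_memory(nr_accounted);
	if (locked)
		up_write(&mm->mmap_sem);

/* __oom_reap_task_mm() entry, with the patch applied (condensed) */
	if (!down_read_trylock(&mm->mmap_sem)) {
		/* exit_mmap of the victim holds it for write, so back off */
		trace_skip_task_reaping(tsk->pid);
		return false;
	}

	/*
	 * Seeing mm->mmap == NULL under the read lock means exit_mmap has
	 * already passed free_pgtables, so there is nothing left to reap.
	 */
	if (!mm->mmap)
		goto unlock;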