On Tue, 24 Apr 2018, Tetsuo Handa wrote:

> > > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before
> > > exit_mmap() holds mmap_sem for write. Then, at least memory which could
> > > have been reclaimed if exit_mmap() did not hold mmap_sem for write will
> > > be guaranteed to be reclaimed before MMF_OOM_SKIP is set.
> > >
> >
> > I think that's an exceptionally good idea and will mitigate the concerns
> > of others.
> >
> > It can be done without holding mm->mmap_sem in exit_mmap() and uses the
> > same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we
> > don't get dozens of unnecessary oom kills.
> >
> > What do you think about this? It passes preliminary testing on powerpc
> > and I've enqueued it for much more intensive testing. (I wish there were
> > a better way to acknowledge your contribution to fixing this issue,
> > especially since you brought up the exact problem this is addressing in
> > previous emails.)
> >
>
> I don't think this patch is safe, for exit_mmap() is calling
> mmu_notifier_invalidate_range_{start,end}() which might block with oom_lock
> held when oom_reap_task_mm() is waiting for the oom_lock held by exit_mmap().

One of the reasons that I extracted __oom_reap_task_mm() out of the new
oom_reap_task_mm() is to avoid the checks that would be unnecessary when
called from exit_mmap(). In this case, we can ignore the
mm_has_blockable_invalidate_notifiers() check because exit_mmap() has
already done mmu_notifier_release(). So I don't think there's a concern
about __oom_reap_task_mm() blocking while holding oom_lock. Or are you
referring to something else?
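To make the ordering I am relying on explicit, the relevant part of
exit_mmap() with my patch applied looks roughly like this (only a sketch;
the rest of the oom-victim block and of the function is elided):

        /* mm's last user has gone, mmu notifiers are torn down here */
        mmu_notifier_release(mm);

        if (unlikely(mm_is_oom_victim(mm))) {
                mutex_lock(&oom_lock);
                /* no blockable notifier is left to fire at this point */
                __oom_reap_task_mm(mm);
                mutex_unlock(&oom_lock);
                /* MMF_OOM_SKIP handling follows, see the hunks below */
        }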
> exit_mmap() must not block while holding oom_lock in order to guarantee that
> oom_reap_task_mm() can give up.
>
> Some suggestion on top of your patch:
>
>  mm/mmap.c     | 13 +++++--------
>  mm/oom_kill.c | 51 ++++++++++++++++++++++++++-------------------------
>  2 files changed, 31 insertions(+), 33 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 981eed4..7b31357 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3019,21 +3019,18 @@ void exit_mmap(struct mm_struct *mm)
>                  /*
>                   * Manually reap the mm to free as much memory as possible.
>                   * Then, as the oom reaper, set MMF_OOM_SKIP to disregard this
> -                 * mm from further consideration. Taking mm->mmap_sem for write
> -                 * after setting MMF_OOM_SKIP will guarantee that the oom reaper
> -                 * will not run on this mm again after mmap_sem is dropped.
> +                 * mm from further consideration. Setting MMF_OOM_SKIP under
> +                 * oom_lock held will guarantee that the OOM reaper will not
> +                 * run on this mm again.
>                   *
>                   * This needs to be done before calling munlock_vma_pages_all(),
>                   * which clears VM_LOCKED, otherwise the oom reaper cannot
>                   * reliably test it.
>                   */
> -                mutex_lock(&oom_lock);
>                  __oom_reap_task_mm(mm);
> -                mutex_unlock(&oom_lock);
> -
> +                mutex_lock(&oom_lock);
>                  set_bit(MMF_OOM_SKIP, &mm->flags);
> -                down_write(&mm->mmap_sem);
> -                up_write(&mm->mmap_sem);
> +                mutex_unlock(&oom_lock);
>          }
>
>          if (mm->locked_vm) {
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8ba6cb8..9a29df8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -523,21 +523,15 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  {
>          bool ret = true;
>
> +        mutex_lock(&oom_lock);
> +
>          /*
> -         * We have to make sure to not race with the victim exit path
> -         * and cause premature new oom victim selection:
> -         * oom_reap_task_mm                exit_mm
> -         *   mmget_not_zero
> -         *                                   mmput
> -         *                                     atomic_dec_and_test
> -         *                                   exit_oom_victim
> -         *                                 [...]
> -         *                                 out_of_memory
> -         *                                   select_bad_process
> -         *                                     # no TIF_MEMDIE task selects new victim
> -         *  unmap_page_range # frees some memory
> +         * MMF_OOM_SKIP is set by exit_mmap() when the OOM reaper can't
> +         * work on the mm anymore. The check for MMF_OOM_SKIP must run
> +         * under oom_lock held.
>          */
> -        mutex_lock(&oom_lock);
> +        if (test_bit(MMF_OOM_SKIP, &mm->flags))
> +                goto unlock_oom;
>
>          if (!down_read_trylock(&mm->mmap_sem)) {
>                  ret = false;
> @@ -557,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>                  goto unlock_oom;
>          }
>
> -        /*
> -         * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
> -         * work on the mm anymore. The check for MMF_OOM_SKIP must run
> -         * under mmap_sem for reading because it serializes against the
> -         * down_write();up_write() cycle in exit_mmap().
> -         */
> -        if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
> -                up_read(&mm->mmap_sem);
> -                trace_skip_task_reaping(tsk->pid);
> -                goto unlock_oom;
> -        }
> -
>          trace_start_task_reaping(tsk->pid);
>
>          __oom_reap_task_mm(mm);
> @@ -610,8 +592,27 @@ static void oom_reap_task(struct task_struct *tsk)
>          /*
>           * Hide this mm from OOM killer because it has been either reaped or
>           * somebody can't call up_write(mmap_sem).
> +         *
> +         * We have to make sure to not cause premature new oom victim selection:
> +         *
> +         *   __alloc_pages_may_oom()                   oom_reap_task_mm()/exit_mmap()
> +         *     mutex_trylock(&oom_lock)
> +         *     get_page_from_freelist(ALLOC_WMARK_HIGH)  # fails
> +         *                                                 unmap_page_range()  # frees some memory
> +         *                                                 set_bit(MMF_OOM_SKIP)
> +         *     out_of_memory()
> +         *       select_bad_process()
> +         *         test_bit(MMF_OOM_SKIP)  # selects new oom victim
> +         *     mutex_unlock(&oom_lock)
> +         *
> +         * Setting MMF_OOM_SKIP under oom_lock held will guarantee that the
> +         * last second allocation attempt is done by __alloc_pages_may_oom()
> +         * before out_of_memory() selects the next OOM victim by finding
> +         * MMF_OOM_SKIP.
>          */
> +        mutex_lock(&oom_lock);
>          set_bit(MMF_OOM_SKIP, &mm->flags);
> +        mutex_unlock(&oom_lock);
>
>          /* Drop a reference taken by wake_oom_reaper */
>          put_task_struct(tsk);
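If I am reading your two mm/mmap.c hunks correctly, the oom-victim block in
exit_mmap() would then end up looking roughly like this (a sketch with the
comment shortened, assuming the surrounding mm_is_oom_victim() check from my
patch stays as is):

        if (unlikely(mm_is_oom_victim(mm))) {
                /*
                 * Reap the mm before munlock_vma_pages_all() clears
                 * VM_LOCKED, then set MMF_OOM_SKIP under oom_lock so that
                 * the oom reaper will not run on this mm again and the oom
                 * killer cannot select a new victim prematurely.
                 */
                __oom_reap_task_mm(mm);

                mutex_lock(&oom_lock);
                set_bit(MMF_OOM_SKIP, &mm->flags);
                mutex_unlock(&oom_lock);
        }

That is, the reap itself runs outside oom_lock, so exit_mmap() never blocks
while holding oom_lock, and only the MMF_OOM_SKIP transition is serialized
against the last-second allocation attempt in __alloc_pages_may_oom().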