On Sat, 21 Jul 2018, Tetsuo Handa wrote:

> > diff --git a/mm/mmap.c b/mm/mmap.c
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3066,25 +3066,27 @@ void exit_mmap(struct mm_struct *mm)
> >  	if (unlikely(mm_is_oom_victim(mm))) {
> >  		/*
> >  		 * Manually reap the mm to free as much memory as possible.
> > -		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > -		 * this mm from further consideration. Taking mm->mmap_sem for
> > -		 * write after setting MMF_OOM_SKIP will guarantee that the oom
> > -		 * reaper will not run on this mm again after mmap_sem is
> > -		 * dropped.
> > -		 *
> >  		 * Nothing can be holding mm->mmap_sem here and the above call
> >  		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> >  		 * __oom_reap_task_mm() will not block.
> >  		 *
> > +		 * This sets MMF_UNSTABLE to avoid racing with the oom reaper.
> >  		 * This needs to be done before calling munlock_vma_pages_all(),
> >  		 * which clears VM_LOCKED, otherwise the oom reaper cannot
> > -		 * reliably test it.
> > +		 * reliably test for it. If the oom reaper races with
> > +		 * munlock_vma_pages_all(), this can result in a kernel oops if
> > +		 * a pmd is zapped, for example, after follow_page_mask() has
> > +		 * checked pmd_none().
> >  		 */
> >  		mutex_lock(&oom_lock);
> >  		__oom_reap_task_mm(mm);
> >  		mutex_unlock(&oom_lock);
> 
> I don't like holding oom_lock for full teardown of an mm, for an OOM
> victim's mm might have multiple TB memory which could take long time.
> 

This patch does not change how oom_lock is used here; that can certainly be changed on top of this patch. I'm not attempting to address any oom_lock issue here, so it should pose no roadblock for you. I'm proposing this patch now because it fixes millions of processes being oom killed unnecessarily, it was in -mm before being NACKed over trivial fixes that have since been squashed into it, and it is actually tested.

> > 
> > -		set_bit(MMF_OOM_SKIP, &mm->flags);
> > +		/*
> > +		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
> > +		 * guarantee that the oom reaper will not run on this mm again
> > +		 * after mmap_sem is dropped.
> > +		 */
> >  		down_write(&mm->mmap_sem);
> >  		up_write(&mm->mmap_sem);
> >  	}
> > 
> > 
> > -#define MAX_OOM_REAP_RETRIES 10
> >  static void oom_reap_task(struct task_struct *tsk)
> >  {
> > -	int attempts = 0;
> >  	struct mm_struct *mm = tsk->signal->oom_mm;
> > 
> > -	/* Retry the down_read_trylock(mmap_sem) a few times */
> > -	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
> > -		schedule_timeout_idle(HZ/10);
> > +	/*
> > +	 * If this mm has either been fully unmapped, or the oom reaper has
> > +	 * given up on it, nothing left to do except drop the refcount.
> > +	 */
> > +	if (test_bit(MMF_OOM_SKIP, &mm->flags))
> > +		goto drop;
> > 
> > -	if (attempts <= MAX_OOM_REAP_RETRIES ||
> > -	    test_bit(MMF_OOM_SKIP, &mm->flags))
> > -		goto done;
> > +	/*
> > +	 * If this mm has already been reaped, doing so again will not likely
> > +	 * free additional memory.
> > +	 */
> > +	if (!test_bit(MMF_UNSTABLE, &mm->flags))
> > +		oom_reap_task_mm(tsk, mm);
> 
> This is still wrong. If preempted immediately after set_bit(MMF_UNSTABLE,
> &mm->flags) from __oom_reap_task_mm() from exit_mmap(), oom_reap_task()
> can give up before reclaiming any memory.

If a single thread is holding onto the mm, has reached exit_mmap(), and is in the process of starting oom reaping itself, there is no advantage to the oom reaper trying to reap it as well.
The thread in exit_mmap() will take care of it: __oom_reap_task_mm() does not block, and oom_free_timeout_ms allows enough time for that memory freeing to occur. The oom reaper will not set MMF_OOM_SKIP until the timeout has expired. As I said before, you could make a case for extending the timeout once MMF_UNSTABLE has been set. In practice, we haven't encountered a case where that matters, but it is trivial to do if you would prefer.
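
For concreteness, here is a rough sketch of what that extension could look like; it is not part of the patch. MMF_EXPIRE_EXTENDED is a hypothetical flag used only to make the extension one-shot, and oom_free_expire is assumed to be the per-mm jiffies deadline that pairs with oom_free_timeout_ms in this series.

/*
 * Sketch only: push the reaping deadline out once when the oom reaper
 * first observes MMF_UNSTABLE, i.e. exit_mmap() has started reaping
 * the mm itself.  MMF_EXPIRE_EXTENDED and oom_free_expire are
 * illustrative names, not part of the patch.
 */
static void oom_reap_task(struct task_struct *tsk)
{
	struct mm_struct *mm = tsk->signal->oom_mm;

	/* Fully unmapped, or already given up on: drop the refcount */
	if (test_bit(MMF_OOM_SKIP, &mm->flags))
		goto drop;

	if (!test_bit(MMF_UNSTABLE, &mm->flags)) {
		oom_reap_task_mm(tsk, mm);
	} else if (!test_and_set_bit(MMF_EXPIRE_EXTENDED, &mm->flags)) {
		/* exit_mmap() is freeing memory itself; grant a full timeout */
		mm->oom_free_expire = jiffies +
				      msecs_to_jiffies(oom_free_timeout_ms);
	}

	if (time_before(jiffies, mm->oom_free_expire))
		return;	/* still within the timeout, stay queued and retry */

	/* Timeout expired: allow the oom killer to select another victim */
	set_bit(MMF_OOM_SKIP, &mm->flags);
drop:
	/* dequeue tsk and drop the mm and task references, as in the patch */
	;
}

The test_and_set_bit() makes the extension one-shot: a victim that has entered exit_mmap() gets one additional oom_free_timeout_ms, but it cannot defer MMF_OOM_SKIP indefinitely.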