On Wed 25-05-16 19:52:18, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > Just a random thought, but after this patch is applied, do we still need to use
> > > a dedicated kernel thread for OOM-reap operation? If I recall correctly, the
> > > reason we decided to use a dedicated kernel thread was that calling
> > > down_read(&mm->mmap_sem) / mmput() from the OOM killer context is unsafe due to
> > > dependency. By replacing mmput() with mmput_async(), since __oom_reap_task() will
> > > no longer do operations that might block, can't we try OOM-reap operation from
> > > current thread which called mark_oom_victim() or oom_scan_process_thread() ?
> >
> > I was already thinking about that. It is true that the main blocker
> > was the mmput, as you say, but the dedicated kernel thread seems to be
> > more robust locking and stack wise. So I would prefer staying with the
> > current approach until we see that it is somehow limiting. One pid and
> > kernel stack doesn't seem to be a terrible price to me. But as I've said
> > I am not bound to the kernel thread approach...
> >
>
> It seems to me that async OOM reaping widens race window for needlessly
> selecting next OOM victim, for the OOM reaper holding a reference of a
> TIF_MEMDIE thread's mm expedites clearing TIF_MEMDIE from that thread
> by making atomic_dec_and_test() in mmput() from exit_mm() false.

AFAIU you mean

__oom_reap_task                 exit_mm
  atomic_inc_not_zero
                                  tsk->mm = NULL
                                  mmput
                                    atomic_dec_and_test # > 0
                                exit_oom_victim # New victim will be
                                                # selected
                                <OOM killer invoked>
                                  # no TIF_MEMDIE task so we can select a new one
  unmap_page_range # to release the memory

Previously we were kind of protected by the PF_EXITING check in
oom_scan_process_thread, which is not there anymore. The race is possible
even without the oom reaper because many other call sites might pin the
address space and be preempted for an unbounded amount of time.
We could narrow the race window by reintroducing the check or moving
exit_oom_victim later in do_exit after exit_notify which then removes
the task from the task_list (in __unhash_process) so the OOM killer
wouldn't see it anyway. Sounds ugly to me though.

> Maybe we should wait for first OOM reap attempt from the OOM killer context
> before releasing oom_lock mutex (sync OOM reaping) ?

I do not think we want to wait inside the oom_lock as it is a global
lock shared by all OOM killer contexts. Another option would be to use
the oom_lock inside __oom_reap_task. It is not super cool either because
now we have a dependency on the lock, but it looks like a reasonably easy
solution.
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5bb2f7698ad7..d0f42cc88f6a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -450,6 +450,22 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	bool ret = true;
 
 	/*
+	 * We have to make sure to not race with the victim exit path
+	 * and cause premature new oom victim selection:
+	 * __oom_reap_task		exit_mm
+	 *   atomic_inc_not_zero
+	 *				  mmput
+	 *				    atomic_dec_and_test
+	 *				exit_oom_victim
+	 *				[...]
+	 *				out_of_memory
+	 *				  select_bad_process
+	 *				    # no TIF_MEMDIE task select new victim
+	 *  unmap_page_range # frees some memory
+	 */
+	mutex_lock(&oom_lock);
+
+	/*
 	 * Make sure we find the associated mm_struct even when the particular
 	 * thread has already terminated and cleared its mm.
 	 * We might have race with exit path so consider our work done if there
@@ -457,19 +473,19 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 */
 	p = find_lock_task_mm(tsk);
 	if (!p)
-		return true;
+		goto unlock_oom;
 
 	mm = p->mm;
 	if (!atomic_inc_not_zero(&mm->mm_users)) {
 		task_unlock(p);
-		return true;
+		goto unlock_oom;
 	}
 
 	task_unlock(p);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
-		goto out;
+		goto unlock_oom;
 	}
 
 	tlb_gather_mmu(&tlb, mm, 0, -1);
@@ -511,7 +527,8 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * to release its memory.
 	 */
 	set_bit(MMF_OOM_REAPED, &mm->flags);
-out:
+unlock_oom:
+	mutex_unlock(&oom_lock);
 	/*
 	 * Drop our reference but make sure the mmput slow path is called from a
 	 * different context because we shouldn't risk we get stuck there and
-- 
Michal Hocko
SUSE Labs