Michal Hocko wrote:
> On Fri 16-06-17 09:54:34, Tetsuo Handa wrote:
> [...]
> > And the patch you proposed is broken.
>
> Thanks for your testing!
>
> > ----------
> > [ 161.846202] Out of memory: Kill process 6331 (a.out) score 999 or sacrifice child
> > [ 161.850327] Killed process 6331 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > [ 161.858503] ------------[ cut here ]------------
> > [ 161.861512] kernel BUG at mm/memory.c:1381!
>
> BUG_ON(addr >= end) suggests our vma has been trimmed. I guess I see
> what is going on here.
>
> __oom_reap_task_mm				exit_mmap
> 						  free_pgtables
> 						  up_write(mm->mmap_sem)
>   down_read_trylock(&mm->mmap_sem)
> 						  remove_vma
>   unmap_page_range
>
> So we need to extend the mmap_sem coverage. See the updated diff (not
> the full proper patch yet).

That diff is still wrong. We need to prevent __oom_reap_task_mm() from
calling unmap_page_range() when __mmput() has already called exit_mmap(),
by setting/checking MMF_OOM_SKIP as shown below.

diff --git a/kernel/fork.c b/kernel/fork.c
index e53770d..5ef715c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -902,6 +902,11 @@ static inline void __mmput(struct mm_struct *mm)
 	exit_aio(mm);
 	ksm_exit(mm);
 	khugepaged_exit(mm); /* must run before exit_mmap */
+	/*
+	 * oom reaper might race with exit_mmap so make sure we won't free
+	 * page tables under its feet
+	 */
+	down_write(&mm->mmap_sem);
 	exit_mmap(mm);
 	mm_put_huge_zero_page(mm);
 	set_mm_exe_file(mm, NULL);
@@ -913,6 +918,7 @@ static inline void __mmput(struct mm_struct *mm)
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	up_write(&mm->mmap_sem);
 	mmdrop(mm);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..98cca19 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -493,12 +493,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
-	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
+	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
 		up_read(&mm->mmap_sem);
 		goto unlock_oom;
 	}
@@ -537,13 +532,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
 unlock_oom:
 	mutex_unlock(&oom_lock);
 	return ret;

> > Please carefully consider the reason why there is VM_BUG_ON() in __mmput(),
> > and clarify in your patch what the possible side effects are of racing
> > uprobe_clear_state()/exit_aio()/ksm_exit()/exit_mmap() etc. with
> > __oom_reap_task_mm()
>
> Yes that definitely needs to be checked. We basically rely on the racing
> part of __mmput to not modify the address space. oom_reaper doesn't
> touch any vma state except that it unmaps pages, which can run in
> parallel. exit_aio->kill_ioctx seemingly does vm_munmap but a) it uses
> the mmap_sem for write and b) it doesn't actually unmap because exit_aio
> does ctx->mmap_size = 0. {ksm,khugepaged}_exit just do some housekeeping
> which does not modify the address space. I hope I will find some more
> time to work on this next week. Additional testing would be highly
> appreciated of course.
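To make the intent of my diff explicit before commenting on the individual
exit functions, here is a condensed sketch of the reaper side after that
change. This is illustration only: oom_reap_mm_sketch() is a made-up name,
and the real __oom_reap_task_mm() additionally takes oom_lock, walks the
vma list calling unmap_page_range() on each reapable vma, and prints the
reclaim statistics shown in the hunk above.

	/* Sketch, assuming the usual mm/oom_kill.c context; not real code. */
	static bool oom_reap_mm_sketch(struct mm_struct *mm)
	{
		if (!down_read_trylock(&mm->mmap_sem))
			return false;	/* __mmput() holds mmap_sem for write; retry later */

		if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
			/*
			 * __mmput() finished exit_mmap() and set the flag before
			 * dropping mmap_sem; the page tables are already gone, so
			 * unmap_page_range() must not be called anymore.
			 */
			up_read(&mm->mmap_sem);
			return true;
		}

		/* ... walk mm->mmap and unmap_page_range() each reapable vma ... */

		up_read(&mm->mmap_sem);
		return true;
	}

Either the trylock fails because exit_mmap() is still running under the
write lock, or MMF_OOM_SKIP is already visible; in both cases the reaper
never touches page tables that free_pgtables() has freed.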
Since the OOM reaper does not reap hugepages, the khugepaged_exit() part
could be safe, but the ksm_exit() part might interfere. If it is
guaranteed to be safe, what would go wrong if we moved
uprobe_clear_state()/exit_aio()/ksm_exit() etc. to just before mmdrop()
(i.e. after setting MMF_OOM_SKIP)?
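To make that question concrete, a rough and untested sketch of what I
mean, written on top of the fork.c hunk above (the kept calls are simply
copied from the current __mmput(); whether each of the moved functions
tolerates running after exit_mmap() is exactly the open question):

	static inline void __mmput(struct mm_struct *mm)
	{
		VM_BUG_ON(atomic_read(&mm->mm_users));

		khugepaged_exit(mm); /* must still run before exit_mmap */
		/* as in the diff above: keep the oom reaper away from the page tables */
		down_write(&mm->mmap_sem);
		exit_mmap(mm);
		mm_put_huge_zero_page(mm);
		set_mm_exe_file(mm, NULL);
		if (!list_empty(&mm->mmlist)) {
			spin_lock(&mmlist_lock);
			list_del(&mm->mmlist);
			spin_unlock(&mmlist_lock);
		}
		if (mm->binfmt)
			module_put(mm->binfmt->module);
		set_bit(MMF_OOM_SKIP, &mm->flags);
		up_write(&mm->mmap_sem);
		/*
		 * Moved here, after MMF_OOM_SKIP is set: the oom reaper no
		 * longer looks at this mm, so these teardowns cannot race
		 * with __oom_reap_task_mm() anymore.
		 */
		uprobe_clear_state(mm);
		exit_aio(mm);
		ksm_exit(mm);
		mmdrop(mm);
	}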