On Fri 20-04-18 10:23:49, Michal Hocko wrote:
> On Thu 19-04-18 12:34:53, David Rientjes wrote:
[...]
> > I understand the concern, but it's the difference between the victim
> > getting stuck in exit_mmap() and actually taking a long time to free its
> > memory in exit_mmap(). I don't have evidence of the former.
> 
> I do not really want to repeat myself. The primary purpose of the oom
> reaper is to provide a _guarantee_ of forward progress. I do not care
> whether there is any evidence. All I know is that lock_page has a
> plethora of different dependencies and we cannot clearly state this is
> safe, so we _must not_ depend on it when setting MMF_OOM_SKIP.
> 
> The way the oom path was fragile and lockup prone based on optimistic
> assumptions shouldn't be repeated.
> 
> That being said, I haven't heard any actual technical argument about why
> locking the munmap path is a wrong thing to do, while the MMF_OOM_SKIP
> dependency on the page_lock really concerns me, so
> 
> Nacked-by: Michal Hocko <mhocko@xxxxxxxx>
> 
> If you want to keep the current locking protocol then you really have to
> make sure that the oom reaper will set MMF_OOM_SKIP when racing with
> exit_mmap.

So here is my suggestion for the fix.

From 37ab85d619f508ceaf829e57648a3d986c6d8bfc Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@xxxxxxxx>
Date: Fri, 20 Apr 2018 13:53:08 +0200
Subject: [PATCH] mm, oom: fix concurrent munlock and oom reaper unmap

David has noticed that our current assumption that the oom reaper can
safely race with exit_mmap is not correct. Because
munlock_vma_pages_all() depends on clearing VM_LOCKED from vm_flags
before actually doing the munlock to determine if any other vmas are
locking the same memory, the check for VM_LOCKED in the oom reaper is
racy.

This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd
is zapped by the oom reaper during follow_page_mask() after the check
for pmd_none() is bypassed, this ends up dereferencing a NULL ptl.

Fix this by taking the exclusive mmap_sem in exit_mmap() while tearing
down the address space. Once that is done, MMF_OOM_SKIP is set so that
__oom_reap_task_mm() can back off if it finally manages to take the
lock for reading.

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: stable
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
---
 mm/mmap.c     | 36 ++++++++++++++++++------------------
 mm/oom_kill.c |  2 +-
 2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index faf85699f1a1..216efa6d9f61 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3004,10 +3004,21 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	bool locked = false;
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	/*
+	 * The mm is not accessible to anybody except for the oom reaper,
+	 * which cannot race with munlocking, so make sure we exclude the
+	 * two.
+	 */
+	if (unlikely(mm_is_oom_victim(mm))) {
+		down_write(&mm->mmap_sem);
+		locked = true;
+	}
+
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
@@ -3021,7 +3032,7 @@ void exit_mmap(struct mm_struct *mm)
 
 	vma = mm->mmap;
 	if (!vma)	/* Can happen if dup_mmap() received an OOM */
-		return;
+		goto out_unlock;
 
 	lru_add_drain();
 	flush_cache_mm(mm);
@@ -3030,23 +3041,6 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Wait for oom_reap_task() to stop working on this
-		 * mm. Because MMF_OOM_SKIP is already set before
-		 * calling down_read(), oom_reap_task() will not run
-		 * on this "mm" post up_write().
-		 *
-		 * mm_is_oom_victim() cannot be set from under us
-		 * either because victim->mm is already set to NULL
-		 * under task_lock before calling mmput and oom_mm is
-		 * set not NULL by the OOM killer only if victim->mm
-		 * is found not NULL while holding the task_lock.
-		 */
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
@@ -3060,6 +3054,12 @@ void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
+
+out_unlock:
+	if (unlikely(locked)) {
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+		up_write(&mm->mmap_sem);
+	}
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dfd370526909..94d26ef2c3c7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -524,7 +524,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
 	 * under mmap_sem for reading because it serializes against the
-	 * down_write();up_write() cycle in exit_mmap().
+	 * exclusive mmap_sem taken in exit_mmap().
 	 */
 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
 		up_read(&mm->mmap_sem);
-- 
2.16.3

-- 
Michal Hocko
SUSE Labs
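
P.S. For illustration only, not part of the patch above: the locking
protocol the patch establishes between exit_mmap() and the oom reaper
can be sketched in userspace roughly as below, with a pthread rwlock
standing in for mmap_sem and an atomic flag for MMF_OOM_SKIP. All
function and variable names in the sketch are made up for the example.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;
static atomic_bool mmf_oom_skip;	/* stands in for MMF_OOM_SKIP */

static void *exit_mmap_path(void *arg)
{
	(void)arg;
	/* mm_is_oom_victim() is assumed true here: exclude the reaper. */
	pthread_rwlock_wrlock(&mmap_sem);

	/* unmap_vmas(), free_pgtables(), ... run here, exclusively. */
	puts("exit: tearing down the address space under the write lock");

	/* Publish "nothing left to reap" before letting readers back in. */
	atomic_store(&mmf_oom_skip, 1);
	pthread_rwlock_unlock(&mmap_sem);
	return NULL;
}

static void *oom_reaper_path(void *arg)
{
	(void)arg;
	/* The real reaper uses down_read_trylock() and retries later;
	 * blocking here just keeps the example short. */
	pthread_rwlock_rdlock(&mmap_sem);
	if (atomic_load(&mmf_oom_skip))
		puts("reaper: MMF_OOM_SKIP already set, backing off");
	else
		puts("reaper: reaping under the read lock");
	pthread_rwlock_unlock(&mmap_sem);
	return NULL;
}

int main(void)
{
	pthread_t exiter, reaper;

	pthread_create(&exiter, NULL, exit_mmap_path, NULL);
	pthread_create(&reaper, NULL, oom_reaper_path, NULL);
	pthread_join(exiter, NULL);
	pthread_join(reaper, NULL);
	return 0;
}

The ordering is the whole point: the skip flag is only set after the
teardown is complete and before the write lock is dropped, so a reaper
that eventually gets the lock for reading either sees the flag and
backs off, or ran before exit_mmap() took the lock and therefore only
saw a fully populated address space.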