On Tue, Nov 16, 2021 at 1:57 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > oom-reaper and process_mrelease system call should protect against > races with exit_mmap which can destroy page tables while they > walk the VMA tree. oom-reaper protects from that race by setting > MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP > before taking and releasing mmap_write_lock. process_mrelease has > to elevate mm->mm_users to prevent such race. Both oom-reaper and > process_mrelease hold mmap_read_lock when walking the VMA tree. > The locking rules and mechanisms could be simpler if exit_mmap takes > mmap_write_lock while executing destructive operations such as > free_pgtables. > Change exit_mmap to hold the mmap_write_lock when calling > free_pgtables. Operations like unmap_vmas() and unlock_range() are not > destructive and could run under mmap_read_lock but for simplicity we > take one mmap_write_lock during almost the entire operation. Note > also that because oom-reaper checks VM_LOCKED flag, unlock_range() > should not be allowed to race with it. > In most cases this lock should be uncontended. Previously, Kirill > reported ~4% regression caused by a similar change [1]. We reran the > same test and although the individual results are quite noisy, the > percentiles show lower regression with 1.6% being the worst case [2]. > The change allows oom-reaper and process_mrelease to execute safely > under mmap_read_lock without worries that exit_mmap might destroy page > tables from under them. > > [1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@xxxxxxxxxxxxxxxxxx/ > [2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@xxxxxxxxxxxxxx/ Friendly nudge. Michal, Matthew, from our discussion in https://lore.kernel.org/all/YXKhOKIIngIuJaYi@xxxxxxxxxxxxxxxxxxxx I was under the impression this change would be interesting for you. Any feedback? > > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx> > --- > mm/mmap.c | 11 +++++++---- > 1 file changed, 7 insertions(+), 4 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > index bfb0ea164a90..69b3036c6dee 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3142,25 +3142,27 @@ void exit_mmap(struct mm_struct *mm) > * to mmu_notifier_release(mm) ensures mmu notifier callbacks in > * __oom_reap_task_mm() will not block. > * > - * This needs to be done before calling munlock_vma_pages_all(), > + * This needs to be done before calling unlock_range(), > * which clears VM_LOCKED, otherwise the oom reaper cannot > * reliably test it. > */ > (void)__oom_reap_task_mm(mm); > > set_bit(MMF_OOM_SKIP, &mm->flags); > - mmap_write_lock(mm); > - mmap_write_unlock(mm); > } > > + mmap_write_lock(mm); > if (mm->locked_vm) > unlock_range(mm->mmap, ULONG_MAX); > > arch_exit_mmap(mm); > > vma = mm->mmap; > - if (!vma) /* Can happen if dup_mmap() received an OOM */ > + if (!vma) { > + /* Can happen if dup_mmap() received an OOM */ > + mmap_write_unlock(mm); > return; > + } > > lru_add_drain(); > flush_cache_mm(mm); > @@ -3170,6 +3172,7 @@ void exit_mmap(struct mm_struct *mm) > unmap_vmas(&tlb, vma, 0, -1); > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); > tlb_finish_mmu(&tlb); > + mmap_write_unlock(mm); > > /* > * Walk the list again, actually closing and freeing it, > -- > 2.34.0.rc1.387.gb447b232ab-goog >