Re: [PATCH] mm: drop mmap_sem before calling balance_dirty_pages() in write fault

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Tue, 24 Sep 2019 13:46:08 -0700

On Tue, Sep 24, 2019 at 03:42:38PM -0400, Johannes Weiner wrote:
> > I'm not a fan of moving file_update_time() to _before_ the
> > balance_dirty_pages call.
> 
> Can you elaborate why? If the filesystem has a page_mkwrite op, it
> will have already called file_update_time() before this function is
> entered. If anything, this change makes the sequence more consistent.

Oh, that makes sense.  I thought it should be updated after all the data
was written, but it probably doesn't make much difference.

> > Also, this is now the third place that needs
> > maybe_unlock_mmap_for_io, see
> > https://lore.kernel.org/linux-mm/20190917120852.x6x3aypwvh573kfa@box/
> 
> Good idea, I moved the helper to internal.h and converted to it.
> 
> I left the shmem site alone, though. It doesn't require the file
> pinning, so it shouldn't pointlessly bump the file refcount and
> suggest such a dependency - that could cost somebody later quite a bit
> of time trying to understand the code.

The problem for shmem is this:

                        spin_unlock(&inode->i_lock);
                        schedule();

                        spin_lock(&inode->i_lock);
                        finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
                        spin_unlock(&inode->i_lock);

While the task is scheduled out, the VMA can go away and the inode can
be reclaimed, so taking inode->i_lock again after schedule() is a
use-after-free.  The initial suggestion was to take a reference on the
inode, but since we already have a pattern which involves pinning the
file, I thought that was a better way to go.
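
To make the lifetime argument concrete, here is a minimal userspace
sketch of the "pin the file, then drop the lock" pattern.  It is a toy
model, not kernel code: every name in it (toy_file, toy_mm,
toy_unlock_for_io, toy_fault, ...) is hypothetical, the "lock" is just
a flag, and "freed" is a flag rather than a real free() so the outcome
can be inspected.

```c
/* Userspace toy model only; all names here are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_file {
	int refcount;	/* object counts as freed when this reaches zero */
	bool freed;	/* set instead of calling free(), so callers can check */
};

struct toy_mm {
	bool read_locked;	/* stands in for mmap_sem held in read mode */
};

static struct toy_file *toy_get_file(struct toy_file *f)
{
	f->refcount++;
	return f;
}

static void toy_put_file(struct toy_file *f)
{
	if (--f->refcount == 0)
		f->freed = true;
}

/* Pin the file first, then drop the lock.  Caller must put the result. */
static struct toy_file *toy_unlock_for_io(struct toy_mm *mm, struct toy_file *f)
{
	struct toy_file *pinned = toy_get_file(f);

	mm->read_locked = false;
	return pinned;
}

/* What can happen while the faulting task sleeps: the mapping goes away
 * and drops the reference it held on the file. */
static void toy_unmap(struct toy_file *f)
{
	toy_put_file(f);
}

/* Returns true if the file was still alive when the fault path touched
 * it again after the sleep window. */
static bool toy_fault(struct toy_mm *mm, struct toy_file *f, bool pin)
{
	struct toy_file *pinned = NULL;
	bool alive;

	if (pin)
		pinned = toy_unlock_for_io(mm, f);
	else
		mm->read_locked = false;

	toy_unmap(f);		/* the sleep window */

	alive = !f->freed;	/* the access made after waking up */
	if (pinned)
		toy_put_file(pinned);
	return alive;
}
```

With pin=true the object survives the sleep window and is released only
by the final put; with pin=false the post-sleep access lands on an
object that is already gone, which is the shape of the bug above.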

> From: Johannes Weiner <jweiner@xxxxxx>
> Date: Wed, 8 May 2019 13:53:38 -0700
> Subject: [PATCH v2] mm: drop mmap_sem before calling balance_dirty_pages()
>  in write fault
> 
> One of our services is observing hanging ps/top/etc under heavy write
> IO, and the task states show this is an mmap_sem priority inversion:
> 
> A write fault is holding the mmap_sem in read-mode and waiting for
> (heavily cgroup-limited) IO in balance_dirty_pages():
> 
> [<0>] balance_dirty_pages+0x724/0x905
> [<0>] balance_dirty_pages_ratelimited+0x254/0x390
> [<0>] fault_dirty_shared_page.isra.96+0x4a/0x90
> [<0>] do_wp_page+0x33e/0x400
> [<0>] __handle_mm_fault+0x6f0/0xfa0
> [<0>] handle_mm_fault+0xe4/0x200
> [<0>] __do_page_fault+0x22b/0x4a0
> [<0>] page_fault+0x45/0x50
> [<0>] 0xffffffffffffffff
> 
> Somebody tries to change the address space, contending for the
> mmap_sem in write-mode:
> 
> [<0>] call_rwsem_down_write_failed_killable+0x13/0x20
> [<0>] do_mprotect_pkey+0xa8/0x330
> [<0>] SyS_mprotect+0xf/0x20
> [<0>] do_syscall_64+0x5b/0x100
> [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [<0>] 0xffffffffffffffff
> 
> The waiting writer locks out all subsequent readers to avoid lock
> starvation, and several threads can be seen hanging like this:
> 
> [<0>] call_rwsem_down_read_failed+0x14/0x30
> [<0>] proc_pid_cmdline_read+0xa0/0x480
> [<0>] __vfs_read+0x23/0x140
> [<0>] vfs_read+0x87/0x130
> [<0>] SyS_read+0x42/0x90
> [<0>] do_syscall_64+0x5b/0x100
> [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [<0>] 0xffffffffffffffff
> 
> To fix this, do what we do for cache read faults already: drop the
> mmap_sem before calling into anything IO bound, in this case the
> balance_dirty_pages() function, and return VM_FAULT_RETRY.
> 
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>

Reviewed-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>