On Tue, Sep 24, 2019 at 03:42:38PM -0400, Johannes Weiner wrote:
> > I'm not a fan of moving file_update_time() to _before_ the
> > balance_dirty_pages call.
>
> Can you elaborate why? If the filesystem has a page_mkwrite op, it
> will have already called file_update_time() before this function is
> entered. If anything, this change makes the sequence more consistent.

Oh, that makes sense. I thought it should be updated after all the data
was written, but it probably doesn't make much difference.

> > Also, this is now the third place that needs
> > maybe_unlock_mmap_for_io, see
> > https://lore.kernel.org/linux-mm/20190917120852.x6x3aypwvh573kfa@box/
>
> Good idea, I moved the helper to internal.h and converted to it.
>
> I left the shmem site alone, though. It doesn't require the file
> pinning, so it shouldn't pointlessly bump the file refcount and
> suggest such a dependency - that could cost somebody later quite a bit
> of time trying to understand the code.

The problem for shmem is this:

	spin_unlock(&inode->i_lock);
	schedule();

	spin_lock(&inode->i_lock);
	finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
	spin_unlock(&inode->i_lock);

While scheduled, the VMA can go away and the inode be reclaimed, making
this a use-after-free. The initial suggestion was an increment on
the inode refcount, but since we already have a pattern which involves
pinning the file, I thought that was a better way to go.
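
To make the lifetime argument concrete, here is a small userspace model of
the pin-before-sleep pattern. It is only a sketch under stated assumptions:
toy_file, toy_get(), toy_put(), toy_unmap_while_asleep() and
toy_fault_sleep() are hypothetical names invented for illustration, not
kernel APIs, and the sleep is simulated by a direct call that drops the
mapping's reference, as an unmap racing with schedule() might.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted file/inode pair; not a kernel type. */
struct toy_file {
	int refcount;
	bool freed;
};

static void toy_get(struct toy_file *f)
{
	assert(!f->freed);
	f->refcount++;
}

static void toy_put(struct toy_file *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0)
		f->freed = true;	/* a real implementation would free here */
}

/*
 * Models what can happen while the faulting task is scheduled out:
 * the VMA goes away and drops the reference it held on the file.
 */
static void toy_unmap_while_asleep(struct toy_file *f)
{
	toy_put(f);
}

/*
 * Models the wait in the fault path. With pin == true the fault path
 * takes its own reference before sleeping. Returns true if the object
 * is still alive at the point where the woken task would touch
 * inode->i_lock again.
 */
static bool toy_fault_sleep(struct toy_file *f, bool pin)
{
	bool alive;

	if (pin)
		toy_get(f);

	toy_unmap_while_asleep(f);	/* stands in for schedule() */

	alive = !f->freed;		/* would we be touching freed memory? */

	if (pin)
		toy_put(f);		/* drop the pin; may be the last reference */
	return alive;
}
```

Starting from a single reference, toy_fault_sleep(f, false) reports the
object as already gone at wakeup, while toy_fault_sleep(f, true) keeps it
alive until the fault path drops its own pin.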

> From: Johannes Weiner <jweiner@xxxxxx>
> Date: Wed, 8 May 2019 13:53:38 -0700
> Subject: [PATCH v2] mm: drop mmap_sem before calling balance_dirty_pages()
> in write fault
>
> One of our services is observing hanging ps/top/etc under heavy write
> IO, and the task states show this is an mmap_sem priority inversion:
>
> A write fault is holding the mmap_sem in read-mode and waiting for
> (heavily cgroup-limited) IO in balance_dirty_pages():
>
> [<0>] balance_dirty_pages+0x724/0x905
> [<0>] balance_dirty_pages_ratelimited+0x254/0x390
> [<0>] fault_dirty_shared_page.isra.96+0x4a/0x90
> [<0>] do_wp_page+0x33e/0x400
> [<0>] __handle_mm_fault+0x6f0/0xfa0
> [<0>] handle_mm_fault+0xe4/0x200
> [<0>] __do_page_fault+0x22b/0x4a0
> [<0>] page_fault+0x45/0x50
> [<0>] 0xffffffffffffffff
>
> Somebody tries to change the address space, contending for the
> mmap_sem in write-mode:
>
> [<0>] call_rwsem_down_write_failed_killable+0x13/0x20
> [<0>] do_mprotect_pkey+0xa8/0x330
> [<0>] SyS_mprotect+0xf/0x20
> [<0>] do_syscall_64+0x5b/0x100
> [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [<0>] 0xffffffffffffffff
>
> The waiting writer locks out all subsequent readers to avoid lock
> starvation, and several threads can be seen hanging like this:
>
> [<0>] call_rwsem_down_read_failed+0x14/0x30
> [<0>] proc_pid_cmdline_read+0xa0/0x480
> [<0>] __vfs_read+0x23/0x140
> [<0>] vfs_read+0x87/0x130
> [<0>] SyS_read+0x42/0x90
> [<0>] do_syscall_64+0x5b/0x100
> [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [<0>] 0xffffffffffffffff
>
> To fix this, do what we do for cache read faults already: drop the
> mmap_sem before calling into anything IO bound, in this case the
> balance_dirty_pages() function, and return VM_FAULT_RETRY.
>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>

Reviewed-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
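
For readers following along, the control flow described in the changelog
can be modelled in userspace roughly as follows. This is a sketch, not the
patch: toy_mm, toy_throttle(), toy_fault_dirty_page() and the TOY_*
constants are hypothetical names, the lock is a plain reader counter, and
gating the unlock on an allow-retry flag is an assumption about how the
retry protocol is usually requested by the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag and return values for this sketch; not the kernel's. */
enum { TOY_ALLOW_RETRY = 1 << 0 };
enum { TOY_FAULT_DONE = 0, TOY_FAULT_RETRY = 1 };

/* Toy address space: counts read-side holders of the "mmap_sem". */
struct toy_mm {
	int readers;
};

static void toy_down_read(struct toy_mm *mm)
{
	mm->readers++;
}

static void toy_up_read(struct toy_mm *mm)
{
	assert(mm->readers > 0);
	mm->readers--;
}

/* Records whether the lock was still held during the simulated IO wait. */
static bool toy_throttled_with_lock_held;

/* Stand-in for the throttling wait in balance_dirty_pages(). */
static void toy_throttle(const struct toy_mm *mm)
{
	toy_throttled_with_lock_held = mm->readers > 0;
}

/*
 * Called with the lock held for read. If throttling is needed and the
 * caller allows a retry, drop the lock first, wait, and ask the caller
 * to retry the fault. Otherwise wait with the lock held, as before.
 */
static int toy_fault_dirty_page(struct toy_mm *mm, unsigned int flags,
				bool must_throttle)
{
	if (!must_throttle)
		return TOY_FAULT_DONE;

	if (flags & TOY_ALLOW_RETRY) {
		toy_up_read(mm);	/* writers and later readers can proceed */
		toy_throttle(mm);
		return TOY_FAULT_RETRY;	/* caller retakes the lock and retries */
	}

	toy_throttle(mm);		/* old behaviour: wait under the lock */
	return TOY_FAULT_DONE;
}
```

In the retry case the wait happens with no read-side holder left, so a
queued writer (the mprotect in the traces above) is no longer stuck behind
the throttled fault.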