Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait

David Howells <dhowells@xxxxxxxxxx> · Thu, 07 Mar 2024 10:36:28 +0000

Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:

> On Wed, Mar 06, 2024 at 10:39:37PM +0000, David Howells wrote:
> > Here's a patch to have a go at getting rid of ->launder_folio().  Since it's
> > failable and cannot guarantee that pages in the range are removed, I've tried
> > to replace laundering with just flush-and-wait, dropping the folio lock around
> > the I/O.
> 
> My sense is that ->launder_folio doesn't actually need to be replaced.
> 
> commit e3db7691e9f3dff3289f64e3d98583e28afe03db
> Author: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
> Date:   Wed Jan 10 23:15:39 2007 -0800
> 
>     [PATCH] NFS: Fix race in nfs_release_page()
> 
>         NFS: Fix race in nfs_release_page()
> 
>         invalidate_inode_pages2() may find the dirty bit has been set on a page
>         owing to the fact that the page may still be mapped after it was locked.
>         Only after the call to unmap_mapping_range() are we sure that the page
>         can no longer be dirtied.
>         In order to fix this, NFS has hooked the releasepage() method and tries
>         to write the page out between the call to unmap_mapping_range() and the
>         call to remove_mapping(). This, however leads to deadlocks in the page
>         reclaim code, where the page may be locked without holding a reference
>         to the inode or dentry.
> 
>         Fix is to add a new address_space_operation, launder_page(), which will
>         attempt to write out a dirty page without releasing the page lock.
> 
>     Signed-off-by: Trond Myklebust <Trond.Myklebust@xxxxxxxxxx>
> 
> I don't understand why this couldn't've been solved by page_mkwrite.
> NFS did later add nfs_vm_page_mkwrite in July 2007, and maybe it's just
> not needed any more?  I haven't looked into it enough to make sure,
> but my belief is that we should be able to get rid of it.

Okay, it's slightly more complex than I thought - and I'm not sure all callers
are actually using it correctly.  There are some additional interesting cases
I've found, beyond the pre-/post-DIO case:

 (1) NFS relies on it to retry the write before stripping the pages in the
     case where a writeback error occurs.  I think this can probably be dealt
     with by sticking a filemap_fdatawrite() call before the invalidation.
     I'm not sure if this would incur the deadlock with the page reclaim code
     of which Trond speaks.

 (2) invalidate_inode_pages2() is used in some places to effect invalidation
     of the pagecache in the case where the server tells us that a third party
     modified the server copy of a file.  What the right behaviour should be
     here, I'm not sure, but at the moment, any dirty data will get laundered
     back to the server.  Possibly it should be simply invalidated locally or
     the user asked how they want to handle the divergence.

     Some filesystems use invalidate_remote_inode() instead which seems to
     leave the dirty pages in place locally.

     If it is desirous to save the dirty data, then filemap_fdatawrite()
     could be deployed before invalidating the pages.

 (3) Fuse uses invalidate_inode_pages2() in fuse_do_setattr() to strip all the
     pages from an inode that has had its size changed, laundering any page
     that's still dirty.  This would seem to be excessive, but maybe Miklos
     had a reason for doing it that way.

There are some places that should perhaps be using kiocb_invalidate_pages()
and kiocb_invalidate_post_direct_write() instead - assuming Christoph has no
objection to the latter function being exported.

David